* [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion
@ 2017-08-17 23:01 Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
` (8 more replies)
0 siblings, 9 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
When Alex and I started talking about this topic, this was the direction
I had in mind. The primary difference from Alex's version is that the
interface on the target/cpu/ side uses offsets rather than a faux temp. The
secondary difference is that, at least for smaller vector sizes, I
expand to inline host vector operations. The use of explicit offsets
aids that.
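As a concrete illustration (a sketch built from patches 1 and 2; the
names are those used in target/arm), a front end emits a vector add
by passing env offsets rather than temporaries:

    /* Vd = Vn + Vm with 32-bit lanes over 16 operative bytes; any
       bytes of the full register beyond 16 are cleared. */
    tcg_gen_gvec_add32(vec_full_reg_offset(s, rd),
                       vec_full_reg_offset(s, rn),
                       vec_full_reg_offset(s, rm),
                       16, vec_full_reg_size(s));

Because the operands are plain offsets, the expander is free to load
them into host vector registers and operate inline, or to pass
env-relative pointers to an out-of-line helper.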
There are a number of things that are missing in the host vector support,
including register spill/fill. But in this example conversion we will
never have more than 2 vector registers live at any point, and so we do
not run across those issues.
Some of this infrastructure cannot be exercised by existing front ends;
getting there will require ARM SVE support to be written, or AVX2/AVX512
support to be added within target/i386. ;-)
Unfortunately, the built-in disassembler is too old to handle AVX, so
for testing purposes I disabled it and ran the output assembly through
an external objdump.
For a trivial test case via aarch64-linux-user:
IN:
0x0000000000400078: 4e208400 add v0.16b, v0.16b, v0.16b
0x000000000040007c: 4e648462 add v2.8h, v3.8h, v4.8h
0x0000000000400080: 4ea48462 add v2.4s, v3.4s, v4.4s
0x0000000000400084: 4ee48462 add v2.2d, v3.2d, v4.2d
0x0000000000400088: 0ea28462 add v2.2s, v3.2s, v2.2s
0x000000000040008c: 00000000 unallocated (Unallocated)
OP after optimization and liveness analysis:
ld_i32 tmp0,env,$0xffffffffffffffec dead: 1
movi_i32 tmp1,$0x0
brcond_i32 tmp0,tmp1,lt,$L0 dead: 0 1
---- 0000000000400078 0000000000000000 0000000000000000
ld_v128 tmp2,env,$0x850
add8_v128 tmp2,tmp2,tmp2 dead: 1 2
st_v128 tmp2,env,$0x850 dead: 0
---- 000000000040007c 0000000000000000 0000000000000000
ld_v128 tmp2,env,$0x880
ld_v128 tmp3,env,$0x890
add16_v128 tmp2,tmp2,tmp3 dead: 1 2
st_v128 tmp2,env,$0x870 dead: 0
---- 0000000000400080 0000000000000000 0000000000000000
ld_v128 tmp2,env,$0x880
ld_v128 tmp3,env,$0x890
add32_v128 tmp2,tmp2,tmp3 dead: 1 2
st_v128 tmp2,env,$0x870 dead: 0
---- 0000000000400084 0000000000000000 0000000000000000
ld_v128 tmp2,env,$0x880
ld_v128 tmp3,env,$0x890
add64_v128 tmp2,tmp2,tmp3 dead: 1 2
st_v128 tmp2,env,$0x870 dead: 0
---- 0000000000400088 0000000000000000 0000000000000000
ld_v64 tmp4,env,$0x880
ld_v64 tmp5,env,$0x870
add32_v64 tmp4,tmp4,tmp5 dead: 1 2
st_v64 tmp4,env,$0x870 dead: 0
movi_i64 tmp6,$0x0
st_i64 tmp6,env,$0x878 dead: 0
---- 000000000040008c 0000000000000000 0000000000000000
movi_i64 pc,$0x40008c sync: 0 dead: 0
movi_i32 tmp0,$0x1
movi_i32 tmp1,$0x2000000
movi_i32 tmp7,$0x1
call exception_with_syndrome,$0x0,$0,env,tmp0,tmp1,tmp7 dead: 0 1 2 3
set_label $L0
exit_tb $0x521c86683
OUT: [size=220]
521c86740: 41 8b 6e ec mov -0x14(%r14),%ebp
521c86744: 85 ed test %ebp,%ebp
521c86746: 0f 8c c4 00 00 00 jl 0x521c86810
521c8674c: c4 c1 7a 6f 86 50 08 00 00 vmovdqu 0x850(%r14),%xmm0
521c86755: c4 e1 79 fc c0 vpaddb %xmm0,%xmm0,%xmm0
521c8675a: c4 c1 7a 7f 86 50 08 00 00 vmovdqu %xmm0,0x850(%r14)
521c86763: c4 c1 7a 6f 86 80 08 00 00 vmovdqu 0x880(%r14),%xmm0
521c8676c: c4 c1 7a 6f 8e 90 08 00 00 vmovdqu 0x890(%r14),%xmm1
521c86775: c4 e1 79 fd c1 vpaddw %xmm1,%xmm0,%xmm0
521c8677a: c4 c1 7a 7f 86 70 08 00 00 vmovdqu %xmm0,0x870(%r14)
521c86783: c4 c1 7a 6f 86 80 08 00 00 vmovdqu 0x880(%r14),%xmm0
521c8678c: c4 c1 7a 6f 8e 90 08 00 00 vmovdqu 0x890(%r14),%xmm1
521c86795: c4 e1 79 fe c1 vpaddd %xmm1,%xmm0,%xmm0
521c8679a: c4 c1 7a 7f 86 70 08 00 00 vmovdqu %xmm0,0x870(%r14)
521c867a3: c4 c1 7a 6f 86 80 08 00 00 vmovdqu 0x880(%r14),%xmm0
521c867ac: c4 c1 7a 6f 8e 90 08 00 00 vmovdqu 0x890(%r14),%xmm1
521c867b5: c4 e1 79 d4 c1 vpaddq %xmm1,%xmm0,%xmm0
521c867ba: c4 c1 7a 7f 86 70 08 00 00 vmovdqu %xmm0,0x870(%r14)
521c867c3: c4 c1 7a 7e 86 80 08 00 00 vmovq 0x880(%r14),%xmm0
521c867cc: c4 c1 7a 7e 8e 70 08 00 00 vmovq 0x870(%r14),%xmm1
521c867d5: c4 e1 79 fe c1 vpaddd %xmm1,%xmm0,%xmm0
521c867da: c4 c1 79 d6 86 70 08 00 00 vmovq %xmm0,0x870(%r14)
521c867e3: 49 c7 86 78 08 00 00 movq $0x0,0x878(%r14)
521c867ea: 00 00 00 00
521c867ee: 49 c7 86 40 01 00 00 movq $0x40008c,0x140(%r14)
521c867f5: 8c 00 40 00
521c867f9: 49 8b fe mov %r14,%rdi
521c867fc: be 01 00 00 00 mov $0x1,%esi
521c86801: ba 00 00 00 02 mov $0x2000000,%edx
521c86806: b9 01 00 00 00 mov $0x1,%ecx
521c8680b: e8 90 40 c9 ff callq 0x52191a8a0
521c86810: 48 8d 05 6c fe ff ff lea -0x194(%rip),%rax
521c86817: e9 3c fe ff ff jmpq 0x521c86658
Because I already had some pending fixes to tcg/i386/ wrt VEX encoding,
I've based this on an existing tree. The complete tree can be found at
git://github.com/rth7680/qemu.git native-vector-registers-2
r~
Richard Henderson (8):
tcg: Add generic vector infrastructure and ops for add/sub/logic
target/arm: Use generic vector infrastructure for aa64 add/sub/logic
tcg: Add types for host vectors
tcg: Add operations for host vectors
tcg: Add tcg_op_supported
tcg: Add INDEX_op_invalid
tcg: Expand target vector ops with host vector ops
tcg/i386: Add vector operations
Makefile.target | 5 +-
tcg/i386/tcg-target.h | 46 +++-
tcg/tcg-op-gvec.h | 92 +++++++
tcg/tcg-opc.h | 91 +++++++
tcg/tcg-runtime.h | 16 ++
tcg/tcg.h | 37 ++-
target/arm/translate-a64.c | 137 +++++++----
tcg/i386/tcg-target.inc.c | 382 ++++++++++++++++++++++++++---
tcg/tcg-op-gvec.c | 583 +++++++++++++++++++++++++++++++++++++++++++++
tcg/tcg-runtime-gvec.c | 199 ++++++++++++++++
tcg/tcg.c | 323 ++++++++++++++++++++++++-
11 files changed, 1817 insertions(+), 94 deletions(-)
create mode 100644 tcg/tcg-op-gvec.h
create mode 100644 tcg/tcg-op-gvec.c
create mode 100644 tcg/tcg-runtime-gvec.c
--
2.13.5
* [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-08-30 1:31 ` Philippe Mathieu-Daudé
2017-09-07 16:34 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
` (7 subsequent siblings)
8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
Makefile.target | 5 +-
tcg/tcg-op-gvec.h | 88 ++++++++++
tcg/tcg-runtime.h | 16 ++
tcg/tcg-op-gvec.c | 443 +++++++++++++++++++++++++++++++++++++++++++++++++
tcg/tcg-runtime-gvec.c | 199 ++++++++++++++++++++++
5 files changed, 749 insertions(+), 2 deletions(-)
create mode 100644 tcg/tcg-op-gvec.h
create mode 100644 tcg/tcg-op-gvec.c
create mode 100644 tcg/tcg-runtime-gvec.c
diff --git a/Makefile.target b/Makefile.target
index 7f42c45db8..9ae3e904f7 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -93,8 +93,9 @@ all: $(PROGS) stap
# cpu emulator library
obj-y += exec.o
obj-y += accel/
-obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
-obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/tcg-runtime.o
+obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-common.o tcg/optimize.o
+obj-$(CONFIG_TCG) += tcg/tcg-op.o tcg/tcg-op-gvec.o
+obj-$(CONFIG_TCG) += tcg/tcg-runtime.o tcg/tcg-runtime-gvec.o
obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
obj-y += fpu/softfloat.o
diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
new file mode 100644
index 0000000000..10db3599a5
--- /dev/null
+++ b/tcg/tcg-op-gvec.h
@@ -0,0 +1,88 @@
+/*
+ * Generic vector operation expansion
+ *
+ * Copyright (c) 2017 Linaro
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/*
+ * "Generic" vectors. All operands are given as offsets from ENV,
+ * and therefore cannot also be allocated via tcg_global_mem_new_*.
+ * OPSZ is the byte size of the vector upon which the operation is performed.
+ * CLSZ is the byte size of the full vector; bytes beyond OPSZ are cleared.
+ *
+ * All sizes must be 8 or a multiple of 16.
+ * When OPSZ is 8, the alignment may be 8; otherwise it must be 16.
+ * Operands may completely, but not partially, overlap.
+ */
+
+/* Fundamental operation expanders. These are exposed to the front ends
+ so that target-specific SIMD operations can be handled similarly to
+ the standard SIMD operations. */
+
+typedef struct {
+ /* "Small" sizes: expand inline as a 64-bit or 32-bit lane.
+ Generally only one of these will be non-NULL. */
+ void (*fni8)(TCGv_i64, TCGv_i64, TCGv_i64);
+ void (*fni4)(TCGv_i32, TCGv_i32, TCGv_i32);
+ /* Similarly, but load up a constant and re-use across lanes. */
+ void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
+ uint64_t extra_value;
+ /* Larger sizes: expand out-of-line helper w/size descriptor. */
+ void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
+} GVecGen3;
+
+void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz, const GVecGen3 *);
+
+#define DEF_GVEC_2(X) \
+ void tcg_gen_gvec_##X(uint32_t dofs, uint32_t aofs, uint32_t bofs, \
+ uint32_t opsz, uint32_t clsz)
+
+DEF_GVEC_2(add8);
+DEF_GVEC_2(add16);
+DEF_GVEC_2(add32);
+DEF_GVEC_2(add64);
+
+DEF_GVEC_2(sub8);
+DEF_GVEC_2(sub16);
+DEF_GVEC_2(sub32);
+DEF_GVEC_2(sub64);
+
+DEF_GVEC_2(and8);
+DEF_GVEC_2(or8);
+DEF_GVEC_2(xor8);
+DEF_GVEC_2(andc8);
+DEF_GVEC_2(orc8);
+
+#undef DEF_GVEC_2
+
+/*
+ * 64-bit vector operations. Use these when the register has been
+ * allocated with tcg_global_mem_new_i64. OPSZ = CLSZ = 8.
+ */
+
+#define DEF_VEC8_2(X) \
+ void tcg_gen_vec8_##X(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+
+DEF_VEC8_2(add8);
+DEF_VEC8_2(add16);
+DEF_VEC8_2(add32);
+
+DEF_VEC8_2(sub8);
+DEF_VEC8_2(sub16);
+DEF_VEC8_2(sub32);
+
+#undef DEF_VEC8_2
diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
index c41d38a557..f8d07090f8 100644
--- a/tcg/tcg-runtime.h
+++ b/tcg/tcg-runtime.h
@@ -134,3 +134,19 @@ GEN_ATOMIC_HELPERS(xor_fetch)
GEN_ATOMIC_HELPERS(xchg)
#undef GEN_ATOMIC_HELPERS
+
+DEF_HELPER_FLAGS_4(gvec_add8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_add16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_add32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_add64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(gvec_sub8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sub16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sub32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sub64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(gvec_and8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_or8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_xor8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_andc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_orc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
new file mode 100644
index 0000000000..6de49dc07f
--- /dev/null
+++ b/tcg/tcg-op-gvec.c
@@ -0,0 +1,443 @@
+/*
+ * Generic vector operation expansion
+ *
+ * Copyright (c) 2017 Linaro
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "cpu.h"
+#include "exec/exec-all.h"
+#include "tcg.h"
+#include "tcg-op.h"
+#include "tcg-op-gvec.h"
+#include "trace-tcg.h"
+#include "trace/mem.h"
+
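+/* Replicate a value across the byte or 16-bit lanes of a uint64_t;
+   e.g. REP8(0x80) is the sign bit of every byte lane.  */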
+#define REP8(x) ((x) * 0x0101010101010101ull)
+#define REP16(x) ((x) * 0x0001000100010001ull)
+
+#define MAX_INLINE 16
+
+static inline void check_size_s(uint32_t opsz, uint32_t clsz)
+{
+ tcg_debug_assert(opsz % 8 == 0);
+ tcg_debug_assert(clsz % 8 == 0);
+ tcg_debug_assert(opsz <= clsz);
+}
+
+static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+{
+ tcg_debug_assert(dofs % 8 == 0);
+ tcg_debug_assert(aofs % 8 == 0);
+ tcg_debug_assert(bofs % 8 == 0);
+}
+
+static inline void check_size_l(uint32_t opsz, uint32_t clsz)
+{
+ tcg_debug_assert(opsz % 16 == 0);
+ tcg_debug_assert(clsz % 16 == 0);
+ tcg_debug_assert(opsz <= clsz);
+}
+
+static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+{
+ tcg_debug_assert(dofs % 16 == 0);
+ tcg_debug_assert(aofs % 16 == 0);
+ tcg_debug_assert(bofs % 16 == 0);
+}
+
+static inline void check_overlap_3(uint32_t d, uint32_t a,
+ uint32_t b, uint32_t s)
+{
+ tcg_debug_assert(d == a || d + s <= a || a + s <= d);
+ tcg_debug_assert(d == b || d + s <= b || b + s <= d);
+ tcg_debug_assert(a == b || a + s <= b || b + s <= a);
+}
+
+static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
+{
+ if (clsz > opsz) {
+ TCGv_i64 zero = tcg_const_i64(0);
+ uint32_t i;
+
+ for (i = opsz; i < clsz; i += 8) {
+ tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
+ }
+ tcg_temp_free_i64(zero);
+ }
+}
+
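+/* Pack the operation and clearing sizes into a 32-bit descriptor
+   for the out-of-line helpers.  Each size is stored as a multiple
+   of 16 bytes, biased by -1, so that 8 bits cover sizes from 16 to
+   255 * 16 bytes; see extract_opsz/extract_clsz in
+   tcg-runtime-gvec.c for the decoding.  */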
+static TCGv_i32 make_desc(uint32_t opsz, uint32_t clsz)
+{
+ tcg_debug_assert(opsz >= 16 && opsz <= 255 * 16 && opsz % 16 == 0);
+ tcg_debug_assert(clsz >= 16 && clsz <= 255 * 16 && clsz % 16 == 0);
+ opsz /= 16;
+ clsz /= 16;
+ opsz -= 1;
+ clsz -= 1;
+ return tcg_const_i32(deposit32(opsz, 8, 8, clsz));
+}
+
+static void expand_3_o(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz,
+ void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32))
+{
+ TCGv_ptr d = tcg_temp_new_ptr();
+ TCGv_ptr a = tcg_temp_new_ptr();
+ TCGv_ptr b = tcg_temp_new_ptr();
+ TCGv_i32 desc = make_desc(opsz, clsz);
+
+ tcg_gen_addi_ptr(d, tcg_ctx.tcg_env, dofs);
+ tcg_gen_addi_ptr(a, tcg_ctx.tcg_env, aofs);
+ tcg_gen_addi_ptr(b, tcg_ctx.tcg_env, bofs);
+ fno(d, a, b, desc);
+
+ tcg_temp_free_ptr(d);
+ tcg_temp_free_ptr(a);
+ tcg_temp_free_ptr(b);
+ tcg_temp_free_i32(desc);
+}
+
+static void expand_3x4(uint32_t dofs, uint32_t aofs,
+ uint32_t bofs, uint32_t opsz,
+ void (*fni)(TCGv_i32, TCGv_i32, TCGv_i32))
+{
+ TCGv_i32 t0 = tcg_temp_new_i32();
+ uint32_t i;
+
+ if (aofs == bofs) {
+ for (i = 0; i < opsz; i += 4) {
+ tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
+ fni(t0, t0, t0);
+ tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
+ }
+ } else {
+ TCGv_i32 t1 = tcg_temp_new_i32();
+ for (i = 0; i < opsz; i += 4) {
+ tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
+ tcg_gen_ld_i32(t1, tcg_ctx.tcg_env, bofs + i);
+ fni(t0, t0, t1);
+ tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
+ }
+ tcg_temp_free_i32(t1);
+ }
+ tcg_temp_free_i32(t0);
+}
+
+static void expand_3x8(uint32_t dofs, uint32_t aofs,
+ uint32_t bofs, uint32_t opsz,
+ void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64))
+{
+ TCGv_i64 t0 = tcg_temp_new_i64();
+ uint32_t i;
+
+ if (aofs == bofs) {
+ for (i = 0; i < opsz; i += 8) {
+ tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+ fni(t0, t0, t0);
+ tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+ }
+ } else {
+ TCGv_i64 t1 = tcg_temp_new_i64();
+ for (i = 0; i < opsz; i += 8) {
+ tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+ tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
+ fni(t0, t0, t1);
+ tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+ }
+ tcg_temp_free_i64(t1);
+ }
+ tcg_temp_free_i64(t0);
+}
+
+static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint64_t data,
+ void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
+{
+ TCGv_i64 t0 = tcg_temp_new_i64();
+ TCGv_i64 t2 = tcg_const_i64(data);
+ uint32_t i;
+
+ if (aofs == bofs) {
+ for (i = 0; i < opsz; i += 8) {
+ tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+ fni(t0, t0, t0, t2);
+ tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+ }
+ } else {
+ TCGv_i64 t1 = tcg_temp_new_i64();
+ for (i = 0; i < opsz; i += 8) {
+ tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
+ tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
+ fni(t0, t0, t1, t2);
+ tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
+ }
+ tcg_temp_free_i64(t1);
+ }
+ tcg_temp_free_i64(t0);
+ tcg_temp_free_i64(t2);
+}
+
+void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
+{
+ check_overlap_3(dofs, aofs, bofs, clsz);
+ if (opsz <= MAX_INLINE) {
+ check_size_s(opsz, clsz);
+ check_align_s_3(dofs, aofs, bofs);
+ if (g->fni8) {
+ expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
+ } else if (g->fni4) {
+ expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
+ } else if (g->fni8x) {
+ expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
+ } else {
+ g_assert_not_reached();
+ }
+ expand_clr(dofs, opsz, clsz);
+ } else {
+ check_size_l(opsz, clsz);
+ check_align_l_3(dofs, aofs, bofs);
+ expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
+ }
+}
+
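+/* Element-wise addition implemented with 64-bit integer operations.
+   M is a mask of the high bit of each element, e.g. REP8(0x80) for
+   byte elements.  Clearing that bit in both inputs guarantees that
+   no carry can propagate across an element boundary during the add;
+   the true high bit of each result element, a ^ b ^ carry-in, is
+   then restored by xoring (a ^ b) & m back into the sum.  */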
+static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
+{
+ TCGv_i64 t1 = tcg_temp_new_i64();
+ TCGv_i64 t2 = tcg_temp_new_i64();
+ TCGv_i64 t3 = tcg_temp_new_i64();
+
+ tcg_gen_andc_i64(t1, a, m);
+ tcg_gen_andc_i64(t2, b, m);
+ tcg_gen_xor_i64(t3, a, b);
+ tcg_gen_add_i64(d, t1, t2);
+ tcg_gen_and_i64(t3, t3, m);
+ tcg_gen_xor_i64(d, d, t3);
+
+ tcg_temp_free_i64(t1);
+ tcg_temp_free_i64(t2);
+ tcg_temp_free_i64(t3);
+}
+
+void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .extra_value = REP8(0x80),
+ .fni8x = gen_addv_mask,
+ .fno = gen_helper_gvec_add8,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .extra_value = REP16(0x8000),
+ .fni8x = gen_addv_mask,
+ .fno = gen_helper_gvec_add16,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni4 = tcg_gen_add_i32,
+ .fno = gen_helper_gvec_add32,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni8 = tcg_gen_add_i64,
+ .fno = gen_helper_gvec_add64,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_vec8_add8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+ TCGv_i64 m = tcg_const_i64(REP8(0x80));
+ gen_addv_mask(d, a, b, m);
+ tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_add16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+ TCGv_i64 m = tcg_const_i64(REP16(0x8000));
+ gen_addv_mask(d, a, b, m);
+ tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_add32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+ TCGv_i64 t1 = tcg_temp_new_i64();
+ TCGv_i64 t2 = tcg_temp_new_i64();
+
+ tcg_gen_andi_i64(t1, a, ~0xffffffffull);
+ tcg_gen_add_i64(t2, a, b);
+ tcg_gen_add_i64(t1, t1, b);
+ tcg_gen_deposit_i64(d, t1, t2, 0, 32);
+
+ tcg_temp_free_i64(t1);
+ tcg_temp_free_i64(t2);
+}
+
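+/* As gen_addv_mask, but for subtraction.  Setting the high bit of
+   each element of A and clearing it in B guarantees that no borrow
+   can propagate across an element boundary; the true high bit of
+   each result element is restored by xoring ~(a ^ b) & m back into
+   the difference.  */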
+static void gen_subv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
+{
+ TCGv_i64 t1 = tcg_temp_new_i64();
+ TCGv_i64 t2 = tcg_temp_new_i64();
+ TCGv_i64 t3 = tcg_temp_new_i64();
+
+ tcg_gen_or_i64(t1, a, m);
+ tcg_gen_andc_i64(t2, b, m);
+ tcg_gen_eqv_i64(t3, a, b);
+ tcg_gen_sub_i64(d, t1, t2);
+ tcg_gen_and_i64(t3, t3, m);
+ tcg_gen_xor_i64(d, d, t3);
+
+ tcg_temp_free_i64(t1);
+ tcg_temp_free_i64(t2);
+ tcg_temp_free_i64(t3);
+}
+
+void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .extra_value = REP8(0x80),
+ .fni8x = gen_subv_mask,
+ .fno = gen_helper_gvec_sub8,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .extra_value = REP16(0x8000),
+ .fni8x = gen_subv_mask,
+ .fno = gen_helper_gvec_sub16,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni4 = tcg_gen_sub_i32,
+ .fno = gen_helper_gvec_sub32,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni8 = tcg_gen_sub_i64,
+ .fno = gen_helper_gvec_sub64,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_vec8_sub8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+ TCGv_i64 m = tcg_const_i64(REP8(0x80));
+ gen_subv_mask(d, a, b, m);
+ tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_sub16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+ TCGv_i64 m = tcg_const_i64(REP16(0x8000));
+ gen_subv_mask(d, a, b, m);
+ tcg_temp_free_i64(m);
+}
+
+void tcg_gen_vec8_sub32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+ TCGv_i64 t1 = tcg_temp_new_i64();
+ TCGv_i64 t2 = tcg_temp_new_i64();
+
+ tcg_gen_andi_i64(t1, b, ~0xffffffffull);
+ tcg_gen_sub_i64(t2, a, b);
+ tcg_gen_sub_i64(t1, a, t1);
+ tcg_gen_deposit_i64(d, t1, t2, 0, 32);
+
+ tcg_temp_free_i64(t1);
+ tcg_temp_free_i64(t2);
+}
+
+void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni8 = tcg_gen_and_i64,
+ .fno = gen_helper_gvec_and8,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni8 = tcg_gen_or_i64,
+ .fno = gen_helper_gvec_or8,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni8 = tcg_gen_xor_i64,
+ .fno = gen_helper_gvec_xor8,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni8 = tcg_gen_andc_i64,
+ .fno = gen_helper_gvec_andc8,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
+
+void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t clsz)
+{
+ static const GVecGen3 g = {
+ .fni8 = tcg_gen_orc_i64,
+ .fno = gen_helper_gvec_orc8,
+ };
+ tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
+}
diff --git a/tcg/tcg-runtime-gvec.c b/tcg/tcg-runtime-gvec.c
new file mode 100644
index 0000000000..9a37ce07a2
--- /dev/null
+++ b/tcg/tcg-runtime-gvec.c
@@ -0,0 +1,199 @@
+/*
+ * Generic vectorized operation runtime
+ *
+ * Copyright (c) 2017 Linaro
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+#include "cpu.h"
+#include "exec/helper-proto.h"
+
+/* Virtually all hosts support 16-byte vectors. Those that don't
+ can emulate them via GCC's generic vector extension.
+
+ In tcg-op-gvec.c, we asserted that both the size and alignment
+ of the data are multiples of 16. */
+
+typedef uint8_t vec8 __attribute__((vector_size(16)));
+typedef uint16_t vec16 __attribute__((vector_size(16)));
+typedef uint32_t vec32 __attribute__((vector_size(16)));
+typedef uint64_t vec64 __attribute__((vector_size(16)));
+
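+/* Decode the descriptor built by make_desc in tcg-op-gvec.c:
+   (opsz / 16) - 1 in bits [7:0], (clsz / 16) - 1 in bits [15:8].  */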
+static inline intptr_t extract_opsz(uint32_t desc)
+{
+ return ((desc & 0xff) + 1) * 16;
+}
+
+static inline intptr_t extract_clsz(uint32_t desc)
+{
+ return (((desc >> 8) & 0xff) + 1) * 16;
+}
+
+static inline void clear_high(void *d, intptr_t opsz, uint32_t desc)
+{
+ intptr_t clsz = extract_clsz(desc);
+ intptr_t i;
+
+ if (unlikely(clsz > opsz)) {
+ for (i = opsz; i < clsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = (vec64){ 0 };
+ }
+ }
+}
+
+void HELPER(gvec_add8)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec8)) {
+ *(vec8 *)(d + i) = *(vec8 *)(a + i) + *(vec8 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_add16)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec16)) {
+ *(vec16 *)(d + i) = *(vec16 *)(a + i) + *(vec16 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_add32)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec32)) {
+ *(vec32 *)(d + i) = *(vec32 *)(a + i) + *(vec32 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_add64)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = *(vec64 *)(a + i) + *(vec64 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub8)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec8)) {
+ *(vec8 *)(d + i) = *(vec8 *)(a + i) - *(vec8 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub16)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec16)) {
+ *(vec16 *)(d + i) = *(vec16 *)(a + i) - *(vec16 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub32)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec32)) {
+ *(vec32 *)(d + i) = *(vec32 *)(a + i) - *(vec32 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_sub64)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = *(vec64 *)(a + i) - *(vec64 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_and8)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = *(vec64 *)(a + i) & *(vec64 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_or8)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = *(vec64 *)(a + i) | *(vec64 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_xor8)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = *(vec64 *)(a + i) ^ *(vec64 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_andc8)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = *(vec64 *)(a + i) &~ *(vec64 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
+
+void HELPER(gvec_orc8)(void *d, void *a, void *b, uint32_t desc)
+{
+ intptr_t opsz = extract_opsz(desc);
+ intptr_t i;
+
+ for (i = 0; i < opsz; i += sizeof(vec64)) {
+ *(vec64 *)(d + i) = *(vec64 *)(a + i) |~ *(vec64 *)(b + i);
+ }
+ clear_high(d, opsz, desc);
+}
--
2.13.5
* [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-09-07 16:58 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
` (6 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/translate-a64.c | 137 ++++++++++++++++++++++++++++-----------------
1 file changed, 87 insertions(+), 50 deletions(-)
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2200e25be0..025354f983 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -21,6 +21,7 @@
#include "cpu.h"
#include "exec/exec-all.h"
#include "tcg-op.h"
+#include "tcg-op-gvec.h"
#include "qemu/log.h"
#include "arm_ldst.h"
#include "translate.h"
@@ -82,6 +83,7 @@ typedef void NeonGenTwoDoubleOPFn(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_ptr);
typedef void NeonGenOneOpFn(TCGv_i64, TCGv_i64);
typedef void CryptoTwoOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32);
typedef void CryptoThreeOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32, TCGv_i32);
+typedef void GVecGenTwoFn(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t);
/* initialize TCG globals. */
void a64_translate_init(void)
@@ -537,6 +539,21 @@ static inline int vec_reg_offset(DisasContext *s, int regno,
return offs;
}
+/* Return the offset into CPUARMState of the "whole" vector register Qn. */
+static inline int vec_full_reg_offset(DisasContext *s, int regno)
+{
+ assert_fp_access_checked(s);
+ return offsetof(CPUARMState, vfp.regs[regno * 2]);
+}
+
+/* Return the byte size of the "whole" vector register, VL / 8. */
+static inline int vec_full_reg_size(DisasContext *s)
+{
+ /* FIXME SVE: We should put the composite ZCR_EL* value into tb->flags.
+ In the meantime this is just the AdvSIMD length of 128. */
+ return 128 / 8;
+}
+
/* Return the offset into CPUARMState of a slice (from
* the least significant end) of FP register Qn (ie
* Dn, Sn, Hn or Bn).
@@ -9042,11 +9059,38 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
bool is_q = extract32(insn, 30, 1);
TCGv_i64 tcg_op1, tcg_op2, tcg_res[2];
int pass;
+ GVecGenTwoFn *gvec_op;
if (!fp_access_check(s)) {
return;
}
+ switch (size + 4 * is_u) {
+ case 0: /* AND */
+ gvec_op = tcg_gen_gvec_and8;
+ goto do_gvec;
+ case 1: /* BIC */
+ gvec_op = tcg_gen_gvec_andc8;
+ goto do_gvec;
+ case 2: /* ORR */
+ gvec_op = tcg_gen_gvec_or8;
+ goto do_gvec;
+ case 3: /* ORN */
+ gvec_op = tcg_gen_gvec_orc8;
+ goto do_gvec;
+ case 4: /* EOR */
+ gvec_op = tcg_gen_gvec_xor8;
+ goto do_gvec;
+ do_gvec:
+ gvec_op(vec_full_reg_offset(s, rd),
+ vec_full_reg_offset(s, rn),
+ vec_full_reg_offset(s, rm),
+ is_q ? 16 : 8, vec_full_reg_size(s));
+ return;
+ }
+
+ /* Note that we've now eliminated all !is_u. */
+
tcg_op1 = tcg_temp_new_i64();
tcg_op2 = tcg_temp_new_i64();
tcg_res[0] = tcg_temp_new_i64();
@@ -9056,47 +9100,27 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
read_vec_element(s, tcg_op1, rn, pass, MO_64);
read_vec_element(s, tcg_op2, rm, pass, MO_64);
- if (!is_u) {
- switch (size) {
- case 0: /* AND */
- tcg_gen_and_i64(tcg_res[pass], tcg_op1, tcg_op2);
- break;
- case 1: /* BIC */
- tcg_gen_andc_i64(tcg_res[pass], tcg_op1, tcg_op2);
- break;
- case 2: /* ORR */
- tcg_gen_or_i64(tcg_res[pass], tcg_op1, tcg_op2);
- break;
- case 3: /* ORN */
- tcg_gen_orc_i64(tcg_res[pass], tcg_op1, tcg_op2);
- break;
- }
- } else {
- if (size != 0) {
- /* B* ops need res loaded to operate on */
- read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
- }
+ /* B* ops need res loaded to operate on */
+ read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
- switch (size) {
- case 0: /* EOR */
- tcg_gen_xor_i64(tcg_res[pass], tcg_op1, tcg_op2);
- break;
- case 1: /* BSL bitwise select */
- tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
- tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
- tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
- break;
- case 2: /* BIT, bitwise insert if true */
- tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
- tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
- tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
- break;
- case 3: /* BIF, bitwise insert if false */
- tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
- tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
- tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
- break;
- }
+ switch (size) {
+ case 1: /* BSL bitwise select */
+ tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
+ tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
+ tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
+ break;
+ case 2: /* BIT, bitwise insert if true */
+ tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
+ tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
+ tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
+ break;
+ case 3: /* BIF, bitwise insert if false */
+ tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
+ tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
+ tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
+ break;
+ default:
+ g_assert_not_reached();
}
}
@@ -9370,6 +9394,7 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
int rn = extract32(insn, 5, 5);
int rd = extract32(insn, 0, 5);
int pass;
+ GVecGenTwoFn *gvec_op;
switch (opcode) {
case 0x13: /* MUL, PMUL */
@@ -9409,6 +9434,28 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
return;
}
+ switch (opcode) {
+ case 0x10: /* ADD, SUB */
+ {
+ static GVecGenTwoFn * const fns[4][2] = {
+ { tcg_gen_gvec_add8, tcg_gen_gvec_sub8 },
+ { tcg_gen_gvec_add16, tcg_gen_gvec_sub16 },
+ { tcg_gen_gvec_add32, tcg_gen_gvec_sub32 },
+ { tcg_gen_gvec_add64, tcg_gen_gvec_sub64 },
+ };
+ gvec_op = fns[size][u];
+ goto do_gvec;
+ }
+ break;
+
+ do_gvec:
+ gvec_op(vec_full_reg_offset(s, rd),
+ vec_full_reg_offset(s, rn),
+ vec_full_reg_offset(s, rm),
+ is_q ? 16 : 8, vec_full_reg_size(s));
+ return;
+ }
+
if (size == 3) {
assert(is_q);
for (pass = 0; pass < 2; pass++) {
@@ -9581,16 +9628,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
genfn = fns[size][u];
break;
}
- case 0x10: /* ADD, SUB */
- {
- static NeonGenTwoOpFn * const fns[3][2] = {
- { gen_helper_neon_add_u8, gen_helper_neon_sub_u8 },
- { gen_helper_neon_add_u16, gen_helper_neon_sub_u16 },
- { tcg_gen_add_i32, tcg_gen_sub_i32 },
- };
- genfn = fns[size][u];
- break;
- }
case 0x11: /* CMTST, CMEQ */
{
static NeonGenTwoOpFn * const fns[3][2] = {
--
2.13.5
* [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-08-17 23:46 ` Philippe Mathieu-Daudé
2017-09-07 18:18 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
` (5 subsequent siblings)
8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Nothing uses or enables them yet.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg.h | 5 +++++
tcg/tcg.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index dd97095af5..1277caed3d 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -256,6 +256,11 @@ typedef struct TCGPool {
typedef enum TCGType {
TCG_TYPE_I32,
TCG_TYPE_I64,
+
+ TCG_TYPE_V64,
+ TCG_TYPE_V128,
+ TCG_TYPE_V256,
+
TCG_TYPE_COUNT, /* number of different types */
/* An alias for the size of the host register. */
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 787c8ba0f7..ea78d47fad 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -118,7 +118,7 @@ static TCGReg tcg_reg_alloc_new(TCGContext *s, TCGType t)
static bool tcg_out_ldst_finalize(TCGContext *s);
#endif
-static TCGRegSet tcg_target_available_regs[2];
+static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
static TCGRegSet tcg_target_call_clobber_regs;
#if TCG_TARGET_INSN_UNIT_SIZE == 1
--
2.13.5
* [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
` (2 preceding siblings ...)
2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-08-30 1:34 ` Philippe Mathieu-Daudé
2017-09-07 19:00 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
` (4 subsequent siblings)
8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Nothing uses or implements them yet.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tcg/tcg.h | 24 ++++++++++++++++
2 files changed, 113 insertions(+)
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 956fb1e9f3..9162125fac 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
#undef TLADDR_ARGS
#undef DATA64_ARGS
+
+/* Host integer vector operations. */
+/* These opcodes are required whenever the base vector size is enabled. */
+
+DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
+DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
+DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
+DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
+DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
+DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
+DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+
+DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+
+DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
+
+DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
+
+DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
+
+/* These opcodes are optional.
+ All element counts must be supported if any are. */
+
+DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
+DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
+DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
+
+DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
+DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
+DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
+
+DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
+DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
+DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
+
+DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
+DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
+DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
+
+DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
+
+DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
+
#undef IMPL
#undef IMPL64
#undef DEF
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1277caed3d..b9e15da13b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
#define TCG_TARGET_HAS_rem_i64 0
#endif
+#ifndef TCG_TARGET_HAS_v64
+#define TCG_TARGET_HAS_v64 0
+#define TCG_TARGET_HAS_andc_v64 0
+#define TCG_TARGET_HAS_orc_v64 0
+#define TCG_TARGET_HAS_not_v64 0
+#define TCG_TARGET_HAS_neg_v64 0
+#endif
+
+#ifndef TCG_TARGET_HAS_v128
+#define TCG_TARGET_HAS_v128 0
+#define TCG_TARGET_HAS_andc_v128 0
+#define TCG_TARGET_HAS_orc_v128 0
+#define TCG_TARGET_HAS_not_v128 0
+#define TCG_TARGET_HAS_neg_v128 0
+#endif
+
+#ifndef TCG_TARGET_HAS_v256
+#define TCG_TARGET_HAS_v256 0
+#define TCG_TARGET_HAS_andc_v256 0
+#define TCG_TARGET_HAS_orc_v256 0
+#define TCG_TARGET_HAS_not_v256 0
+#define TCG_TARGET_HAS_neg_v256 0
+#endif
+
/* For 32-bit targets, some sort of unsigned widening multiply is required. */
#if TCG_TARGET_REG_BITS == 32 \
&& !(defined(TCG_TARGET_HAS_mulu2_i32) \
--
2.13.5
* [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
` (3 preceding siblings ...)
2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-08-17 23:44 ` Philippe Mathieu-Daudé
2017-09-07 19:02 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
` (3 subsequent siblings)
8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg.h | 2 +
tcg/tcg.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 312 insertions(+)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index b9e15da13b..b443143b21 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -962,6 +962,8 @@ do {\
#define tcg_temp_free_ptr(T) tcg_temp_free_i64(TCGV_PTR_TO_NAT(T))
#endif
+bool tcg_op_supported(TCGOpcode op);
+
void tcg_gen_callN(TCGContext *s, void *func,
TCGArg ret, int nargs, TCGArg *args);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index ea78d47fad..3c3cdda938 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -751,6 +751,316 @@ int tcg_check_temp_count(void)
}
#endif
+/* Return true if OP may appear in the opcode stream.
+ Test the runtime variable that controls each opcode. */
+bool tcg_op_supported(TCGOpcode op)
+{
+ switch (op) {
+ case INDEX_op_discard:
+ case INDEX_op_set_label:
+ case INDEX_op_call:
+ case INDEX_op_br:
+ case INDEX_op_mb:
+ case INDEX_op_insn_start:
+ case INDEX_op_exit_tb:
+ case INDEX_op_goto_tb:
+ case INDEX_op_qemu_ld_i32:
+ case INDEX_op_qemu_st_i32:
+ case INDEX_op_qemu_ld_i64:
+ case INDEX_op_qemu_st_i64:
+ return true;
+
+ case INDEX_op_goto_ptr:
+ return TCG_TARGET_HAS_goto_ptr;
+
+ case INDEX_op_mov_i32:
+ case INDEX_op_movi_i32:
+ case INDEX_op_setcond_i32:
+ case INDEX_op_brcond_i32:
+ case INDEX_op_ld8u_i32:
+ case INDEX_op_ld8s_i32:
+ case INDEX_op_ld16u_i32:
+ case INDEX_op_ld16s_i32:
+ case INDEX_op_ld_i32:
+ case INDEX_op_st8_i32:
+ case INDEX_op_st16_i32:
+ case INDEX_op_st_i32:
+ case INDEX_op_add_i32:
+ case INDEX_op_sub_i32:
+ case INDEX_op_mul_i32:
+ case INDEX_op_and_i32:
+ case INDEX_op_or_i32:
+ case INDEX_op_xor_i32:
+ case INDEX_op_shl_i32:
+ case INDEX_op_shr_i32:
+ case INDEX_op_sar_i32:
+ return true;
+
+ case INDEX_op_movcond_i32:
+ return TCG_TARGET_HAS_movcond_i32;
+ case INDEX_op_div_i32:
+ case INDEX_op_divu_i32:
+ return TCG_TARGET_HAS_div_i32;
+ case INDEX_op_rem_i32:
+ case INDEX_op_remu_i32:
+ return TCG_TARGET_HAS_rem_i32;
+ case INDEX_op_div2_i32:
+ case INDEX_op_divu2_i32:
+ return TCG_TARGET_HAS_div2_i32;
+ case INDEX_op_rotl_i32:
+ case INDEX_op_rotr_i32:
+ return TCG_TARGET_HAS_rot_i32;
+ case INDEX_op_deposit_i32:
+ return TCG_TARGET_HAS_deposit_i32;
+ case INDEX_op_extract_i32:
+ return TCG_TARGET_HAS_extract_i32;
+ case INDEX_op_sextract_i32:
+ return TCG_TARGET_HAS_sextract_i32;
+ case INDEX_op_add2_i32:
+ return TCG_TARGET_HAS_add2_i32;
+ case INDEX_op_sub2_i32:
+ return TCG_TARGET_HAS_sub2_i32;
+ case INDEX_op_mulu2_i32:
+ return TCG_TARGET_HAS_mulu2_i32;
+ case INDEX_op_muls2_i32:
+ return TCG_TARGET_HAS_muls2_i32;
+ case INDEX_op_muluh_i32:
+ return TCG_TARGET_HAS_muluh_i32;
+ case INDEX_op_mulsh_i32:
+ return TCG_TARGET_HAS_mulsh_i32;
+ case INDEX_op_ext8s_i32:
+ return TCG_TARGET_HAS_ext8s_i32;
+ case INDEX_op_ext16s_i32:
+ return TCG_TARGET_HAS_ext16s_i32;
+ case INDEX_op_ext8u_i32:
+ return TCG_TARGET_HAS_ext8u_i32;
+ case INDEX_op_ext16u_i32:
+ return TCG_TARGET_HAS_ext16u_i32;
+ case INDEX_op_bswap16_i32:
+ return TCG_TARGET_HAS_bswap16_i32;
+ case INDEX_op_bswap32_i32:
+ return TCG_TARGET_HAS_bswap32_i32;
+ case INDEX_op_not_i32:
+ return TCG_TARGET_HAS_not_i32;
+ case INDEX_op_neg_i32:
+ return TCG_TARGET_HAS_neg_i32;
+ case INDEX_op_andc_i32:
+ return TCG_TARGET_HAS_andc_i32;
+ case INDEX_op_orc_i32:
+ return TCG_TARGET_HAS_orc_i32;
+ case INDEX_op_eqv_i32:
+ return TCG_TARGET_HAS_eqv_i32;
+ case INDEX_op_nand_i32:
+ return TCG_TARGET_HAS_nand_i32;
+ case INDEX_op_nor_i32:
+ return TCG_TARGET_HAS_nor_i32;
+ case INDEX_op_clz_i32:
+ return TCG_TARGET_HAS_clz_i32;
+ case INDEX_op_ctz_i32:
+ return TCG_TARGET_HAS_ctz_i32;
+ case INDEX_op_ctpop_i32:
+ return TCG_TARGET_HAS_ctpop_i32;
+
+ case INDEX_op_brcond2_i32:
+ case INDEX_op_setcond2_i32:
+ return TCG_TARGET_REG_BITS == 32;
+
+ case INDEX_op_mov_i64:
+ case INDEX_op_movi_i64:
+ case INDEX_op_setcond_i64:
+ case INDEX_op_brcond_i64:
+ case INDEX_op_ld8u_i64:
+ case INDEX_op_ld8s_i64:
+ case INDEX_op_ld16u_i64:
+ case INDEX_op_ld16s_i64:
+ case INDEX_op_ld32u_i64:
+ case INDEX_op_ld32s_i64:
+ case INDEX_op_ld_i64:
+ case INDEX_op_st8_i64:
+ case INDEX_op_st16_i64:
+ case INDEX_op_st32_i64:
+ case INDEX_op_st_i64:
+ case INDEX_op_add_i64:
+ case INDEX_op_sub_i64:
+ case INDEX_op_mul_i64:
+ case INDEX_op_and_i64:
+ case INDEX_op_or_i64:
+ case INDEX_op_xor_i64:
+ case INDEX_op_shl_i64:
+ case INDEX_op_shr_i64:
+ case INDEX_op_sar_i64:
+ case INDEX_op_ext_i32_i64:
+ case INDEX_op_extu_i32_i64:
+ return TCG_TARGET_REG_BITS == 64;
+
+ case INDEX_op_movcond_i64:
+ return TCG_TARGET_HAS_movcond_i64;
+ case INDEX_op_div_i64:
+ case INDEX_op_divu_i64:
+ return TCG_TARGET_HAS_div_i64;
+ case INDEX_op_rem_i64:
+ case INDEX_op_remu_i64:
+ return TCG_TARGET_HAS_rem_i64;
+ case INDEX_op_div2_i64:
+ case INDEX_op_divu2_i64:
+ return TCG_TARGET_HAS_div2_i64;
+ case INDEX_op_rotl_i64:
+ case INDEX_op_rotr_i64:
+ return TCG_TARGET_HAS_rot_i64;
+ case INDEX_op_deposit_i64:
+ return TCG_TARGET_HAS_deposit_i64;
+ case INDEX_op_extract_i64:
+ return TCG_TARGET_HAS_extract_i64;
+ case INDEX_op_sextract_i64:
+ return TCG_TARGET_HAS_sextract_i64;
+ case INDEX_op_extrl_i64_i32:
+ return TCG_TARGET_HAS_extrl_i64_i32;
+ case INDEX_op_extrh_i64_i32:
+ return TCG_TARGET_HAS_extrh_i64_i32;
+ case INDEX_op_ext8s_i64:
+ return TCG_TARGET_HAS_ext8s_i64;
+ case INDEX_op_ext16s_i64:
+ return TCG_TARGET_HAS_ext16s_i64;
+ case INDEX_op_ext32s_i64:
+ return TCG_TARGET_HAS_ext32s_i64;
+ case INDEX_op_ext8u_i64:
+ return TCG_TARGET_HAS_ext8u_i64;
+ case INDEX_op_ext16u_i64:
+ return TCG_TARGET_HAS_ext16u_i64;
+ case INDEX_op_ext32u_i64:
+ return TCG_TARGET_HAS_ext32u_i64;
+ case INDEX_op_bswap16_i64:
+ return TCG_TARGET_HAS_bswap16_i64;
+ case INDEX_op_bswap32_i64:
+ return TCG_TARGET_HAS_bswap32_i64;
+ case INDEX_op_bswap64_i64:
+ return TCG_TARGET_HAS_bswap64_i64;
+ case INDEX_op_not_i64:
+ return TCG_TARGET_HAS_not_i64;
+ case INDEX_op_neg_i64:
+ return TCG_TARGET_HAS_neg_i64;
+ case INDEX_op_andc_i64:
+ return TCG_TARGET_HAS_andc_i64;
+ case INDEX_op_orc_i64:
+ return TCG_TARGET_HAS_orc_i64;
+ case INDEX_op_eqv_i64:
+ return TCG_TARGET_HAS_eqv_i64;
+ case INDEX_op_nand_i64:
+ return TCG_TARGET_HAS_nand_i64;
+ case INDEX_op_nor_i64:
+ return TCG_TARGET_HAS_nor_i64;
+ case INDEX_op_clz_i64:
+ return TCG_TARGET_HAS_clz_i64;
+ case INDEX_op_ctz_i64:
+ return TCG_TARGET_HAS_ctz_i64;
+ case INDEX_op_ctpop_i64:
+ return TCG_TARGET_HAS_ctpop_i64;
+ case INDEX_op_add2_i64:
+ return TCG_TARGET_HAS_add2_i64;
+ case INDEX_op_sub2_i64:
+ return TCG_TARGET_HAS_sub2_i64;
+ case INDEX_op_mulu2_i64:
+ return TCG_TARGET_HAS_mulu2_i64;
+ case INDEX_op_muls2_i64:
+ return TCG_TARGET_HAS_muls2_i64;
+ case INDEX_op_muluh_i64:
+ return TCG_TARGET_HAS_muluh_i64;
+ case INDEX_op_mulsh_i64:
+ return TCG_TARGET_HAS_mulsh_i64;
+
+ case INDEX_op_mov_v64:
+ case INDEX_op_movi_v64:
+ case INDEX_op_ld_v64:
+ case INDEX_op_st_v64:
+ case INDEX_op_and_v64:
+ case INDEX_op_or_v64:
+ case INDEX_op_xor_v64:
+ case INDEX_op_add8_v64:
+ case INDEX_op_add16_v64:
+ case INDEX_op_add32_v64:
+ case INDEX_op_sub8_v64:
+ case INDEX_op_sub16_v64:
+ case INDEX_op_sub32_v64:
+ return TCG_TARGET_HAS_v64;
+
+ case INDEX_op_mov_v128:
+ case INDEX_op_movi_v128:
+ case INDEX_op_ld_v128:
+ case INDEX_op_st_v128:
+ case INDEX_op_and_v128:
+ case INDEX_op_or_v128:
+ case INDEX_op_xor_v128:
+ case INDEX_op_add8_v128:
+ case INDEX_op_add16_v128:
+ case INDEX_op_add32_v128:
+ case INDEX_op_add64_v128:
+ case INDEX_op_sub8_v128:
+ case INDEX_op_sub16_v128:
+ case INDEX_op_sub32_v128:
+ case INDEX_op_sub64_v128:
+ return TCG_TARGET_HAS_v128;
+
+ case INDEX_op_mov_v256:
+ case INDEX_op_movi_v256:
+ case INDEX_op_ld_v256:
+ case INDEX_op_st_v256:
+ case INDEX_op_and_v256:
+ case INDEX_op_or_v256:
+ case INDEX_op_xor_v256:
+ case INDEX_op_add8_v256:
+ case INDEX_op_add16_v256:
+ case INDEX_op_add32_v256:
+ case INDEX_op_add64_v256:
+ case INDEX_op_sub8_v256:
+ case INDEX_op_sub16_v256:
+ case INDEX_op_sub32_v256:
+ case INDEX_op_sub64_v256:
+ return TCG_TARGET_HAS_v256;
+
+ case INDEX_op_not_v64:
+ return TCG_TARGET_HAS_not_v64;
+ case INDEX_op_not_v128:
+ return TCG_TARGET_HAS_not_v128;
+ case INDEX_op_not_v256:
+ return TCG_TARGET_HAS_not_v256;
+
+ case INDEX_op_andc_v64:
+ return TCG_TARGET_HAS_andc_v64;
+ case INDEX_op_andc_v128:
+ return TCG_TARGET_HAS_andc_v128;
+ case INDEX_op_andc_v256:
+ return TCG_TARGET_HAS_andc_v256;
+
+ case INDEX_op_orc_v64:
+ return TCG_TARGET_HAS_orc_v64;
+ case INDEX_op_orc_v128:
+ return TCG_TARGET_HAS_orc_v128;
+ case INDEX_op_orc_v256:
+ return TCG_TARGET_HAS_orc_v256;
+
+ case INDEX_op_neg8_v64:
+ case INDEX_op_neg16_v64:
+ case INDEX_op_neg32_v64:
+ return TCG_TARGET_HAS_neg_v64;
+
+ case INDEX_op_neg8_v128:
+ case INDEX_op_neg16_v128:
+ case INDEX_op_neg32_v128:
+ case INDEX_op_neg64_v128:
+ return TCG_TARGET_HAS_neg_v128;
+
+ case INDEX_op_neg8_v256:
+ case INDEX_op_neg16_v256:
+ case INDEX_op_neg32_v256:
+ case INDEX_op_neg64_v256:
+ return TCG_TARGET_HAS_neg_v256;
+
+ case NB_OPS:
+ break;
+ }
+ g_assert_not_reached();
+}
+
/* Note: we convert the 64 bit args to 32 bit and do some alignment
and endian swap. Maybe it would be better to do the alignment
and endian swap in tcg_reg_alloc_call(). */
--
2.13.5
* [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
` (4 preceding siblings ...)
2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-08-17 23:45 ` Philippe Mathieu-Daudé
2017-09-08 9:30 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
` (2 subsequent siblings)
8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Add it with value 0 so that zero-initializing a structure can
indicate that the field is not present.
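As an illustration (a sketch, not part of the patch: the GVecGen3
opcode fields are added in patch 7, and this is presumably how they
are meant to be consumed), zero initialization leaves an unset opcode
field at INDEX_op_invalid, which tcg_op_supported() rejects:

    static const GVecGen3 g = {
        .fni8 = tcg_gen_and_i64,
        .fno = gen_helper_gvec_and8,
        /* .op_v128 left unset, i.e. INDEX_op_invalid (0) */
    };

    if (tcg_op_supported(g.op_v128)) {
        /* emit the host vector opcode */
    }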
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-opc.h | 2 ++
tcg/tcg.c | 3 +++
2 files changed, 5 insertions(+)
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 9162125fac..b1445a4c24 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -26,6 +26,8 @@
* DEF(name, oargs, iargs, cargs, flags)
*/
+DEF(invalid, 0, 0, 0, TCG_OPF_NOT_PRESENT)
+
/* predefined ops */
DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 3c3cdda938..879b29e81f 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -756,6 +756,9 @@ int tcg_check_temp_count(void)
bool tcg_op_supported(TCGOpcode op)
{
switch (op) {
+ case INDEX_op_invalid:
+ return false;
+
case INDEX_op_discard:
case INDEX_op_set_label:
case INDEX_op_call:
--
2.13.5
* [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
` (5 preceding siblings ...)
2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-09-08 9:34 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
8 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-op-gvec.h | 4 +
tcg/tcg.h | 6 +-
tcg/tcg-op-gvec.c | 230 +++++++++++++++++++++++++++++++++++++++++++-----------
tcg/tcg.c | 8 +-
4 files changed, 197 insertions(+), 51 deletions(-)
diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
index 10db3599a5..99f36d208e 100644
--- a/tcg/tcg-op-gvec.h
+++ b/tcg/tcg-op-gvec.h
@@ -40,6 +40,10 @@ typedef struct {
/* Similarly, but load up a constant and re-use across lanes. */
void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
uint64_t extra_value;
+ /* Operations with host vector ops. */
+ TCGOpcode op_v256;
+ TCGOpcode op_v128;
+ TCGOpcode op_v64;
/* Larger sizes: expand out-of-line helper w/size descriptor. */
void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
} GVecGen3;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index b443143b21..7f10501d31 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -825,9 +825,11 @@ int tcg_global_mem_new_internal(TCGType, TCGv_ptr, intptr_t, const char *);
TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name);
TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
-TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
-TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
+int tcg_temp_new_internal(TCGType type, bool temp_local);
+TCGv_i32 tcg_temp_new_internal_i32(bool temp_local);
+TCGv_i64 tcg_temp_new_internal_i64(bool temp_local);
+void tcg_temp_free_internal(int arg);
void tcg_temp_free_i32(TCGv_i32 arg);
void tcg_temp_free_i64(TCGv_i64 arg);
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 6de49dc07f..3aca565dc0 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -30,54 +30,73 @@
#define REP8(x) ((x) * 0x0101010101010101ull)
#define REP16(x) ((x) * 0x0001000100010001ull)
-#define MAX_INLINE 16
+#define MAX_UNROLL 4
-static inline void check_size_s(uint32_t opsz, uint32_t clsz)
+static inline void check_size_align(uint32_t opsz, uint32_t clsz, uint32_t ofs)
{
- tcg_debug_assert(opsz % 8 == 0);
- tcg_debug_assert(clsz % 8 == 0);
+ uint32_t align = clsz > 16 || opsz >= 16 ? 15 : 7;
+ tcg_debug_assert(opsz > 0);
tcg_debug_assert(opsz <= clsz);
+ tcg_debug_assert((opsz & align) == 0);
+ tcg_debug_assert((clsz & align) == 0);
+ tcg_debug_assert((ofs & align) == 0);
}
-static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+static inline void check_overlap_3(uint32_t d, uint32_t a,
+ uint32_t b, uint32_t s)
{
- tcg_debug_assert(dofs % 8 == 0);
- tcg_debug_assert(aofs % 8 == 0);
- tcg_debug_assert(bofs % 8 == 0);
+ tcg_debug_assert(d == a || d + s <= a || a + s <= d);
+ tcg_debug_assert(d == b || d + s <= b || b + s <= d);
+ tcg_debug_assert(a == b || a + s <= b || b + s <= a);
}
-static inline void check_size_l(uint32_t opsz, uint32_t clsz)
+static inline bool check_size_impl(uint32_t opsz, uint32_t lnsz)
{
- tcg_debug_assert(opsz % 16 == 0);
- tcg_debug_assert(clsz % 16 == 0);
- tcg_debug_assert(opsz <= clsz);
+ uint32_t lnct = opsz / lnsz;
+ return lnct >= 1 && lnct <= MAX_UNROLL;
}
-static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
+static void expand_clr_v(uint32_t dofs, uint32_t clsz, uint32_t lnsz,
+ TCGType type, TCGOpcode opc_mv, TCGOpcode opc_st)
{
- tcg_debug_assert(dofs % 16 == 0);
- tcg_debug_assert(aofs % 16 == 0);
- tcg_debug_assert(bofs % 16 == 0);
-}
+ TCGArg t0 = tcg_temp_new_internal(type, 0);
+ TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+ uint32_t i;
-static inline void check_overlap_3(uint32_t d, uint32_t a,
- uint32_t b, uint32_t s)
-{
- tcg_debug_assert(d == a || d + s <= a || a + s <= d);
- tcg_debug_assert(d == b || d + s <= b || b + s <= d);
- tcg_debug_assert(a == b || a + s <= b || b + s <= a);
+ tcg_gen_op2(&tcg_ctx, opc_mv, t0, 0);
+ for (i = 0; i < clsz; i += lnsz) {
+ tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
+ }
+ tcg_temp_free_internal(t0);
}
-static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
+static void expand_clr(uint32_t dofs, uint32_t clsz)
{
- if (clsz > opsz) {
- TCGv_i64 zero = tcg_const_i64(0);
- uint32_t i;
+ if (clsz >= 32 && TCG_TARGET_HAS_v256) {
+ uint32_t done = QEMU_ALIGN_DOWN(clsz, 32);
+ expand_clr_v(dofs, done, 32, TCG_TYPE_V256,
+ INDEX_op_movi_v256, INDEX_op_st_v256);
+ dofs += done;
+ clsz -= done;
+ }
- for (i = opsz; i < clsz; i += 8) {
- tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
- }
- tcg_temp_free_i64(zero);
+ if (clsz >= 16 && TCG_TARGET_HAS_v128) {
+ uint32_t done = QEMU_ALIGN_DOWN(clsz, 16);
+ expand_clr_v(dofs, done, 16, TCG_TYPE_V128,
+ INDEX_op_movi_v128, INDEX_op_st_v128);
+ dofs += done;
+ clsz -= done;
+ }
+
+ if (TCG_TARGET_REG_BITS == 64) {
+ expand_clr_v(dofs, clsz, 8, TCG_TYPE_I64,
+ INDEX_op_movi_i64, INDEX_op_st_i64);
+ } else if (TCG_TARGET_HAS_v64) {
+ expand_clr_v(dofs, clsz, 8, TCG_TYPE_V64,
+ INDEX_op_movi_v64, INDEX_op_st_v64);
+ } else {
+ expand_clr_v(dofs, clsz, 4, TCG_TYPE_I32,
+ INDEX_op_movi_i32, INDEX_op_st_i32);
}
}
@@ -164,6 +183,7 @@ static void expand_3x8(uint32_t dofs, uint32_t aofs,
tcg_temp_free_i64(t0);
}
+/* FIXME: add CSE for constants and we can eliminate this. */
static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t opsz, uint64_t data,
void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
@@ -192,28 +212,111 @@ static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
tcg_temp_free_i64(t2);
}
+static void expand_3_v(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t opsz, uint32_t lnsz, TCGType type,
+ TCGOpcode opc_op, TCGOpcode opc_ld, TCGOpcode opc_st)
+{
+ TCGArg t0 = tcg_temp_new_internal(type, 0);
+ TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
+ uint32_t i;
+
+ if (aofs == bofs) {
+ for (i = 0; i < opsz; i += lnsz) {
+ tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
+ tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t0);
+ tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
+ }
+ } else {
+ TCGArg t1 = tcg_temp_new_internal(type, 0);
+ for (i = 0; i < opsz; i += lnsz) {
+ tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
+ tcg_gen_op3(&tcg_ctx, opc_ld, t1, env, bofs + i);
+ tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t1);
+ tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
+ }
+ tcg_temp_free_internal(t1);
+ }
+ tcg_temp_free_internal(t0);
+}
+
void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
{
+ check_size_align(opsz, clsz, dofs | aofs | bofs);
check_overlap_3(dofs, aofs, bofs, clsz);
- if (opsz <= MAX_INLINE) {
- check_size_s(opsz, clsz);
- check_align_s_3(dofs, aofs, bofs);
- if (g->fni8) {
- expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
- } else if (g->fni4) {
- expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
+
+ if (opsz > MAX_UNROLL * 32 || clsz > MAX_UNROLL * 32) {
+ goto do_ool;
+ }
+
+ /* Recall that ARM SVE allows vector sizes that are not a power of 2.
+ Expand with successively smaller host vector sizes. The intent is
+ that e.g. opsz == 80 would be expanded with 2x32 + 1x16. */
+ /* ??? For clsz > opsz, the host may be able to use an op-sized
+ operation, zeroing the balance of the register. We can then
+ use a cl-sized store to implement the clearing without an extra
+ store operation. This is true for aarch64 and x86_64 hosts. */
+
+ if (check_size_impl(opsz, 32) && tcg_op_supported(g->op_v256)) {
+ uint32_t done = QEMU_ALIGN_DOWN(opsz, 32);
+ expand_3_v(dofs, aofs, bofs, done, 32, TCG_TYPE_V256,
+ g->op_v256, INDEX_op_ld_v256, INDEX_op_st_v256);
+ dofs += done;
+ aofs += done;
+ bofs += done;
+ opsz -= done;
+ clsz -= done;
+ }
+
+ if (check_size_impl(opsz, 16) && tcg_op_supported(g->op_v128)) {
+ uint32_t done = QEMU_ALIGN_DOWN(opsz, 16);
+ expand_3_v(dofs, aofs, bofs, done, 16, TCG_TYPE_V128,
+ g->op_v128, INDEX_op_ld_v128, INDEX_op_st_v128);
+ dofs += done;
+ aofs += done;
+ bofs += done;
+ opsz -= done;
+ clsz -= done;
+ }
+
+ if (check_size_impl(opsz, 8)) {
+ uint32_t done = QEMU_ALIGN_DOWN(opsz, 8);
+ if (tcg_op_supported(g->op_v64)) {
+ expand_3_v(dofs, aofs, bofs, done, 8, TCG_TYPE_V64,
+ g->op_v64, INDEX_op_ld_v64, INDEX_op_st_v64);
+ } else if (g->fni8) {
+ expand_3x8(dofs, aofs, bofs, done, g->fni8);
} else if (g->fni8x) {
- expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
+ expand_3x8p1(dofs, aofs, bofs, done, g->extra_value, g->fni8x);
} else {
- g_assert_not_reached();
+ done = 0;
}
- expand_clr(dofs, opsz, clsz);
- } else {
- check_size_l(opsz, clsz);
- check_align_l_3(dofs, aofs, bofs);
- expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
+ dofs += done;
+ aofs += done;
+ bofs += done;
+ opsz -= done;
+ clsz -= done;
}
+
+ if (check_size_impl(opsz, 4)) {
+ uint32_t done = QEMU_ALIGN_DOWN(opsz, 4);
+ expand_3x4(dofs, aofs, bofs, done, g->fni4);
+ dofs += done;
+ aofs += done;
+ bofs += done;
+ opsz -= done;
+ clsz -= done;
+ }
+
+ if (opsz == 0) {
+ if (clsz != 0) {
+ expand_clr(dofs, clsz);
+ }
+ return;
+ }
+
+ do_ool:
+ expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
}
static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
@@ -240,6 +343,9 @@ void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
static const GVecGen3 g = {
.extra_value = REP8(0x80),
.fni8x = gen_addv_mask,
+ .op_v256 = INDEX_op_add8_v256,
+ .op_v128 = INDEX_op_add8_v128,
+ .op_v64 = INDEX_op_add8_v64,
.fno = gen_helper_gvec_add8,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -251,6 +357,9 @@ void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
static const GVecGen3 g = {
.extra_value = REP16(0x8000),
.fni8x = gen_addv_mask,
+ .op_v256 = INDEX_op_add16_v256,
+ .op_v128 = INDEX_op_add16_v128,
+ .op_v64 = INDEX_op_add16_v64,
.fno = gen_helper_gvec_add16,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -261,6 +370,9 @@ void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni4 = tcg_gen_add_i32,
+ .op_v256 = INDEX_op_add32_v256,
+ .op_v128 = INDEX_op_add32_v128,
+ .op_v64 = INDEX_op_add32_v64,
.fno = gen_helper_gvec_add32,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -271,6 +383,8 @@ void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni8 = tcg_gen_add_i64,
+ .op_v256 = INDEX_op_add64_v256,
+ .op_v128 = INDEX_op_add64_v128,
.fno = gen_helper_gvec_add64,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -328,6 +442,9 @@ void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
static const GVecGen3 g = {
.extra_value = REP8(0x80),
.fni8x = gen_subv_mask,
+ .op_v256 = INDEX_op_sub8_v256,
+ .op_v128 = INDEX_op_sub8_v128,
+ .op_v64 = INDEX_op_sub8_v64,
.fno = gen_helper_gvec_sub8,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -339,6 +456,9 @@ void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
static const GVecGen3 g = {
.extra_value = REP16(0x8000),
.fni8x = gen_subv_mask,
+ .op_v256 = INDEX_op_sub16_v256,
+ .op_v128 = INDEX_op_sub16_v128,
+ .op_v64 = INDEX_op_sub16_v64,
.fno = gen_helper_gvec_sub16,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -349,6 +469,9 @@ void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni4 = tcg_gen_sub_i32,
+ .op_v256 = INDEX_op_sub32_v256,
+ .op_v128 = INDEX_op_sub32_v128,
+ .op_v64 = INDEX_op_sub32_v64,
.fno = gen_helper_gvec_sub32,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -359,6 +482,8 @@ void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni8 = tcg_gen_sub_i64,
+ .op_v256 = INDEX_op_sub64_v256,
+ .op_v128 = INDEX_op_sub64_v128,
.fno = gen_helper_gvec_sub64,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -397,6 +522,9 @@ void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni8 = tcg_gen_and_i64,
+ .op_v256 = INDEX_op_and_v256,
+ .op_v128 = INDEX_op_and_v128,
+ .op_v64 = INDEX_op_and_v64,
.fno = gen_helper_gvec_and8,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -407,6 +535,9 @@ void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni8 = tcg_gen_or_i64,
+ .op_v256 = INDEX_op_or_v256,
+ .op_v128 = INDEX_op_or_v128,
+ .op_v64 = INDEX_op_or_v64,
.fno = gen_helper_gvec_or8,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -417,6 +548,9 @@ void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni8 = tcg_gen_xor_i64,
+ .op_v256 = INDEX_op_xor_v256,
+ .op_v128 = INDEX_op_xor_v128,
+ .op_v64 = INDEX_op_xor_v64,
.fno = gen_helper_gvec_xor8,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -427,6 +561,9 @@ void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni8 = tcg_gen_andc_i64,
+ .op_v256 = INDEX_op_andc_v256,
+ .op_v128 = INDEX_op_andc_v128,
+ .op_v64 = INDEX_op_andc_v64,
.fno = gen_helper_gvec_andc8,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
@@ -437,6 +574,9 @@ void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
{
static const GVecGen3 g = {
.fni8 = tcg_gen_orc_i64,
+ .op_v256 = INDEX_op_orc_v256,
+ .op_v128 = INDEX_op_orc_v128,
+ .op_v64 = INDEX_op_orc_v64,
.fno = gen_helper_gvec_orc8,
};
tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 879b29e81f..86eb4214b0 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -604,7 +604,7 @@ int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
return temp_idx(s, ts);
}
-static int tcg_temp_new_internal(TCGType type, int temp_local)
+int tcg_temp_new_internal(TCGType type, bool temp_local)
{
TCGContext *s = &tcg_ctx;
TCGTemp *ts;
@@ -650,7 +650,7 @@ static int tcg_temp_new_internal(TCGType type, int temp_local)
return idx;
}
-TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
+TCGv_i32 tcg_temp_new_internal_i32(bool temp_local)
{
int idx;
@@ -658,7 +658,7 @@ TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
return MAKE_TCGV_I32(idx);
}
-TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
+TCGv_i64 tcg_temp_new_internal_i64(bool temp_local)
{
int idx;
@@ -666,7 +666,7 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
return MAKE_TCGV_I64(idx);
}
-static void tcg_temp_free_internal(int idx)
+void tcg_temp_free_internal(int idx)
{
TCGContext *s = &tcg_ctx;
TCGTemp *ts;
--
2.13.5
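The decomposition performed by tcg_gen_gvec_3 above can be checked in isolation. A standalone sketch under the same MAX_UNROLL == 4 limit (hypothetical driver, not QEMU code), reproducing the comment's example that opsz == 80 splits into 2x32 + 1x16:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_UNROLL 4
    #define ALIGN_DOWN(n, m) ((n) / (m) * (m))

    /* Same test as check_size_impl in tcg-op-gvec.c. */
    static bool check_size_impl(uint32_t opsz, uint32_t lnsz)
    {
        uint32_t lnct = opsz / lnsz;
        return lnct >= 1 && lnct <= MAX_UNROLL;
    }

    int main(void)
    {
        uint32_t opsz = 80;
        static const uint32_t lanes[] = { 32, 16, 8, 4 };

        for (int i = 0; i < 4 && opsz != 0; i++) {
            if (check_size_impl(opsz, lanes[i])) {
                uint32_t done = ALIGN_DOWN(opsz, lanes[i]);
                printf("%u bytes as %u x v%u\n",
                       done, done / lanes[i], lanes[i] * 8);
                opsz -= done;
            }
        }
        return 0;
    }

This prints "64 bytes as 2 x v256" followed by "16 bytes as 1 x v128", matching the intent stated in the comment.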
* [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
` (6 preceding siblings ...)
2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
@ 2017-08-17 23:01 ` Richard Henderson
2017-08-22 13:15 ` Alex Bennée
2017-09-08 10:13 ` Alex Bennée
2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
8 siblings, 2 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-17 23:01 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm, alex.bennee
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/i386/tcg-target.h | 46 +++++-
tcg/tcg-opc.h | 12 +-
tcg/i386/tcg-target.inc.c | 382 ++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 399 insertions(+), 41 deletions(-)
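One detail worth decoding up front: the new 'x' constraint below fills its register set with tcg_regset_set32(ct->u.regs, 0, 0xff0000), i.e. host registers 16..23, which is exactly where the patch appends TCG_REG_XMM0..TCG_REG_XMM7 to the register file. A standalone check of that mask (demo only, not QEMU code):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t mask = 0;
        int reg;

        for (reg = 16; reg < 24; reg++) {   /* TCG_REG_XMM0 .. TCG_REG_XMM7 */
            mask |= UINT32_C(1) << reg;
        }
        printf("0x%x\n", mask);             /* prints 0xff0000 */
        return 0;
    }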
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index e512648c95..147f82062b 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -30,11 +30,10 @@
#ifdef __x86_64__
# define TCG_TARGET_REG_BITS 64
-# define TCG_TARGET_NB_REGS 16
#else
# define TCG_TARGET_REG_BITS 32
-# define TCG_TARGET_NB_REGS 8
#endif
+# define TCG_TARGET_NB_REGS 24
typedef enum {
TCG_REG_EAX = 0,
@@ -56,6 +55,19 @@ typedef enum {
TCG_REG_R13,
TCG_REG_R14,
TCG_REG_R15,
+
+ /* SSE registers; 64-bit has access to 8 more, but we won't
+ need more than a few and using only the first 8 minimizes
+ the need for a rex prefix on the sse instructions. */
+ TCG_REG_XMM0,
+ TCG_REG_XMM1,
+ TCG_REG_XMM2,
+ TCG_REG_XMM3,
+ TCG_REG_XMM4,
+ TCG_REG_XMM5,
+ TCG_REG_XMM6,
+ TCG_REG_XMM7,
+
TCG_REG_RAX = TCG_REG_EAX,
TCG_REG_RCX = TCG_REG_ECX,
TCG_REG_RDX = TCG_REG_EDX,
@@ -79,6 +91,17 @@ extern bool have_bmi1;
extern bool have_bmi2;
extern bool have_popcnt;
+#ifdef __SSE2__
+#define have_sse2 true
+#else
+extern bool have_sse2;
+#endif
+#ifdef __AVX2__
+#define have_avx2 true
+#else
+extern bool have_avx2;
+#endif
+
/* optional instructions */
#define TCG_TARGET_HAS_div2_i32 1
#define TCG_TARGET_HAS_rot_i32 1
@@ -147,6 +170,25 @@ extern bool have_popcnt;
#define TCG_TARGET_HAS_mulsh_i64 0
#endif
+#define TCG_TARGET_HAS_v64 have_sse2
+#define TCG_TARGET_HAS_v128 have_sse2
+#define TCG_TARGET_HAS_v256 have_avx2
+
+#define TCG_TARGET_HAS_andc_v64 TCG_TARGET_HAS_v64
+#define TCG_TARGET_HAS_orc_v64 0
+#define TCG_TARGET_HAS_not_v64 0
+#define TCG_TARGET_HAS_neg_v64 0
+
+#define TCG_TARGET_HAS_andc_v128 TCG_TARGET_HAS_v128
+#define TCG_TARGET_HAS_orc_v128 0
+#define TCG_TARGET_HAS_not_v128 0
+#define TCG_TARGET_HAS_neg_v128 0
+
+#define TCG_TARGET_HAS_andc_v256 TCG_TARGET_HAS_v256
+#define TCG_TARGET_HAS_orc_v256 0
+#define TCG_TARGET_HAS_not_v256 0
+#define TCG_TARGET_HAS_neg_v256 0
+
#define TCG_TARGET_deposit_i32_valid(ofs, len) \
(have_bmi2 || \
((ofs) == 0 && (len) == 8) || \
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index b1445a4c24..b84cd584fb 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -212,13 +212,13 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
/* Host integer vector operations. */
/* These opcodes are required whenever the base vector size is enabled. */
-DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
-DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
-DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
+DEF(mov_v64, 1, 1, 0, TCG_OPF_NOT_PRESENT)
+DEF(mov_v128, 1, 1, 0, TCG_OPF_NOT_PRESENT)
+DEF(mov_v256, 1, 1, 0, TCG_OPF_NOT_PRESENT)
-DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
-DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
-DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
+DEF(movi_v64, 1, 0, 1, TCG_OPF_NOT_PRESENT)
+DEF(movi_v128, 1, 0, 1, TCG_OPF_NOT_PRESENT)
+DEF(movi_v256, 1, 0, 1, TCG_OPF_NOT_PRESENT)
DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index aeefb72aa0..0e01b54aa0 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -31,7 +31,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
"%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15",
#else
"%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
+ NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
#endif
+ "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
};
#endif
@@ -61,6 +63,14 @@ static const int tcg_target_reg_alloc_order[] = {
TCG_REG_EDX,
TCG_REG_EAX,
#endif
+ TCG_REG_XMM0,
+ TCG_REG_XMM1,
+ TCG_REG_XMM2,
+ TCG_REG_XMM3,
+ TCG_REG_XMM4,
+ TCG_REG_XMM5,
+ TCG_REG_XMM6,
+ TCG_REG_XMM7,
};
static const int tcg_target_call_iarg_regs[] = {
@@ -94,7 +104,7 @@ static const int tcg_target_call_oarg_regs[] = {
#define TCG_CT_CONST_I32 0x400
#define TCG_CT_CONST_WSZ 0x800
-/* Registers used with L constraint, which are the first argument
+/* Registers used with L constraint, which are the first argument
registers on x86_64, and two random call clobbered registers on
i386. */
#if TCG_TARGET_REG_BITS == 64
@@ -127,6 +137,16 @@ bool have_bmi1;
bool have_bmi2;
bool have_popcnt;
+#ifndef have_sse2
+bool have_sse2;
+#endif
+#ifdef have_avx2
+#define have_avx1 have_avx2
+#else
+static bool have_avx1;
+bool have_avx2;
+#endif
+
#ifdef CONFIG_CPUID_H
static bool have_movbe;
static bool have_lzcnt;
@@ -215,6 +235,10 @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
/* With TZCNT/LZCNT, we can have operand-size as an input. */
ct->ct |= TCG_CT_CONST_WSZ;
break;
+ case 'x':
+ ct->ct |= TCG_CT_REG;
+ tcg_regset_set32(ct->u.regs, 0, 0xff0000);
+ break;
/* qemu_ld/st address constraint */
case 'L':
@@ -292,6 +316,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
#endif
#define P_SIMDF3 0x20000 /* 0xf3 opcode prefix */
#define P_SIMDF2 0x40000 /* 0xf2 opcode prefix */
+#define P_VEXL 0x80000 /* Set VEX.L = 1 */
#define OPC_ARITH_EvIz (0x81)
#define OPC_ARITH_EvIb (0x83)
@@ -324,13 +349,31 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
#define OPC_MOVL_Iv (0xb8)
#define OPC_MOVBE_GyMy (0xf0 | P_EXT38)
#define OPC_MOVBE_MyGy (0xf1 | P_EXT38)
+#define OPC_MOVDQA_GyMy (0x6f | P_EXT | P_DATA16)
+#define OPC_MOVDQA_MyGy (0x7f | P_EXT | P_DATA16)
+#define OPC_MOVDQU_GyMy (0x6f | P_EXT | P_SIMDF3)
+#define OPC_MOVDQU_MyGy (0x7f | P_EXT | P_SIMDF3)
+#define OPC_MOVQ_GyMy (0x7e | P_EXT | P_SIMDF3)
+#define OPC_MOVQ_MyGy (0xd6 | P_EXT | P_DATA16)
#define OPC_MOVSBL (0xbe | P_EXT)
#define OPC_MOVSWL (0xbf | P_EXT)
#define OPC_MOVSLQ (0x63 | P_REXW)
#define OPC_MOVZBL (0xb6 | P_EXT)
#define OPC_MOVZWL (0xb7 | P_EXT)
+#define OPC_PADDB (0xfc | P_EXT | P_DATA16)
+#define OPC_PADDW (0xfd | P_EXT | P_DATA16)
+#define OPC_PADDD (0xfe | P_EXT | P_DATA16)
+#define OPC_PADDQ (0xd4 | P_EXT | P_DATA16)
+#define OPC_PAND (0xdb | P_EXT | P_DATA16)
+#define OPC_PANDN (0xdf | P_EXT | P_DATA16)
#define OPC_PDEP (0xf5 | P_EXT38 | P_SIMDF2)
#define OPC_PEXT (0xf5 | P_EXT38 | P_SIMDF3)
+#define OPC_POR (0xeb | P_EXT | P_DATA16)
+#define OPC_PSUBB (0xf8 | P_EXT | P_DATA16)
+#define OPC_PSUBW (0xf9 | P_EXT | P_DATA16)
+#define OPC_PSUBD (0xfa | P_EXT | P_DATA16)
+#define OPC_PSUBQ (0xfb | P_EXT | P_DATA16)
+#define OPC_PXOR (0xef | P_EXT | P_DATA16)
#define OPC_POP_r32 (0x58)
#define OPC_POPCNT (0xb8 | P_EXT | P_SIMDF3)
#define OPC_PUSH_r32 (0x50)
@@ -500,7 +543,8 @@ static void tcg_out_modrm(TCGContext *s, int opc, int r, int rm)
tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
}
-static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
+static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v,
+ int rm, int index)
{
int tmp;
@@ -515,14 +559,16 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
} else if (opc & P_EXT) {
tmp = 1;
} else {
- tcg_abort();
+ g_assert_not_reached();
}
- tmp |= 0x40; /* VEX.X */
tmp |= (r & 8 ? 0 : 0x80); /* VEX.R */
+ tmp |= (index & 8 ? 0 : 0x40); /* VEX.X */
tmp |= (rm & 8 ? 0 : 0x20); /* VEX.B */
tcg_out8(s, tmp);
tmp = (opc & P_REXW ? 0x80 : 0); /* VEX.W */
+ tmp |= (opc & P_VEXL ? 0x04 : 0); /* VEX.L */
+
/* VEX.pp */
if (opc & P_DATA16) {
tmp |= 1; /* 0x66 */
@@ -538,7 +584,7 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
static void tcg_out_vex_modrm(TCGContext *s, int opc, int r, int v, int rm)
{
- tcg_out_vex_pfx_opc(s, opc, r, v, rm);
+ tcg_out_vex_pfx_opc(s, opc, r, v, rm, 0);
tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
}
@@ -565,7 +611,7 @@ static void tcg_out_opc_pool_imm(TCGContext *s, int opc, int r,
static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
tcg_target_ulong data)
{
- tcg_out_vex_pfx_opc(s, opc, r, v, 0);
+ tcg_out_vex_pfx_opc(s, opc, r, v, 0, 0);
tcg_out_sfx_pool_imm(s, r, data);
}
@@ -574,8 +620,8 @@ static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
mode for absolute addresses, ~RM is the size of the immediate operand
that will follow the instruction. */
-static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
- int index, int shift, intptr_t offset)
+static void tcg_out_sib_offset(TCGContext *s, int r, int rm, int index,
+ int shift, intptr_t offset)
{
int mod, len;
@@ -586,7 +632,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
intptr_t pc = (intptr_t)s->code_ptr + 5 + ~rm;
intptr_t disp = offset - pc;
if (disp == (int32_t)disp) {
- tcg_out_opc(s, opc, r, 0, 0);
tcg_out8(s, (LOWREGMASK(r) << 3) | 5);
tcg_out32(s, disp);
return;
@@ -596,7 +641,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
use of the MODRM+SIB encoding and is therefore larger than
rip-relative addressing. */
if (offset == (int32_t)offset) {
- tcg_out_opc(s, opc, r, 0, 0);
tcg_out8(s, (LOWREGMASK(r) << 3) | 4);
tcg_out8(s, (4 << 3) | 5);
tcg_out32(s, offset);
@@ -604,10 +648,9 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
}
/* ??? The memory isn't directly addressable. */
- tcg_abort();
+ g_assert_not_reached();
} else {
/* Absolute address. */
- tcg_out_opc(s, opc, r, 0, 0);
tcg_out8(s, (r << 3) | 5);
tcg_out32(s, offset);
return;
@@ -630,7 +673,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
that would be used for %esp is the escape to the two byte form. */
if (index < 0 && LOWREGMASK(rm) != TCG_REG_ESP) {
/* Single byte MODRM format. */
- tcg_out_opc(s, opc, r, rm, 0);
tcg_out8(s, mod | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
} else {
/* Two byte MODRM+SIB format. */
@@ -644,7 +686,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
tcg_debug_assert(index != TCG_REG_ESP);
}
- tcg_out_opc(s, opc, r, rm, index);
tcg_out8(s, mod | (LOWREGMASK(r) << 3) | 4);
tcg_out8(s, (shift << 6) | (LOWREGMASK(index) << 3) | LOWREGMASK(rm));
}
@@ -656,6 +697,21 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
}
}
+static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
+ int index, int shift, intptr_t offset)
+{
+ tcg_out_opc(s, opc, r, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
+ tcg_out_sib_offset(s, r, rm, index, shift, offset);
+}
+
+static void tcg_out_vex_modrm_sib_offset(TCGContext *s, int opc, int r, int v,
+ int rm, int index, int shift,
+ intptr_t offset)
+{
+ tcg_out_vex_pfx_opc(s, opc, r, v, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
+ tcg_out_sib_offset(s, r, rm, index, shift, offset);
+}
+
/* A simplification of the above with no index or shift. */
static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
int rm, intptr_t offset)
@@ -663,6 +719,31 @@ static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
tcg_out_modrm_sib_offset(s, opc, r, rm, -1, 0, offset);
}
+static inline void tcg_out_vex_modrm_offset(TCGContext *s, int opc, int r,
+ int v, int rm, intptr_t offset)
+{
+ tcg_out_vex_modrm_sib_offset(s, opc, r, v, rm, -1, 0, offset);
+}
+
+static void tcg_out_maybe_vex_modrm(TCGContext *s, int opc, int r, int rm)
+{
+ if (have_avx1) {
+ tcg_out_vex_modrm(s, opc, r, 0, rm);
+ } else {
+ tcg_out_modrm(s, opc, r, rm);
+ }
+}
+
+static void tcg_out_maybe_vex_modrm_offset(TCGContext *s, int opc, int r,
+ int rm, intptr_t offset)
+{
+ if (have_avx1) {
+ tcg_out_vex_modrm_offset(s, opc, r, 0, rm, offset);
+ } else {
+ tcg_out_modrm_offset(s, opc, r, rm, offset);
+ }
+}
+
/* Generate dest op= src. Uses the same ARITH_* codes as tgen_arithi. */
static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
{
@@ -673,12 +754,32 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
tcg_out_modrm(s, OPC_ARITH_GvEv + (subop << 3) + ext, dest, src);
}
-static inline void tcg_out_mov(TCGContext *s, TCGType type,
- TCGReg ret, TCGReg arg)
+static void tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
{
if (arg != ret) {
- int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
- tcg_out_modrm(s, opc, ret, arg);
+ int opc = 0;
+
+ switch (type) {
+ case TCG_TYPE_I64:
+ opc = P_REXW;
+ /* fallthru */
+ case TCG_TYPE_I32:
+ opc |= OPC_MOVL_GvEv;
+ tcg_out_modrm(s, opc, ret, arg);
+ break;
+
+ case TCG_TYPE_V256:
+ opc = P_VEXL;
+ /* fallthru */
+ case TCG_TYPE_V128:
+ case TCG_TYPE_V64:
+ opc |= OPC_MOVDQA_GyMy;
+ tcg_out_maybe_vex_modrm(s, opc, ret, arg);
+ break;
+
+ default:
+ g_assert_not_reached();
+ }
}
}
@@ -687,6 +788,27 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
{
tcg_target_long diff;
+ switch (type) {
+ case TCG_TYPE_I32:
+ case TCG_TYPE_I64:
+ break;
+
+ case TCG_TYPE_V64:
+ case TCG_TYPE_V128:
+ case TCG_TYPE_V256:
+ /* ??? Revisit this as the implementation progresses. */
+ tcg_debug_assert(arg == 0);
+ if (have_avx1) {
+ tcg_out_vex_modrm(s, OPC_PXOR, ret, ret, ret);
+ } else {
+ tcg_out_modrm(s, OPC_PXOR, ret, ret);
+ }
+ return;
+
+ default:
+ g_assert_not_reached();
+ }
+
if (arg == 0) {
tgen_arithr(s, ARITH_XOR, ret, ret);
return;
@@ -750,18 +872,54 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
tcg_out_opc(s, OPC_POP_r32 + LOWREGMASK(reg), 0, reg, 0);
}
-static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
- TCGReg arg1, intptr_t arg2)
+static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
+ TCGReg arg1, intptr_t arg2)
{
- int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
- tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
+ switch (type) {
+ case TCG_TYPE_I64:
+ tcg_out_modrm_offset(s, OPC_MOVL_GvEv | P_REXW, ret, arg1, arg2);
+ break;
+ case TCG_TYPE_I32:
+ tcg_out_modrm_offset(s, OPC_MOVL_GvEv, ret, arg1, arg2);
+ break;
+ case TCG_TYPE_V64:
+ tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_GyMy, ret, arg1, arg2);
+ break;
+ case TCG_TYPE_V128:
+ tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_GyMy, ret, arg1, arg2);
+ break;
+ case TCG_TYPE_V256:
+ tcg_out_vex_modrm_offset(s, OPC_MOVDQU_GyMy | P_VEXL,
+ ret, 0, arg1, arg2);
+ break;
+ default:
+ g_assert_not_reached();
+ }
}
-static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
- TCGReg arg1, intptr_t arg2)
+static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
+ TCGReg arg1, intptr_t arg2)
{
- int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
- tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
+ switch (type) {
+ case TCG_TYPE_I64:
+ tcg_out_modrm_offset(s, OPC_MOVL_EvGv | P_REXW, arg, arg1, arg2);
+ break;
+ case TCG_TYPE_I32:
+ tcg_out_modrm_offset(s, OPC_MOVL_EvGv, arg, arg1, arg2);
+ break;
+ case TCG_TYPE_V64:
+ tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_MyGy, arg, arg1, arg2);
+ break;
+ case TCG_TYPE_V128:
+ tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_MyGy, arg, arg1, arg2);
+ break;
+ case TCG_TYPE_V256:
+ tcg_out_vex_modrm_offset(s, OPC_MOVDQU_MyGy | P_VEXL,
+ arg, 0, arg1, arg2);
+ break;
+ default:
+ g_assert_not_reached();
+ }
}
static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
@@ -773,6 +931,8 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
return false;
}
rexw = P_REXW;
+ } else if (type != TCG_TYPE_I32) {
+ return false;
}
tcg_out_modrm_offset(s, OPC_MOVL_EvIz | rexw, 0, base, ofs);
tcg_out32(s, val);
@@ -1914,6 +2074,15 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
case glue(glue(INDEX_op_, x), _i32)
#endif
+#define OP_128_256(x) \
+ case glue(glue(INDEX_op_, x), _v256): \
+ rexw = P_VEXL; /* FALLTHRU */ \
+ case glue(glue(INDEX_op_, x), _v128)
+
+#define OP_64_128_256(x) \
+ OP_128_256(x): \
+ case glue(glue(INDEX_op_, x), _v64)
+
/* Hoist the loads of the most common arguments. */
a0 = args[0];
a1 = args[1];
@@ -2379,19 +2548,94 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
}
break;
+ OP_64_128_256(add8):
+ c = OPC_PADDB;
+ goto gen_simd;
+ OP_64_128_256(add16):
+ c = OPC_PADDW;
+ goto gen_simd;
+ OP_64_128_256(add32):
+ c = OPC_PADDD;
+ goto gen_simd;
+ OP_128_256(add64):
+ c = OPC_PADDQ;
+ goto gen_simd;
+ OP_64_128_256(sub8):
+ c = OPC_PSUBB;
+ goto gen_simd;
+ OP_64_128_256(sub16):
+ c = OPC_PSUBW;
+ goto gen_simd;
+ OP_64_128_256(sub32):
+ c = OPC_PSUBD;
+ goto gen_simd;
+ OP_128_256(sub64):
+ c = OPC_PSUBQ;
+ goto gen_simd;
+ OP_64_128_256(and):
+ c = OPC_PAND;
+ goto gen_simd;
+ OP_64_128_256(andc):
+ c = OPC_PANDN;
+ goto gen_simd;
+ OP_64_128_256(or):
+ c = OPC_POR;
+ goto gen_simd;
+ OP_64_128_256(xor):
+ c = OPC_PXOR;
+ gen_simd:
+ if (have_avx1) {
+ tcg_out_vex_modrm(s, c, a0, a1, a2);
+ } else {
+ tcg_out_modrm(s, c, a0, a2);
+ }
+ break;
+
+ case INDEX_op_ld_v64:
+ c = TCG_TYPE_V64;
+ goto gen_simd_ld;
+ case INDEX_op_ld_v128:
+ c = TCG_TYPE_V128;
+ goto gen_simd_ld;
+ case INDEX_op_ld_v256:
+ c = TCG_TYPE_V256;
+ gen_simd_ld:
+ tcg_out_ld(s, c, a0, a1, a2);
+ break;
+
+ case INDEX_op_st_v64:
+ c = TCG_TYPE_V64;
+ goto gen_simd_st;
+ case INDEX_op_st_v128:
+ c = TCG_TYPE_V128;
+ goto gen_simd_st;
+ case INDEX_op_st_v256:
+ c = TCG_TYPE_V256;
+ gen_simd_st:
+ tcg_out_st(s, c, a0, a1, a2);
+ break;
+
case INDEX_op_mb:
tcg_out_mb(s, a0);
break;
case INDEX_op_mov_i32: /* Always emitted via tcg_out_mov. */
case INDEX_op_mov_i64:
+ case INDEX_op_mov_v64:
+ case INDEX_op_mov_v128:
+ case INDEX_op_mov_v256:
case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi. */
case INDEX_op_movi_i64:
+ case INDEX_op_movi_v64:
+ case INDEX_op_movi_v128:
+ case INDEX_op_movi_v256:
case INDEX_op_call: /* Always emitted via tcg_out_call. */
default:
tcg_abort();
}
#undef OP_32_64
+#undef OP_128_256
+#undef OP_64_128_256
}
static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
@@ -2417,6 +2661,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
= { .args_ct_str = { "r", "r", "L", "L" } };
static const TCGTargetOpDef L_L_L_L
= { .args_ct_str = { "L", "L", "L", "L" } };
+ static const TCGTargetOpDef x_0_x = { .args_ct_str = { "x", "0", "x" } };
+ static const TCGTargetOpDef x_x_x = { .args_ct_str = { "x", "x", "x" } };
+ static const TCGTargetOpDef x_r = { .args_ct_str = { "x", "r" } };
switch (op) {
case INDEX_op_goto_ptr:
@@ -2620,6 +2867,52 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
return &s2;
}
+ case INDEX_op_ld_v64:
+ case INDEX_op_ld_v128:
+ case INDEX_op_ld_v256:
+ case INDEX_op_st_v64:
+ case INDEX_op_st_v128:
+ case INDEX_op_st_v256:
+ return &x_r;
+
+ case INDEX_op_add8_v64:
+ case INDEX_op_add8_v128:
+ case INDEX_op_add16_v64:
+ case INDEX_op_add16_v128:
+ case INDEX_op_add32_v64:
+ case INDEX_op_add32_v128:
+ case INDEX_op_add64_v128:
+ case INDEX_op_sub8_v64:
+ case INDEX_op_sub8_v128:
+ case INDEX_op_sub16_v64:
+ case INDEX_op_sub16_v128:
+ case INDEX_op_sub32_v64:
+ case INDEX_op_sub32_v128:
+ case INDEX_op_sub64_v128:
+ case INDEX_op_and_v64:
+ case INDEX_op_and_v128:
+ case INDEX_op_andc_v64:
+ case INDEX_op_andc_v128:
+ case INDEX_op_or_v64:
+ case INDEX_op_or_v128:
+ case INDEX_op_xor_v64:
+ case INDEX_op_xor_v128:
+ return have_avx1 ? &x_x_x : &x_0_x;
+
+ case INDEX_op_add8_v256:
+ case INDEX_op_add16_v256:
+ case INDEX_op_add32_v256:
+ case INDEX_op_add64_v256:
+ case INDEX_op_sub8_v256:
+ case INDEX_op_sub16_v256:
+ case INDEX_op_sub32_v256:
+ case INDEX_op_sub64_v256:
+ case INDEX_op_and_v256:
+ case INDEX_op_andc_v256:
+ case INDEX_op_or_v256:
+ case INDEX_op_xor_v256:
+ return &x_x_x;
+
default:
break;
}
@@ -2725,9 +3018,16 @@ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
static void tcg_target_init(TCGContext *s)
{
#ifdef CONFIG_CPUID_H
- unsigned a, b, c, d;
+ unsigned a, b, c, d, b7 = 0;
int max = __get_cpuid_max(0, 0);
+ if (max >= 7) {
+ /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs. */
+ __cpuid_count(7, 0, a, b7, c, d);
+ have_bmi1 = (b7 & bit_BMI) != 0;
+ have_bmi2 = (b7 & bit_BMI2) != 0;
+ }
+
if (max >= 1) {
__cpuid(1, a, b, c, d);
#ifndef have_cmov
@@ -2736,17 +3036,26 @@ static void tcg_target_init(TCGContext *s)
available, we'll use a small forward branch. */
have_cmov = (d & bit_CMOV) != 0;
#endif
+#ifndef have_sse2
+ have_sse2 = (d & bit_SSE2) != 0;
+#endif
/* MOVBE is only available on Intel Atom and Haswell CPUs, so we
need to probe for it. */
have_movbe = (c & bit_MOVBE) != 0;
have_popcnt = (c & bit_POPCNT) != 0;
- }
- if (max >= 7) {
- /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs. */
- __cpuid_count(7, 0, a, b, c, d);
- have_bmi1 = (b & bit_BMI) != 0;
- have_bmi2 = (b & bit_BMI2) != 0;
+#ifndef have_avx2
+ /* There are a number of things we must check before we can be
+ sure of not hitting an invalid-opcode fault. */
+ if (c & bit_OSXSAVE) {
+ unsigned xcrl, xcrh;
+ asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
+ if ((xcrl & 6) == 6) {
+ have_avx1 = (c & bit_AVX) != 0;
+ have_avx2 = (b7 & bit_AVX2) != 0;
+ }
+ }
+#endif
}
max = __get_cpuid_max(0x80000000, 0);
@@ -2763,6 +3072,13 @@ static void tcg_target_init(TCGContext *s)
} else {
tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xff);
}
+ if (have_sse2) {
+ tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V64], 0, 0xff0000);
+ tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V128], 0, 0xff0000);
+ }
+ if (have_avx2) {
+ tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V256], 0, 0xff0000);
+ }
tcg_regset_clear(tcg_target_call_clobber_regs);
tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_EAX);
--
2.13.5
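The xgetbv sequence in tcg_target_init above is the standard OS-support gate for AVX: bits 1 and 2 of XCR0 must both be set, meaning the kernel saves and restores XMM and YMM state on context switch. A standalone sketch of the same probe, assuming an x86 host with GCC or Clang (not QEMU code):

    #include <stdio.h>
    #include <stdbool.h>
    #include <cpuid.h>

    static bool os_supports_avx(void)
    {
        unsigned a, b, c, d;
        unsigned xcrl, xcrh;

        if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & bit_OSXSAVE)) {
            return false;   /* no xgetbv, or OS did not enable it */
        }
        asm("xgetbv" : "=a"(xcrl), "=d"(xcrh) : "c"(0));
        return (xcrl & 6) == 6;   /* XMM and YMM state both managed */
    }

    int main(void)
    {
        printf("AVX usable: %d\n", os_supports_avx());
        return 0;
    }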
* Re: [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported
2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
@ 2017-08-17 23:44 ` Philippe Mathieu-Daudé
2017-09-07 19:02 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-17 23:44 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee
On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
> ---
> tcg/tcg.h | 2 +
> tcg/tcg.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 312 insertions(+)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index b9e15da13b..b443143b21 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -962,6 +962,8 @@ do {\
> #define tcg_temp_free_ptr(T) tcg_temp_free_i64(TCGV_PTR_TO_NAT(T))
> #endif
>
> +bool tcg_op_supported(TCGOpcode op);
> +
> void tcg_gen_callN(TCGContext *s, void *func,
> TCGArg ret, int nargs, TCGArg *args);
>
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index ea78d47fad..3c3cdda938 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -751,6 +751,316 @@ int tcg_check_temp_count(void)
> }
> #endif
>
> +/* Return true if OP may appear in the opcode stream.
> + Test the runtime variable that controls each opcode. */
> +bool tcg_op_supported(TCGOpcode op)
> +{
> + switch (op) {
> + case INDEX_op_discard:
> + case INDEX_op_set_label:
> + case INDEX_op_call:
> + case INDEX_op_br:
> + case INDEX_op_mb:
> + case INDEX_op_insn_start:
> + case INDEX_op_exit_tb:
> + case INDEX_op_goto_tb:
> + case INDEX_op_qemu_ld_i32:
> + case INDEX_op_qemu_st_i32:
> + case INDEX_op_qemu_ld_i64:
> + case INDEX_op_qemu_st_i64:
> + return true;
> +
> + case INDEX_op_goto_ptr:
> + return TCG_TARGET_HAS_goto_ptr;
> +
> + case INDEX_op_mov_i32:
> + case INDEX_op_movi_i32:
> + case INDEX_op_setcond_i32:
> + case INDEX_op_brcond_i32:
> + case INDEX_op_ld8u_i32:
> + case INDEX_op_ld8s_i32:
> + case INDEX_op_ld16u_i32:
> + case INDEX_op_ld16s_i32:
> + case INDEX_op_ld_i32:
> + case INDEX_op_st8_i32:
> + case INDEX_op_st16_i32:
> + case INDEX_op_st_i32:
> + case INDEX_op_add_i32:
> + case INDEX_op_sub_i32:
> + case INDEX_op_mul_i32:
> + case INDEX_op_and_i32:
> + case INDEX_op_or_i32:
> + case INDEX_op_xor_i32:
> + case INDEX_op_shl_i32:
> + case INDEX_op_shr_i32:
> + case INDEX_op_sar_i32:
> + return true;
> +
> + case INDEX_op_movcond_i32:
> + return TCG_TARGET_HAS_movcond_i32;
> + case INDEX_op_div_i32:
> + case INDEX_op_divu_i32:
> + return TCG_TARGET_HAS_div_i32;
> + case INDEX_op_rem_i32:
> + case INDEX_op_remu_i32:
> + return TCG_TARGET_HAS_rem_i32;
> + case INDEX_op_div2_i32:
> + case INDEX_op_divu2_i32:
> + return TCG_TARGET_HAS_div2_i32;
> + case INDEX_op_rotl_i32:
> + case INDEX_op_rotr_i32:
> + return TCG_TARGET_HAS_rot_i32;
> + case INDEX_op_deposit_i32:
> + return TCG_TARGET_HAS_deposit_i32;
> + case INDEX_op_extract_i32:
> + return TCG_TARGET_HAS_extract_i32;
> + case INDEX_op_sextract_i32:
> + return TCG_TARGET_HAS_sextract_i32;
> + case INDEX_op_add2_i32:
> + return TCG_TARGET_HAS_add2_i32;
> + case INDEX_op_sub2_i32:
> + return TCG_TARGET_HAS_sub2_i32;
> + case INDEX_op_mulu2_i32:
> + return TCG_TARGET_HAS_mulu2_i32;
> + case INDEX_op_muls2_i32:
> + return TCG_TARGET_HAS_muls2_i32;
> + case INDEX_op_muluh_i32:
> + return TCG_TARGET_HAS_muluh_i32;
> + case INDEX_op_mulsh_i32:
> + return TCG_TARGET_HAS_mulsh_i32;
> + case INDEX_op_ext8s_i32:
> + return TCG_TARGET_HAS_ext8s_i32;
> + case INDEX_op_ext16s_i32:
> + return TCG_TARGET_HAS_ext16s_i32;
> + case INDEX_op_ext8u_i32:
> + return TCG_TARGET_HAS_ext8u_i32;
> + case INDEX_op_ext16u_i32:
> + return TCG_TARGET_HAS_ext16u_i32;
> + case INDEX_op_bswap16_i32:
> + return TCG_TARGET_HAS_bswap16_i32;
> + case INDEX_op_bswap32_i32:
> + return TCG_TARGET_HAS_bswap32_i32;
> + case INDEX_op_not_i32:
> + return TCG_TARGET_HAS_not_i32;
> + case INDEX_op_neg_i32:
> + return TCG_TARGET_HAS_neg_i32;
> + case INDEX_op_andc_i32:
> + return TCG_TARGET_HAS_andc_i32;
> + case INDEX_op_orc_i32:
> + return TCG_TARGET_HAS_orc_i32;
> + case INDEX_op_eqv_i32:
> + return TCG_TARGET_HAS_eqv_i32;
> + case INDEX_op_nand_i32:
> + return TCG_TARGET_HAS_nand_i32;
> + case INDEX_op_nor_i32:
> + return TCG_TARGET_HAS_nor_i32;
> + case INDEX_op_clz_i32:
> + return TCG_TARGET_HAS_clz_i32;
> + case INDEX_op_ctz_i32:
> + return TCG_TARGET_HAS_ctz_i32;
> + case INDEX_op_ctpop_i32:
> + return TCG_TARGET_HAS_ctpop_i32;
> +
> + case INDEX_op_brcond2_i32:
> + case INDEX_op_setcond2_i32:
> + return TCG_TARGET_REG_BITS == 32;
> +
> + case INDEX_op_mov_i64:
> + case INDEX_op_movi_i64:
> + case INDEX_op_setcond_i64:
> + case INDEX_op_brcond_i64:
> + case INDEX_op_ld8u_i64:
> + case INDEX_op_ld8s_i64:
> + case INDEX_op_ld16u_i64:
> + case INDEX_op_ld16s_i64:
> + case INDEX_op_ld32u_i64:
> + case INDEX_op_ld32s_i64:
> + case INDEX_op_ld_i64:
> + case INDEX_op_st8_i64:
> + case INDEX_op_st16_i64:
> + case INDEX_op_st32_i64:
> + case INDEX_op_st_i64:
> + case INDEX_op_add_i64:
> + case INDEX_op_sub_i64:
> + case INDEX_op_mul_i64:
> + case INDEX_op_and_i64:
> + case INDEX_op_or_i64:
> + case INDEX_op_xor_i64:
> + case INDEX_op_shl_i64:
> + case INDEX_op_shr_i64:
> + case INDEX_op_sar_i64:
> + case INDEX_op_ext_i32_i64:
> + case INDEX_op_extu_i32_i64:
> + return TCG_TARGET_REG_BITS == 64;
> +
> + case INDEX_op_movcond_i64:
> + return TCG_TARGET_HAS_movcond_i64;
> + case INDEX_op_div_i64:
> + case INDEX_op_divu_i64:
> + return TCG_TARGET_HAS_div_i64;
> + case INDEX_op_rem_i64:
> + case INDEX_op_remu_i64:
> + return TCG_TARGET_HAS_rem_i64;
> + case INDEX_op_div2_i64:
> + case INDEX_op_divu2_i64:
> + return TCG_TARGET_HAS_div2_i64;
> + case INDEX_op_rotl_i64:
> + case INDEX_op_rotr_i64:
> + return TCG_TARGET_HAS_rot_i64;
> + case INDEX_op_deposit_i64:
> + return TCG_TARGET_HAS_deposit_i64;
> + case INDEX_op_extract_i64:
> + return TCG_TARGET_HAS_extract_i64;
> + case INDEX_op_sextract_i64:
> + return TCG_TARGET_HAS_sextract_i64;
> + case INDEX_op_extrl_i64_i32:
> + return TCG_TARGET_HAS_extrl_i64_i32;
> + case INDEX_op_extrh_i64_i32:
> + return TCG_TARGET_HAS_extrh_i64_i32;
> + case INDEX_op_ext8s_i64:
> + return TCG_TARGET_HAS_ext8s_i64;
> + case INDEX_op_ext16s_i64:
> + return TCG_TARGET_HAS_ext16s_i64;
> + case INDEX_op_ext32s_i64:
> + return TCG_TARGET_HAS_ext32s_i64;
> + case INDEX_op_ext8u_i64:
> + return TCG_TARGET_HAS_ext8u_i64;
> + case INDEX_op_ext16u_i64:
> + return TCG_TARGET_HAS_ext16u_i64;
> + case INDEX_op_ext32u_i64:
> + return TCG_TARGET_HAS_ext32u_i64;
> + case INDEX_op_bswap16_i64:
> + return TCG_TARGET_HAS_bswap16_i64;
> + case INDEX_op_bswap32_i64:
> + return TCG_TARGET_HAS_bswap32_i64;
> + case INDEX_op_bswap64_i64:
> + return TCG_TARGET_HAS_bswap64_i64;
> + case INDEX_op_not_i64:
> + return TCG_TARGET_HAS_not_i64;
> + case INDEX_op_neg_i64:
> + return TCG_TARGET_HAS_neg_i64;
> + case INDEX_op_andc_i64:
> + return TCG_TARGET_HAS_andc_i64;
> + case INDEX_op_orc_i64:
> + return TCG_TARGET_HAS_orc_i64;
> + case INDEX_op_eqv_i64:
> + return TCG_TARGET_HAS_eqv_i64;
> + case INDEX_op_nand_i64:
> + return TCG_TARGET_HAS_nand_i64;
> + case INDEX_op_nor_i64:
> + return TCG_TARGET_HAS_nor_i64;
> + case INDEX_op_clz_i64:
> + return TCG_TARGET_HAS_clz_i64;
> + case INDEX_op_ctz_i64:
> + return TCG_TARGET_HAS_ctz_i64;
> + case INDEX_op_ctpop_i64:
> + return TCG_TARGET_HAS_ctpop_i64;
> + case INDEX_op_add2_i64:
> + return TCG_TARGET_HAS_add2_i64;
> + case INDEX_op_sub2_i64:
> + return TCG_TARGET_HAS_sub2_i64;
> + case INDEX_op_mulu2_i64:
> + return TCG_TARGET_HAS_mulu2_i64;
> + case INDEX_op_muls2_i64:
> + return TCG_TARGET_HAS_muls2_i64;
> + case INDEX_op_muluh_i64:
> + return TCG_TARGET_HAS_muluh_i64;
> + case INDEX_op_mulsh_i64:
> + return TCG_TARGET_HAS_mulsh_i64;
> +
> + case INDEX_op_mov_v64:
> + case INDEX_op_movi_v64:
> + case INDEX_op_ld_v64:
> + case INDEX_op_st_v64:
> + case INDEX_op_and_v64:
> + case INDEX_op_or_v64:
> + case INDEX_op_xor_v64:
> + case INDEX_op_add8_v64:
> + case INDEX_op_add16_v64:
> + case INDEX_op_add32_v64:
> + case INDEX_op_sub8_v64:
> + case INDEX_op_sub16_v64:
> + case INDEX_op_sub32_v64:
> + return TCG_TARGET_HAS_v64;
> +
> + case INDEX_op_mov_v128:
> + case INDEX_op_movi_v128:
> + case INDEX_op_ld_v128:
> + case INDEX_op_st_v128:
> + case INDEX_op_and_v128:
> + case INDEX_op_or_v128:
> + case INDEX_op_xor_v128:
> + case INDEX_op_add8_v128:
> + case INDEX_op_add16_v128:
> + case INDEX_op_add32_v128:
> + case INDEX_op_add64_v128:
> + case INDEX_op_sub8_v128:
> + case INDEX_op_sub16_v128:
> + case INDEX_op_sub32_v128:
> + case INDEX_op_sub64_v128:
> + return TCG_TARGET_HAS_v128;
> +
> + case INDEX_op_mov_v256:
> + case INDEX_op_movi_v256:
> + case INDEX_op_ld_v256:
> + case INDEX_op_st_v256:
> + case INDEX_op_and_v256:
> + case INDEX_op_or_v256:
> + case INDEX_op_xor_v256:
> + case INDEX_op_add8_v256:
> + case INDEX_op_add16_v256:
> + case INDEX_op_add32_v256:
> + case INDEX_op_add64_v256:
> + case INDEX_op_sub8_v256:
> + case INDEX_op_sub16_v256:
> + case INDEX_op_sub32_v256:
> + case INDEX_op_sub64_v256:
> + return TCG_TARGET_HAS_v256;
> +
> + case INDEX_op_not_v64:
> + return TCG_TARGET_HAS_not_v64;
> + case INDEX_op_not_v128:
> + return TCG_TARGET_HAS_not_v128;
> + case INDEX_op_not_v256:
> + return TCG_TARGET_HAS_not_v256;
> +
> + case INDEX_op_andc_v64:
> + return TCG_TARGET_HAS_andc_v64;
> + case INDEX_op_andc_v128:
> + return TCG_TARGET_HAS_andc_v128;
> + case INDEX_op_andc_v256:
> + return TCG_TARGET_HAS_andc_v256;
> +
> + case INDEX_op_orc_v64:
> + return TCG_TARGET_HAS_orc_v64;
> + case INDEX_op_orc_v128:
> + return TCG_TARGET_HAS_orc_v128;
> + case INDEX_op_orc_v256:
> + return TCG_TARGET_HAS_orc_v256;
> +
> + case INDEX_op_neg8_v64:
> + case INDEX_op_neg16_v64:
> + case INDEX_op_neg32_v64:
> + return TCG_TARGET_HAS_neg_v64;
> +
> + case INDEX_op_neg8_v128:
> + case INDEX_op_neg16_v128:
> + case INDEX_op_neg32_v128:
> + case INDEX_op_neg64_v128:
> + return TCG_TARGET_HAS_neg_v128;
> +
> + case INDEX_op_neg8_v256:
> + case INDEX_op_neg16_v256:
> + case INDEX_op_neg32_v256:
> + case INDEX_op_neg64_v256:
> + return TCG_TARGET_HAS_neg_v256;
> +
> + case NB_OPS:
> + break;
> + }
> + g_assert_not_reached();
> +}
> +
> /* Note: we convert the 64 bit args to 32 bit and do some alignment
> and endian swap. Maybe it would be better to do the alignment
> and endian swap in tcg_reg_alloc_call(). */
>
* Re: [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid
2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
@ 2017-08-17 23:45 ` Philippe Mathieu-Daudé
2017-09-08 9:30 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-17 23:45 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee
On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Add with value 0 so that structure zero initialization can
> indicate that the field is not present.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
> ---
> tcg/tcg-opc.h | 2 ++
> tcg/tcg.c | 3 +++
> 2 files changed, 5 insertions(+)
>
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 9162125fac..b1445a4c24 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -26,6 +26,8 @@
> * DEF(name, oargs, iargs, cargs, flags)
> */
>
> +DEF(invalid, 0, 0, 0, TCG_OPF_NOT_PRESENT)
> +
> /* predefined ops */
> DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
> DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 3c3cdda938..879b29e81f 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -756,6 +756,9 @@ int tcg_check_temp_count(void)
> bool tcg_op_supported(TCGOpcode op)
> {
> switch (op) {
> + case INDEX_op_invalid:
> + return false;
> +
> case INDEX_op_discard:
> case INDEX_op_set_label:
> case INDEX_op_call:
>
* Re: [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors
2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
@ 2017-08-17 23:46 ` Philippe Mathieu-Daudé
2017-09-07 18:18 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-17 23:46 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee
On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Nothing uses or enables them yet.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
> ---
> tcg/tcg.h | 5 +++++
> tcg/tcg.c | 2 +-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index dd97095af5..1277caed3d 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -256,6 +256,11 @@ typedef struct TCGPool {
> typedef enum TCGType {
> TCG_TYPE_I32,
> TCG_TYPE_I64,
> +
> + TCG_TYPE_V64,
> + TCG_TYPE_V128,
> + TCG_TYPE_V256,
> +
> TCG_TYPE_COUNT, /* number of different types */
>
> /* An alias for the size of the host register. */
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 787c8ba0f7..ea78d47fad 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -118,7 +118,7 @@ static TCGReg tcg_reg_alloc_new(TCGContext *s, TCGType t)
> static bool tcg_out_ldst_finalize(TCGContext *s);
> #endif
>
> -static TCGRegSet tcg_target_available_regs[2];
> +static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
> static TCGRegSet tcg_target_call_clobber_regs;
>
> #if TCG_TARGET_INSN_UNIT_SIZE == 1
>
* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
@ 2017-08-22 13:15 ` Alex Bennée
2017-08-23 19:02 ` Richard Henderson
2017-09-08 10:13 ` Alex Bennée
1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-08-22 13:15 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> tcg/i386/tcg-target.h | 46 +++++-
> tcg/tcg-opc.h | 12 +-
> tcg/i386/tcg-target.inc.c | 382 ++++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 399 insertions(+), 41 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
> index e512648c95..147f82062b 100644
> --- a/tcg/i386/tcg-target.h
> +++ b/tcg/i386/tcg-target.h
> @@ -30,11 +30,10 @@
>
> #ifdef __x86_64__
> # define TCG_TARGET_REG_BITS 64
> -# define TCG_TARGET_NB_REGS 16
> #else
> # define TCG_TARGET_REG_BITS 32
> -# define TCG_TARGET_NB_REGS 8
> #endif
> +# define TCG_TARGET_NB_REGS 24
>
> typedef enum {
> TCG_REG_EAX = 0,
> @@ -56,6 +55,19 @@ typedef enum {
> TCG_REG_R13,
> TCG_REG_R14,
> TCG_REG_R15,
> +
> + /* SSE registers; 64-bit has access to 8 more, but we won't
> + need more than a few and using only the first 8 minimizes
> + the need for a rex prefix on the sse instructions. */
> + TCG_REG_XMM0,
> + TCG_REG_XMM1,
> + TCG_REG_XMM2,
> + TCG_REG_XMM3,
> + TCG_REG_XMM4,
> + TCG_REG_XMM5,
> + TCG_REG_XMM6,
> + TCG_REG_XMM7,
> +
> TCG_REG_RAX = TCG_REG_EAX,
> TCG_REG_RCX = TCG_REG_ECX,
> TCG_REG_RDX = TCG_REG_EDX,
> @@ -79,6 +91,17 @@ extern bool have_bmi1;
> extern bool have_bmi2;
> extern bool have_popcnt;
>
> +#ifdef __SSE2__
> +#define have_sse2 true
> +#else
> +extern bool have_sse2;
> +#endif
> +#ifdef __AVX2__
> +#define have_avx2 true
> +#else
> +extern bool have_avx2;
> +#endif
> +
> /* optional instructions */
> #define TCG_TARGET_HAS_div2_i32 1
> #define TCG_TARGET_HAS_rot_i32 1
> @@ -147,6 +170,25 @@ extern bool have_popcnt;
> #define TCG_TARGET_HAS_mulsh_i64 0
> #endif
>
> +#define TCG_TARGET_HAS_v64 have_sse2
> +#define TCG_TARGET_HAS_v128 have_sse2
> +#define TCG_TARGET_HAS_v256 have_avx2
> +
> +#define TCG_TARGET_HAS_andc_v64 TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_orc_v64 0
> +#define TCG_TARGET_HAS_not_v64 0
> +#define TCG_TARGET_HAS_neg_v64 0
> +
> +#define TCG_TARGET_HAS_andc_v128 TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_orc_v128 0
> +#define TCG_TARGET_HAS_not_v128 0
> +#define TCG_TARGET_HAS_neg_v128 0
> +
> +#define TCG_TARGET_HAS_andc_v256 TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_orc_v256 0
> +#define TCG_TARGET_HAS_not_v256 0
> +#define TCG_TARGET_HAS_neg_v256 0
> +
> #define TCG_TARGET_deposit_i32_valid(ofs, len) \
> (have_bmi2 || \
> ((ofs) == 0 && (len) == 8) || \
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index b1445a4c24..b84cd584fb 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -212,13 +212,13 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
> /* Host integer vector operations. */
> /* These opcodes are required whenever the base vector size is enabled. */
>
> -DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> -DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> -DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(mov_v64, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v128, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v256, 1, 1, 0, TCG_OPF_NOT_PRESENT)
>
> -DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> -DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> -DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +DEF(movi_v64, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v128, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v256, 1, 0, 1, TCG_OPF_NOT_PRESENT)
>
> DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
> DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index aeefb72aa0..0e01b54aa0 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -31,7 +31,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
> "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15",
> #else
> "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
> + NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> #endif
> + "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
> };
> #endif
>
> @@ -61,6 +63,14 @@ static const int tcg_target_reg_alloc_order[] = {
> TCG_REG_EDX,
> TCG_REG_EAX,
> #endif
> + TCG_REG_XMM0,
> + TCG_REG_XMM1,
> + TCG_REG_XMM2,
> + TCG_REG_XMM3,
> + TCG_REG_XMM4,
> + TCG_REG_XMM5,
> + TCG_REG_XMM6,
> + TCG_REG_XMM7,
> };
>
> static const int tcg_target_call_iarg_regs[] = {
> @@ -94,7 +104,7 @@ static const int tcg_target_call_oarg_regs[] = {
> #define TCG_CT_CONST_I32 0x400
> #define TCG_CT_CONST_WSZ 0x800
>
> -/* Registers used with L constraint, which are the first argument
> +/* Registers used with L constraint, which are the first argument
> registers on x86_64, and two random call clobbered registers on
> i386. */
> #if TCG_TARGET_REG_BITS == 64
> @@ -127,6 +137,16 @@ bool have_bmi1;
> bool have_bmi2;
> bool have_popcnt;
>
> +#ifndef have_sse2
> +bool have_sse2;
> +#endif
> +#ifdef have_avx2
> +#define have_avx1 have_avx2
> +#else
> +static bool have_avx1;
> +bool have_avx2;
> +#endif
> +
> #ifdef CONFIG_CPUID_H
> static bool have_movbe;
> static bool have_lzcnt;
> @@ -215,6 +235,10 @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
> /* With TZCNT/LZCNT, we can have operand-size as an input. */
> ct->ct |= TCG_CT_CONST_WSZ;
> break;
> + case 'x':
> + ct->ct |= TCG_CT_REG;
> + tcg_regset_set32(ct->u.regs, 0, 0xff0000);
> + break;
>
> /* qemu_ld/st address constraint */
> case 'L':
> @@ -292,6 +316,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
> #endif
> #define P_SIMDF3 0x20000 /* 0xf3 opcode prefix */
> #define P_SIMDF2 0x40000 /* 0xf2 opcode prefix */
> +#define P_VEXL 0x80000 /* Set VEX.L = 1 */
>
> #define OPC_ARITH_EvIz (0x81)
> #define OPC_ARITH_EvIb (0x83)
> @@ -324,13 +349,31 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
> #define OPC_MOVL_Iv (0xb8)
> #define OPC_MOVBE_GyMy (0xf0 | P_EXT38)
> #define OPC_MOVBE_MyGy (0xf1 | P_EXT38)
> +#define OPC_MOVDQA_GyMy (0x6f | P_EXT | P_DATA16)
> +#define OPC_MOVDQA_MyGy (0x7f | P_EXT | P_DATA16)
> +#define OPC_MOVDQU_GyMy (0x6f | P_EXT | P_SIMDF3)
> +#define OPC_MOVDQU_MyGy (0x7f | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_GyMy (0x7e | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_MyGy (0xd6 | P_EXT | P_DATA16)
> #define OPC_MOVSBL (0xbe | P_EXT)
> #define OPC_MOVSWL (0xbf | P_EXT)
> #define OPC_MOVSLQ (0x63 | P_REXW)
> #define OPC_MOVZBL (0xb6 | P_EXT)
> #define OPC_MOVZWL (0xb7 | P_EXT)
> +#define OPC_PADDB (0xfc | P_EXT | P_DATA16)
> +#define OPC_PADDW (0xfd | P_EXT | P_DATA16)
> +#define OPC_PADDD (0xfe | P_EXT | P_DATA16)
> +#define OPC_PADDQ (0xd4 | P_EXT | P_DATA16)
> +#define OPC_PAND (0xdb | P_EXT | P_DATA16)
> +#define OPC_PANDN (0xdf | P_EXT | P_DATA16)
> #define OPC_PDEP (0xf5 | P_EXT38 | P_SIMDF2)
> #define OPC_PEXT (0xf5 | P_EXT38 | P_SIMDF3)
> +#define OPC_POR (0xeb | P_EXT | P_DATA16)
> +#define OPC_PSUBB (0xf8 | P_EXT | P_DATA16)
> +#define OPC_PSUBW (0xf9 | P_EXT | P_DATA16)
> +#define OPC_PSUBD (0xfa | P_EXT | P_DATA16)
> +#define OPC_PSUBQ (0xfb | P_EXT | P_DATA16)
> +#define OPC_PXOR (0xef | P_EXT | P_DATA16)
> #define OPC_POP_r32 (0x58)
> #define OPC_POPCNT (0xb8 | P_EXT | P_SIMDF3)
> #define OPC_PUSH_r32 (0x50)
> @@ -500,7 +543,8 @@ static void tcg_out_modrm(TCGContext *s, int opc, int r, int rm)
> tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
> }
>
> -static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
> +static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v,
> + int rm, int index)
> {
> int tmp;
>
> @@ -515,14 +559,16 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
> } else if (opc & P_EXT) {
> tmp = 1;
> } else {
> - tcg_abort();
> + g_assert_not_reached();
> }
> - tmp |= 0x40; /* VEX.X */
> tmp |= (r & 8 ? 0 : 0x80); /* VEX.R */
> + tmp |= (index & 8 ? 0 : 0x40); /* VEX.X */
> tmp |= (rm & 8 ? 0 : 0x20); /* VEX.B */
> tcg_out8(s, tmp);
>
> tmp = (opc & P_REXW ? 0x80 : 0); /* VEX.W */
> + tmp |= (opc & P_VEXL ? 0x04 : 0); /* VEX.L */
> +
> /* VEX.pp */
> if (opc & P_DATA16) {
> tmp |= 1; /* 0x66 */
> @@ -538,7 +584,7 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
>
> static void tcg_out_vex_modrm(TCGContext *s, int opc, int r, int v, int rm)
> {
> - tcg_out_vex_pfx_opc(s, opc, r, v, rm);
> + tcg_out_vex_pfx_opc(s, opc, r, v, rm, 0);
> tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
> }
>
> @@ -565,7 +611,7 @@ static void tcg_out_opc_pool_imm(TCGContext *s, int opc, int r,
> static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
> tcg_target_ulong data)
> {
> - tcg_out_vex_pfx_opc(s, opc, r, v, 0);
> + tcg_out_vex_pfx_opc(s, opc, r, v, 0, 0);
> tcg_out_sfx_pool_imm(s, r, data);
> }
>
> @@ -574,8 +620,8 @@ static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
> mode for absolute addresses, ~RM is the size of the immediate operand
> that will follow the instruction. */
>
> -static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> - int index, int shift, intptr_t offset)
> +static void tcg_out_sib_offset(TCGContext *s, int r, int rm, int index,
> + int shift, intptr_t offset)
> {
> int mod, len;
>
> @@ -586,7 +632,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> intptr_t pc = (intptr_t)s->code_ptr + 5 + ~rm;
> intptr_t disp = offset - pc;
> if (disp == (int32_t)disp) {
> - tcg_out_opc(s, opc, r, 0, 0);
> tcg_out8(s, (LOWREGMASK(r) << 3) | 5);
> tcg_out32(s, disp);
> return;
> @@ -596,7 +641,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> use of the MODRM+SIB encoding and is therefore larger than
> rip-relative addressing. */
> if (offset == (int32_t)offset) {
> - tcg_out_opc(s, opc, r, 0, 0);
> tcg_out8(s, (LOWREGMASK(r) << 3) | 4);
> tcg_out8(s, (4 << 3) | 5);
> tcg_out32(s, offset);
> @@ -604,10 +648,9 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> }
>
> /* ??? The memory isn't directly addressable. */
> - tcg_abort();
> + g_assert_not_reached();
> } else {
> /* Absolute address. */
> - tcg_out_opc(s, opc, r, 0, 0);
> tcg_out8(s, (r << 3) | 5);
> tcg_out32(s, offset);
> return;
> @@ -630,7 +673,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> that would be used for %esp is the escape to the two byte form. */
> if (index < 0 && LOWREGMASK(rm) != TCG_REG_ESP) {
> /* Single byte MODRM format. */
> - tcg_out_opc(s, opc, r, rm, 0);
> tcg_out8(s, mod | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
> } else {
> /* Two byte MODRM+SIB format. */
> @@ -644,7 +686,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> tcg_debug_assert(index != TCG_REG_ESP);
> }
>
> - tcg_out_opc(s, opc, r, rm, index);
> tcg_out8(s, mod | (LOWREGMASK(r) << 3) | 4);
> tcg_out8(s, (shift << 6) | (LOWREGMASK(index) << 3) | LOWREGMASK(rm));
> }
> @@ -656,6 +697,21 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> }
> }
>
> +static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> + int index, int shift, intptr_t offset)
> +{
> + tcg_out_opc(s, opc, r, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> + tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
> +static void tcg_out_vex_modrm_sib_offset(TCGContext *s, int opc, int r, int v,
> + int rm, int index, int shift,
> + intptr_t offset)
> +{
> + tcg_out_vex_pfx_opc(s, opc, r, v, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> + tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
> /* A simplification of the above with no index or shift. */
> static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
> int rm, intptr_t offset)
> @@ -663,6 +719,31 @@ static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
> tcg_out_modrm_sib_offset(s, opc, r, rm, -1, 0, offset);
> }
>
> +static inline void tcg_out_vex_modrm_offset(TCGContext *s, int opc, int r,
> + int v, int rm, intptr_t offset)
> +{
> + tcg_out_vex_modrm_sib_offset(s, opc, r, v, rm, -1, 0, offset);
> +}
> +
> +static void tcg_out_maybe_vex_modrm(TCGContext *s, int opc, int r, int rm)
> +{
> + if (have_avx1) {
> + tcg_out_vex_modrm(s, opc, r, 0, rm);
> + } else {
> + tcg_out_modrm(s, opc, r, rm);
> + }
> +}
> +
> +static void tcg_out_maybe_vex_modrm_offset(TCGContext *s, int opc, int r,
> + int rm, intptr_t offset)
> +{
> + if (have_avx1) {
> + tcg_out_vex_modrm_offset(s, opc, r, 0, rm, offset);
> + } else {
> + tcg_out_modrm_offset(s, opc, r, rm, offset);
> + }
> +}
> +
> /* Generate dest op= src. Uses the same ARITH_* codes as tgen_arithi. */
> static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
> {
> @@ -673,12 +754,32 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
> tcg_out_modrm(s, OPC_ARITH_GvEv + (subop << 3) + ext, dest, src);
> }
>
> -static inline void tcg_out_mov(TCGContext *s, TCGType type,
> - TCGReg ret, TCGReg arg)
> +static void tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
> {
> if (arg != ret) {
> - int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> - tcg_out_modrm(s, opc, ret, arg);
> + int opc = 0;
> +
> + switch (type) {
> + case TCG_TYPE_I64:
> + opc = P_REXW;
> + /* fallthru */
> + case TCG_TYPE_I32:
> + opc |= OPC_MOVL_GvEv;
> + tcg_out_modrm(s, opc, ret, arg);
> + break;
> +
> + case TCG_TYPE_V256:
> + opc = P_VEXL;
> + /* fallthru */
> + case TCG_TYPE_V128:
> + case TCG_TYPE_V64:
> + opc |= OPC_MOVDQA_GyMy;
> + tcg_out_maybe_vex_modrm(s, opc, ret, arg);
> + break;
> +
> + default:
> + g_assert_not_reached();
> + }
> }
> }
>
> @@ -687,6 +788,27 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
> {
> tcg_target_long diff;
>
> + switch (type) {
> + case TCG_TYPE_I32:
> + case TCG_TYPE_I64:
> + break;
> +
> + case TCG_TYPE_V64:
> + case TCG_TYPE_V128:
> + case TCG_TYPE_V256:
> + /* ??? Revisit this as the implementation progresses. */
> + tcg_debug_assert(arg == 0);
> + if (have_avx1) {
> + tcg_out_vex_modrm(s, OPC_PXOR, ret, ret, ret);
> + } else {
> + tcg_out_modrm(s, OPC_PXOR, ret, ret);
> + }
> + return;
> +
> + default:
> + g_assert_not_reached();
> + }
> +
> if (arg == 0) {
> tgen_arithr(s, ARITH_XOR, ret, ret);
> return;
> @@ -750,18 +872,54 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
> tcg_out_opc(s, OPC_POP_r32 + LOWREGMASK(reg), 0, reg, 0);
> }
>
> -static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> - TCGReg arg1, intptr_t arg2)
> +static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> + TCGReg arg1, intptr_t arg2)
> {
> - int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> - tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
> + switch (type) {
> + case TCG_TYPE_I64:
> + tcg_out_modrm_offset(s, OPC_MOVL_GvEv | P_REXW, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_I32:
> + tcg_out_modrm_offset(s, OPC_MOVL_GvEv, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_V64:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_GyMy, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_V128:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_GyMy, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_V256:
> + tcg_out_vex_modrm_offset(s, OPC_MOVDQU_GyMy | P_VEXL,
> + ret, 0, arg1, arg2);
> + break;
> + default:
> + g_assert_not_reached();
> + }
> }
>
> -static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> - TCGReg arg1, intptr_t arg2)
> +static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> + TCGReg arg1, intptr_t arg2)
> {
> - int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> - tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
> + switch (type) {
> + case TCG_TYPE_I64:
> + tcg_out_modrm_offset(s, OPC_MOVL_EvGv | P_REXW, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_I32:
> + tcg_out_modrm_offset(s, OPC_MOVL_EvGv, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_V64:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_MyGy, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_V128:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_MyGy, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_V256:
> + tcg_out_vex_modrm_offset(s, OPC_MOVDQU_MyGy | P_VEXL,
> + arg, 0, arg1, arg2);
> + break;
> + default:
> + g_assert_not_reached();
> + }
> }
>
> static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
> @@ -773,6 +931,8 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
> return false;
> }
> rexw = P_REXW;
> + } else if (type != TCG_TYPE_I32) {
> + return false;
> }
> tcg_out_modrm_offset(s, OPC_MOVL_EvIz | rexw, 0, base, ofs);
> tcg_out32(s, val);
> @@ -1914,6 +2074,15 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
> case glue(glue(INDEX_op_, x), _i32)
> #endif
>
> +#define OP_128_256(x) \
> + case glue(glue(INDEX_op_, x), _v256): \
> + rexw = P_VEXL; /* FALLTHRU */ \
> + case glue(glue(INDEX_op_, x), _v128)
> +
> +#define OP_64_128_256(x) \
> + OP_128_256(x): \
> + case glue(glue(INDEX_op_, x), _v64)
> +
> /* Hoist the loads of the most common arguments. */
> a0 = args[0];
> a1 = args[1];
> @@ -2379,19 +2548,94 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
> }
> break;
>
> + OP_64_128_256(add8):
> + c = OPC_PADDB;
> + goto gen_simd;
> + OP_64_128_256(add16):
> + c = OPC_PADDW;
> + goto gen_simd;
> + OP_64_128_256(add32):
> + c = OPC_PADDD;
> + goto gen_simd;
> + OP_128_256(add64):
> + c = OPC_PADDQ;
> + goto gen_simd;
> + OP_64_128_256(sub8):
> + c = OPC_PSUBB;
> + goto gen_simd;
> + OP_64_128_256(sub16):
> + c = OPC_PSUBW;
> + goto gen_simd;
> + OP_64_128_256(sub32):
> + c = OPC_PSUBD;
> + goto gen_simd;
> + OP_128_256(sub64):
> + c = OPC_PSUBQ;
> + goto gen_simd;
> + OP_64_128_256(and):
> + c = OPC_PAND;
> + goto gen_simd;
> + OP_64_128_256(andc):
> + c = OPC_PANDN;
> + goto gen_simd;
> + OP_64_128_256(or):
> + c = OPC_POR;
> + goto gen_simd;
> + OP_64_128_256(xor):
> + c = OPC_PXOR;
> + gen_simd:
> + if (have_avx1) {
> + tcg_out_vex_modrm(s, c, a0, a1, a2);
> + } else {
> + tcg_out_modrm(s, c, a0, a2);
> + }
> + break;
> +
> + case INDEX_op_ld_v64:
> + c = TCG_TYPE_V64;
> + goto gen_simd_ld;
> + case INDEX_op_ld_v128:
> + c = TCG_TYPE_V128;
> + goto gen_simd_ld;
> + case INDEX_op_ld_v256:
> + c = TCG_TYPE_V256;
> + gen_simd_ld:
> + tcg_out_ld(s, c, a0, a1, a2);
> + break;
> +
> + case INDEX_op_st_v64:
> + c = TCG_TYPE_V64;
> + goto gen_simd_st;
> + case INDEX_op_st_v128:
> + c = TCG_TYPE_V128;
> + goto gen_simd_st;
> + case INDEX_op_st_v256:
> + c = TCG_TYPE_V256;
> + gen_simd_st:
> + tcg_out_st(s, c, a0, a1, a2);
> + break;
> +
> case INDEX_op_mb:
> tcg_out_mb(s, a0);
> break;
> case INDEX_op_mov_i32: /* Always emitted via tcg_out_mov. */
> case INDEX_op_mov_i64:
> + case INDEX_op_mov_v64:
> + case INDEX_op_mov_v128:
> + case INDEX_op_mov_v256:
> case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi. */
> case INDEX_op_movi_i64:
> + case INDEX_op_movi_v64:
> + case INDEX_op_movi_v128:
> + case INDEX_op_movi_v256:
> case INDEX_op_call: /* Always emitted via tcg_out_call. */
> default:
> tcg_abort();
> }
>
> #undef OP_32_64
> +#undef OP_128_256
> +#undef OP_64_128_256
> }
>
> static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> @@ -2417,6 +2661,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> = { .args_ct_str = { "r", "r", "L", "L" } };
> static const TCGTargetOpDef L_L_L_L
> = { .args_ct_str = { "L", "L", "L", "L" } };
> + static const TCGTargetOpDef x_0_x = { .args_ct_str = { "x", "0", "x" } };
> + static const TCGTargetOpDef x_x_x = { .args_ct_str = { "x", "x", "x" } };
> + static const TCGTargetOpDef x_r = { .args_ct_str = { "x", "r" } };
>
> switch (op) {
> case INDEX_op_goto_ptr:
> @@ -2620,6 +2867,52 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> return &s2;
> }
>
> + case INDEX_op_ld_v64:
> + case INDEX_op_ld_v128:
> + case INDEX_op_ld_v256:
> + case INDEX_op_st_v64:
> + case INDEX_op_st_v128:
> + case INDEX_op_st_v256:
> + return &x_r;
> +
> + case INDEX_op_add8_v64:
> + case INDEX_op_add8_v128:
> + case INDEX_op_add16_v64:
> + case INDEX_op_add16_v128:
> + case INDEX_op_add32_v64:
> + case INDEX_op_add32_v128:
> + case INDEX_op_add64_v128:
> + case INDEX_op_sub8_v64:
> + case INDEX_op_sub8_v128:
> + case INDEX_op_sub16_v64:
> + case INDEX_op_sub16_v128:
> + case INDEX_op_sub32_v64:
> + case INDEX_op_sub32_v128:
> + case INDEX_op_sub64_v128:
> + case INDEX_op_and_v64:
> + case INDEX_op_and_v128:
> + case INDEX_op_andc_v64:
> + case INDEX_op_andc_v128:
> + case INDEX_op_or_v64:
> + case INDEX_op_or_v128:
> + case INDEX_op_xor_v64:
> + case INDEX_op_xor_v128:
> + return have_avx1 ? &x_x_x : &x_0_x;
> +
> + case INDEX_op_add8_v256:
> + case INDEX_op_add16_v256:
> + case INDEX_op_add32_v256:
> + case INDEX_op_add64_v256:
> + case INDEX_op_sub8_v256:
> + case INDEX_op_sub16_v256:
> + case INDEX_op_sub32_v256:
> + case INDEX_op_sub64_v256:
> + case INDEX_op_and_v256:
> + case INDEX_op_andc_v256:
> + case INDEX_op_or_v256:
> + case INDEX_op_xor_v256:
> + return &x_x_x;
> +
> default:
> break;
> }
> @@ -2725,9 +3018,16 @@ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
> static void tcg_target_init(TCGContext *s)
> {
> #ifdef CONFIG_CPUID_H
> - unsigned a, b, c, d;
> + unsigned a, b, c, d, b7 = 0;
> int max = __get_cpuid_max(0, 0);
>
> + if (max >= 7) {
> + /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs. */
> + __cpuid_count(7, 0, a, b7, c, d);
> + have_bmi1 = (b7 & bit_BMI) != 0;
> + have_bmi2 = (b7 & bit_BMI2) != 0;
> + }
> +
> if (max >= 1) {
> __cpuid(1, a, b, c, d);
> #ifndef have_cmov
> @@ -2736,17 +3036,26 @@ static void tcg_target_init(TCGContext *s)
> available, we'll use a small forward branch. */
> have_cmov = (d & bit_CMOV) != 0;
> #endif
> +#ifndef have_sse2
> + have_sse2 = (d & bit_SSE2) != 0;
> +#endif
> /* MOVBE is only available on Intel Atom and Haswell CPUs, so we
> need to probe for it. */
> have_movbe = (c & bit_MOVBE) != 0;
> have_popcnt = (c & bit_POPCNT) != 0;
> - }
>
> - if (max >= 7) {
> - /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs. */
> - __cpuid_count(7, 0, a, b, c, d);
> - have_bmi1 = (b & bit_BMI) != 0;
> - have_bmi2 = (b & bit_BMI2) != 0;
> +#ifndef have_avx2
> + /* There are a number of things we must check before we can be
> + sure of not hitting invalid opcode. */
> + if (c & bit_OSXSAVE) {
> + unsigned xcrl, xcrh;
> + asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
> + if (xcrl & 6 == 6) {
My picky compiler complains:
/home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c: In function ‘tcg_target_init’:
/home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c:3053:22: error: suggest parentheses around comparison in operand of ‘&’ [-Werror=parentheses]
if (xcrl & 6 == 6) {
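Since == binds more tightly than &, the test as written evaluates as
xcrl & (6 == 6), i.e. xcrl & 1, which checks XCR0 bit 0 (x87 state)
rather than bits 1 and 2 (SSE and AVX state). Presumably the intended
check is:

    if ((xcrl & 6) == 6) {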
> + have_avx1 = (c & bit_AVX) != 0;
> + have_avx2 = (b7 & bit_AVX2) != 0;
> + }
> + }
> +#endif
> }
>
> max = __get_cpuid_max(0x8000000, 0);
> @@ -2763,6 +3072,13 @@ static void tcg_target_init(TCGContext *s)
> } else {
> tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xff);
> }
> + if (have_sse2) {
> + tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V64], 0, 0xff0000);
> + tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V128], 0, 0xff0000);
> + }
> + if (have_avx2) {
> + tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V256], 0, 0xff0000);
> + }
>
> tcg_regset_clear(tcg_target_call_clobber_regs);
> tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_EAX);
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-08-22 13:15 ` Alex Bennée
@ 2017-08-23 19:02 ` Richard Henderson
0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-08-23 19:02 UTC (permalink / raw)
To: Alex Bennée; +Cc: qemu-devel, qemu-arm
On 08/22/2017 06:15 AM, Alex Bennée wrote:
>> +#ifndef have_avx2
>> + /* There are a number of things we must check before we can be
>> + sure of not hitting invalid opcode. */
>> + if (c & bit_OSXSAVE) {
>> + unsigned xcrl, xcrh;
>> + asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
>> + if (xcrl & 6 == 6) {
>
> My picky compiler complains:
>
> /home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c: In function ‘tcg_target_init’:
> /home/alex/lsrc/qemu/qemu.git/tcg/i386/tcg-target.inc.c:3053:22: error: suggest parentheses around comparison in operand of ‘&’ [-Werror=parentheses]
> if (xcrl & 6 == 6) {
Bah. I forgot that my default build uses -march=native, and my laptop has
AVX2, so this bit wouldn't have been compile tested at all.
Fixed on the branch.
r~
* Re: [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
@ 2017-08-30 1:31 ` Philippe Mathieu-Daudé
2017-09-01 20:38 ` Richard Henderson
2017-09-07 16:34 ` Alex Bennée
1 sibling, 1 reply; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-30 1:31 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee
Hi Richard,
I can't find anything to say about this patch... Hardcore stuff.
Some parts could be a bit more verbose, but after focusing on it for a while it
makes sense.
I wonder how long it took you to write this :) "roughly 2h"
On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Hoping I didn't miss anything:
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
> ---
> Makefile.target | 5 +-
> tcg/tcg-op-gvec.h | 88 ++++++++++
> tcg/tcg-runtime.h | 16 ++
> tcg/tcg-op-gvec.c | 443 +++++++++++++++++++++++++++++++++++++++++++++++++
> tcg/tcg-runtime-gvec.c | 199 ++++++++++++++++++++++
> 5 files changed, 749 insertions(+), 2 deletions(-)
> create mode 100644 tcg/tcg-op-gvec.h
> create mode 100644 tcg/tcg-op-gvec.c
> create mode 100644 tcg/tcg-runtime-gvec.c
>
> diff --git a/Makefile.target b/Makefile.target
> index 7f42c45db8..9ae3e904f7 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -93,8 +93,9 @@ all: $(PROGS) stap
> # cpu emulator library
> obj-y += exec.o
> obj-y += accel/
> -obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
> -obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/tcg-runtime.o
> +obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-common.o tcg/optimize.o
> +obj-$(CONFIG_TCG) += tcg/tcg-op.o tcg/tcg-op-gvec.o
> +obj-$(CONFIG_TCG) += tcg/tcg-runtime.o tcg/tcg-runtime-gvec.o
> obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
> obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
> obj-y += fpu/softfloat.o
> diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
> new file mode 100644
> index 0000000000..10db3599a5
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.h
> @@ -0,0 +1,88 @@
> +/*
> + * Generic vector operation expansion
> + *
> + * Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/*
> + * "Generic" vectors. All operands are given as offsets from ENV,
> + * and therefore cannot also be allocated via tcg_global_mem_new_*.
> + * OPSZ is the byte size of the vector upon which the operation is performed.
> + * CLSZ is the byte size of the full vector; bytes beyond OPSZ are cleared.
> + *
> + * All sizes must be 8 or any multiple of 16.
> + * When OPSZ is 8, the alignment may be 8, otherwise must be 16.
> + * Operands may completely, but not partially, overlap.
> + */
> +
> +/* Fundamental operation expanders. These are exposed to the front ends
> + so that target-specific SIMD operations can be handled similarly to
> + the standard SIMD operations. */
> +
> +typedef struct {
> + /* "Small" sizes: expand inline as a 64-bit or 32-bit lane.
> + Generally only one of these will be non-NULL. */
> + void (*fni8)(TCGv_i64, TCGv_i64, TCGv_i64);
> + void (*fni4)(TCGv_i32, TCGv_i32, TCGv_i32);
> + /* Similarly, but load up a constant and re-use across lanes. */
> + void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
> + uint64_t extra_value;
> + /* Larger sizes: expand out-of-line helper w/size descriptor. */
> + void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
> +} GVecGen3;
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz, const GVecGen3 *);
> +
> +#define DEF_GVEC_2(X) \
> + void tcg_gen_gvec_##X(uint32_t dofs, uint32_t aofs, uint32_t bofs, \
> + uint32_t opsz, uint32_t clsz)
> +
> +DEF_GVEC_2(add8);
> +DEF_GVEC_2(add16);
> +DEF_GVEC_2(add32);
> +DEF_GVEC_2(add64);
> +
> +DEF_GVEC_2(sub8);
> +DEF_GVEC_2(sub16);
> +DEF_GVEC_2(sub32);
> +DEF_GVEC_2(sub64);
> +
> +DEF_GVEC_2(and8);
> +DEF_GVEC_2(or8);
> +DEF_GVEC_2(xor8);
> +DEF_GVEC_2(andc8);
> +DEF_GVEC_2(orc8);
> +
> +#undef DEF_GVEC_2
> +
> +/*
> + * 64-bit vector operations. Use these when the register has been
> + * allocated with tcg_global_mem_new_i64. OPSZ = CLSZ = 8.
> + */
> +
> +#define DEF_VEC8_2(X) \
> + void tcg_gen_vec8_##X(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +
> +DEF_VEC8_2(add8);
> +DEF_VEC8_2(add16);
> +DEF_VEC8_2(add32);
> +
> +DEF_VEC8_2(sub8);
> +DEF_VEC8_2(sub16);
> +DEF_VEC8_2(sub32);
> +
> +#undef DEF_VEC8_2
> diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
> index c41d38a557..f8d07090f8 100644
> --- a/tcg/tcg-runtime.h
> +++ b/tcg/tcg-runtime.h
> @@ -134,3 +134,19 @@ GEN_ATOMIC_HELPERS(xor_fetch)
> GEN_ATOMIC_HELPERS(xchg)
>
> #undef GEN_ATOMIC_HELPERS
> +
> +DEF_HELPER_FLAGS_4(gvec_add8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_sub8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_and8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_or8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_xor8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_andc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_orc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
> new file mode 100644
> index 0000000000..6de49dc07f
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.c
> @@ -0,0 +1,443 @@
> +/*
> + * Generic vector operation expansion
> + *
> + * Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "cpu.h"
> +#include "exec/exec-all.h"
> +#include "tcg.h"
> +#include "tcg-op.h"
> +#include "tcg-op-gvec.h"
> +#include "trace-tcg.h"
> +#include "trace/mem.h"
> +
> +#define REP8(x) ((x) * 0x0101010101010101ull)
> +#define REP16(x) ((x) * 0x0001000100010001ull)
> +
> +#define MAX_INLINE 16
> +
> +static inline void check_size_s(uint32_t opsz, uint32_t clsz)
> +{
> + tcg_debug_assert(opsz % 8 == 0);
> + tcg_debug_assert(clsz % 8 == 0);
> + tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> + tcg_debug_assert(dofs % 8 == 0);
> + tcg_debug_assert(aofs % 8 == 0);
> + tcg_debug_assert(bofs % 8 == 0);
> +}
> +
> +static inline void check_size_l(uint32_t opsz, uint32_t clsz)
> +{
> + tcg_debug_assert(opsz % 16 == 0);
> + tcg_debug_assert(clsz % 16 == 0);
> + tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> + tcg_debug_assert(dofs % 16 == 0);
> + tcg_debug_assert(aofs % 16 == 0);
> + tcg_debug_assert(bofs % 16 == 0);
> +}
> +
> +static inline void check_overlap_3(uint32_t d, uint32_t a,
> + uint32_t b, uint32_t s)
> +{
> + tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> + tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> + tcg_debug_assert(a == b || a + s <= b || b + s <= a);
> +}
> +
> +static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
> +{
> + if (clsz > opsz) {
> + TCGv_i64 zero = tcg_const_i64(0);
> + uint32_t i;
> +
> + for (i = opsz; i < clsz; i += 8) {
> + tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i64(zero);
> + }
> +}
> +
> +static TCGv_i32 make_desc(uint32_t opsz, uint32_t clsz)
> +{
> + tcg_debug_assert(opsz >= 16 && opsz <= 255 * 16 && opsz % 16 == 0);
> + tcg_debug_assert(clsz >= 16 && clsz <= 255 * 16 && clsz % 16 == 0);
> + opsz /= 16;
> + clsz /= 16;
> + opsz -= 1;
> + clsz -= 1;
> + return tcg_const_i32(deposit32(opsz, 8, 8, clsz));
> +}
> +
> +static void expand_3_o(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz,
> + void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32))
> +{
> + TCGv_ptr d = tcg_temp_new_ptr();
> + TCGv_ptr a = tcg_temp_new_ptr();
> + TCGv_ptr b = tcg_temp_new_ptr();
> + TCGv_i32 desc = make_desc(opsz, clsz);
> +
> + tcg_gen_addi_ptr(d, tcg_ctx.tcg_env, dofs);
> + tcg_gen_addi_ptr(a, tcg_ctx.tcg_env, aofs);
> + tcg_gen_addi_ptr(b, tcg_ctx.tcg_env, bofs);
> + fno(d, a, b, desc);
> +
> + tcg_temp_free_ptr(d);
> + tcg_temp_free_ptr(a);
> + tcg_temp_free_ptr(b);
> + tcg_temp_free_i32(desc);
> +}
> +
> +static void expand_3x4(uint32_t dofs, uint32_t aofs,
> + uint32_t bofs, uint32_t opsz,
> + void (*fni)(TCGv_i32, TCGv_i32, TCGv_i32))
> +{
> + TCGv_i32 t0 = tcg_temp_new_i32();
> + uint32_t i;
> +
> + if (aofs == bofs) {
> + for (i = 0; i < opsz; i += 4) {
> + tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> + fni(t0, t0, t0);
> + tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + } else {
> + TCGv_i32 t1 = tcg_temp_new_i32();
> + for (i = 0; i < opsz; i += 4) {
> + tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> + tcg_gen_ld_i32(t1, tcg_ctx.tcg_env, bofs + i);
> + fni(t0, t0, t1);
> + tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i32(t1);
> + }
> + tcg_temp_free_i32(t0);
> +}
> +
> +static void expand_3x8(uint32_t dofs, uint32_t aofs,
> + uint32_t bofs, uint32_t opsz,
> + void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64))
> +{
> + TCGv_i64 t0 = tcg_temp_new_i64();
> + uint32_t i;
> +
> + if (aofs == bofs) {
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + fni(t0, t0, t0);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + } else {
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> + fni(t0, t0, t1);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i64(t1);
> + }
> + tcg_temp_free_i64(t0);
> +}
> +
> +static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint64_t data,
> + void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
> +{
> + TCGv_i64 t0 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_const_i64(data);
> + uint32_t i;
> +
> + if (aofs == bofs) {
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + fni(t0, t0, t0, t2);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + } else {
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> + fni(t0, t0, t1, t2);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i64(t1);
> + }
> + tcg_temp_free_i64(t0);
> + tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
> +{
> + check_overlap_3(dofs, aofs, bofs, clsz);
> + if (opsz <= MAX_INLINE) {
> + check_size_s(opsz, clsz);
> + check_align_s_3(dofs, aofs, bofs);
> + if (g->fni8) {
> + expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
> + } else if (g->fni4) {
> + expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
> + } else if (g->fni8x) {
> + expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
> + } else {
> + g_assert_not_reached();
> + }
> + expand_clr(dofs, opsz, clsz);
> + } else {
> + check_size_l(opsz, clsz);
> + check_align_l_3(dofs, aofs, bofs);
> + expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
> + }
> +}
> +
> +static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> + TCGv_i64 t3 = tcg_temp_new_i64();
> +
> + tcg_gen_andc_i64(t1, a, m);
> + tcg_gen_andc_i64(t2, b, m);
> + tcg_gen_xor_i64(t3, a, b);
> + tcg_gen_add_i64(d, t1, t2);
> + tcg_gen_and_i64(t3, t3, m);
> + tcg_gen_xor_i64(d, d, t3);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> + tcg_temp_free_i64(t3);
> +}
> +
> +void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP8(0x80),
> + .fni8x = gen_addv_mask,
> + .fno = gen_helper_gvec_add8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP16(0x8000),
> + .fni8x = gen_addv_mask,
> + .fno = gen_helper_gvec_add16,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni4 = tcg_gen_add_i32,
> + .fno = gen_helper_gvec_add32,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_add_i64,
> + .fno = gen_helper_gvec_add64,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_add8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP8(0x80));
> + gen_addv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> + gen_addv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> +
> + tcg_gen_andi_i64(t1, a, ~0xffffffffull);
> + tcg_gen_add_i64(t2, a, b);
> + tcg_gen_add_i64(t1, t1, b);
> + tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> +}
> +
> +static void gen_subv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> + TCGv_i64 t3 = tcg_temp_new_i64();
> +
> + tcg_gen_or_i64(t1, a, m);
> + tcg_gen_andc_i64(t2, b, m);
> + tcg_gen_eqv_i64(t3, a, b);
> + tcg_gen_sub_i64(d, t1, t2);
> + tcg_gen_and_i64(t3, t3, m);
> + tcg_gen_xor_i64(d, d, t3);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> + tcg_temp_free_i64(t3);
> +}
> +
> +void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP8(0x80),
> + .fni8x = gen_subv_mask,
> + .fno = gen_helper_gvec_sub8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP16(0x8000),
> + .fni8x = gen_subv_mask,
> + .fno = gen_helper_gvec_sub16,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni4 = tcg_gen_sub_i32,
> + .fno = gen_helper_gvec_sub32,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_sub_i64,
> + .fno = gen_helper_gvec_sub64,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_sub8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP8(0x80));
> + gen_subv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> + gen_subv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> +
> + tcg_gen_andi_i64(t1, b, ~0xffffffffull);
> + tcg_gen_sub_i64(t2, a, b);
> + tcg_gen_sub_i64(t1, a, t1);
> + tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_and_i64,
> + .fno = gen_helper_gvec_and8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_or_i64,
> + .fno = gen_helper_gvec_or8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_xor_i64,
> + .fno = gen_helper_gvec_xor8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_andc_i64,
> + .fno = gen_helper_gvec_andc8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_orc_i64,
> + .fno = gen_helper_gvec_orc8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> diff --git a/tcg/tcg-runtime-gvec.c b/tcg/tcg-runtime-gvec.c
> new file mode 100644
> index 0000000000..9a37ce07a2
> --- /dev/null
> +++ b/tcg/tcg-runtime-gvec.c
> @@ -0,0 +1,199 @@
> +/*
> + * Generic vectorized operation runtime
> + *
> + * Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/host-utils.h"
> +#include "cpu.h"
> +#include "exec/helper-proto.h"
> +
> +/* Virtually all hosts support 16-byte vectors. Those that don't
> + can emulate them via GCC's generic vector extension.
> +
> + In tcg-op-gvec.c, we asserted that both the size and alignment
> + of the data are multiples of 16. */
> +
> +typedef uint8_t vec8 __attribute__((vector_size(16)));
> +typedef uint16_t vec16 __attribute__((vector_size(16)));
> +typedef uint32_t vec32 __attribute__((vector_size(16)));
> +typedef uint64_t vec64 __attribute__((vector_size(16)));
> +
> +static inline intptr_t extract_opsz(uint32_t desc)
> +{
> + return ((desc & 0xff) + 1) * 16;
> +}
> +
> +static inline intptr_t extract_clsz(uint32_t desc)
> +{
> + return (((desc >> 8) & 0xff) + 1) * 16;
> +}
> +
> +static inline void clear_high(void *d, intptr_t opsz, uint32_t desc)
> +{
> + intptr_t clsz = extract_clsz(desc);
> + intptr_t i;
> +
> + if (unlikely(clsz > opsz)) {
> + for (i = opsz; i < clsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = (vec64){ 0 };
> + }
> + }
> +}
> +
> +void HELPER(gvec_add8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec8)) {
> + *(vec8 *)(d + i) = *(vec8 *)(a + i) + *(vec8 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add16)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec16)) {
> + *(vec16 *)(d + i) = *(vec16 *)(a + i) + *(vec16 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add32)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec32)) {
> + *(vec32 *)(d + i) = *(vec32 *)(a + i) + *(vec32 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add64)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) + *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec8)) {
> + *(vec8 *)(d + i) = *(vec8 *)(a + i) - *(vec8 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub16)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec16)) {
> + *(vec16 *)(d + i) = *(vec16 *)(a + i) - *(vec16 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub32)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec32)) {
> + *(vec32 *)(d + i) = *(vec32 *)(a + i) - *(vec32 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub64)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) - *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_and8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) & *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_or8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) | *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_xor8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) ^ *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_andc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) &~ *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_orc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) |~ *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
>
* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
@ 2017-08-30 1:34 ` Philippe Mathieu-Daudé
2017-09-07 19:00 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-08-30 1:34 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: qemu-arm, alex.bennee
On 08/17/2017 08:01 PM, Richard Henderson wrote:
> Nothing uses or implements them yet.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
> ---
> tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> tcg/tcg.h | 24 ++++++++++++++++
> 2 files changed, 113 insertions(+)
>
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 956fb1e9f3..9162125fac 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>
> #undef TLADDR_ARGS
> #undef DATA64_ARGS
> +
> +/* Host integer vector operations. */
> +/* These opcodes are required whenever the base vector size is enabled. */
> +
> +DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +/* These opcodes are optional.
> + All element counts must be supported if any are. */
> +
> +DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
> +DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
> +DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
> +
> +DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
> +DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
> +DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
> +
> +DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
> +DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
> +DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
> +
> +DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +
> +DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +
> +DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +
> #undef IMPL
> #undef IMPL64
> #undef DEF
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 1277caed3d..b9e15da13b 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
> #define TCG_TARGET_HAS_rem_i64 0
> #endif
>
> +#ifndef TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_v64 0
> +#define TCG_TARGET_HAS_andc_v64 0
> +#define TCG_TARGET_HAS_orc_v64 0
> +#define TCG_TARGET_HAS_not_v64 0
> +#define TCG_TARGET_HAS_neg_v64 0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_v128 0
> +#define TCG_TARGET_HAS_andc_v128 0
> +#define TCG_TARGET_HAS_orc_v128 0
> +#define TCG_TARGET_HAS_not_v128 0
> +#define TCG_TARGET_HAS_neg_v128 0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_v256 0
> +#define TCG_TARGET_HAS_andc_v256 0
> +#define TCG_TARGET_HAS_orc_v256 0
> +#define TCG_TARGET_HAS_not_v256 0
> +#define TCG_TARGET_HAS_neg_v256 0
> +#endif
> +
> /* For 32-bit targets, some sort of unsigned widening multiply is required. */
> #if TCG_TARGET_REG_BITS == 32 \
> && !(defined(TCG_TARGET_HAS_mulu2_i32) \
>
* Re: [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
2017-08-30 1:31 ` Philippe Mathieu-Daudé
@ 2017-09-01 20:38 ` Richard Henderson
0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-01 20:38 UTC (permalink / raw)
To: Philippe Mathieu-Daudé, qemu-devel; +Cc: qemu-arm, alex.bennee
On 08/29/2017 06:31 PM, Philippe Mathieu-Daudé wrote:
> Hi Richard,
>
> I can't find anything to say about this patch... Hardcore stuff.
> Some parts could be a bit more verbose, but after focusing on it for a while
> it makes sense.
> I wonder how long it took you to write this :) "roughly 2h"
Not quite that quickly. ;-)
You're absolutely right that it needs lots more documentation.
I'll improve that when it comes to round 2.
r~
* Re: [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic
2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
2017-08-30 1:31 ` Philippe Mathieu-Daudé
@ 2017-09-07 16:34 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 16:34 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> Makefile.target | 5 +-
> tcg/tcg-op-gvec.h | 88 ++++++++++
> tcg/tcg-runtime.h | 16 ++
> tcg/tcg-op-gvec.c | 443 +++++++++++++++++++++++++++++++++++++++++++++++++
> tcg/tcg-runtime-gvec.c | 199 ++++++++++++++++++++++
> 5 files changed, 749 insertions(+), 2 deletions(-)
> create mode 100644 tcg/tcg-op-gvec.h
> create mode 100644 tcg/tcg-op-gvec.c
> create mode 100644 tcg/tcg-runtime-gvec.c
>
> diff --git a/Makefile.target b/Makefile.target
> index 7f42c45db8..9ae3e904f7 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -93,8 +93,9 @@ all: $(PROGS) stap
> # cpu emulator library
> obj-y += exec.o
> obj-y += accel/
> -obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-op.o tcg/optimize.o
> -obj-$(CONFIG_TCG) += tcg/tcg-common.o tcg/tcg-runtime.o
> +obj-$(CONFIG_TCG) += tcg/tcg.o tcg/tcg-common.o tcg/optimize.o
> +obj-$(CONFIG_TCG) += tcg/tcg-op.o tcg/tcg-op-gvec.o
> +obj-$(CONFIG_TCG) += tcg/tcg-runtime.o tcg/tcg-runtime-gvec.o
> obj-$(CONFIG_TCG_INTERPRETER) += tcg/tci.o
> obj-$(CONFIG_TCG_INTERPRETER) += disas/tci.o
> obj-y += fpu/softfloat.o
> diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
> new file mode 100644
> index 0000000000..10db3599a5
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.h
> @@ -0,0 +1,88 @@
> +/*
> + * Generic vector operation expansion
> + *
> + * Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/*
> + * "Generic" vectors. All operands are given as offsets from ENV,
> + * and therefore cannot also be allocated via tcg_global_mem_new_*.
> + * OPSZ is the byte size of the vector upon which the operation is performed.
> + * CLSZ is the byte size of the full vector; bytes beyond OPSZ are cleared.
> + *
> + * All sizes must be 8 or any multiple of 16.
> + * When OPSZ is 8, the alignment may be 8, otherwise must be 16.
> + * Operands may completely, but not partially, overlap.
Isn't this going to be a problem for narrowing/widening Rn -> Rn operations?
A widening op that reads 8 bytes and writes 16 bytes back at the same offset
shares only its low half with the input, which is exactly the partial overlap
forbidden above. Should we say so explicitly here?
> + */
> +
> +/* Fundamental operation expanders. These are exposed to the front ends
> + so that target-specific SIMD operations can be handled similarly to
> + the standard SIMD operations. */
> +
> +typedef struct {
> + /* "Small" sizes: expand inline as a 64-bit or 32-bit lane.
> + Generally only one of these will be non-NULL. */
Generally or always? After all, we go through these in a fixed order (fni8,
then fni4, then fni8x) and expand the first one that is defined.
> + void (*fni8)(TCGv_i64, TCGv_i64, TCGv_i64);
> + void (*fni4)(TCGv_i32, TCGv_i32, TCGv_i32);
> + /* Similarly, but load up a constant and re-use across lanes. */
> + void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
> + uint64_t extra_value;
Probably personal preference, but I'd move extra_value and any other
non-function-pointer members to the end of the structure for cleaner
readability.
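Roughly, as a sketch of that reordering (same members as above, just
regrouped):

    typedef struct {
        void (*fni8)(TCGv_i64, TCGv_i64, TCGv_i64);
        void (*fni4)(TCGv_i32, TCGv_i32, TCGv_i32);
        void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
        void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
        uint64_t extra_value;
    } GVecGen3;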
> + /* Larger sizes: expand out-of-line helper w/size descriptor. */
> + void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
> +} GVecGen3;
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz, const GVecGen3 *);
> +
Why GVecGen3 and tcg_gen_gvec_3? It seems a little arbitrary.
> +#define DEF_GVEC_2(X) \
> + void tcg_gen_gvec_##X(uint32_t dofs, uint32_t aofs, uint32_t bofs, \
> + uint32_t opsz, uint32_t clsz)
> +
> +DEF_GVEC_2(add8);
> +DEF_GVEC_2(add16);
> +DEF_GVEC_2(add32);
> +DEF_GVEC_2(add64);
> +
> +DEF_GVEC_2(sub8);
> +DEF_GVEC_2(sub16);
> +DEF_GVEC_2(sub32);
> +DEF_GVEC_2(sub64);
> +
> +DEF_GVEC_2(and8);
> +DEF_GVEC_2(or8);
> +DEF_GVEC_2(xor8);
> +DEF_GVEC_2(andc8);
> +DEF_GVEC_2(orc8);
> +
> +#undef DEF_GVEC_2
> +
> +/*
> + * 64-bit vector operations. Use these when the register has been
> + * allocated with tcg_global_mem_new_i64. OPSZ = CLSZ = 8.
> + */
> +
> +#define DEF_VEC8_2(X) \
> + void tcg_gen_vec8_##X(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +
> +DEF_VEC8_2(add8);
> +DEF_VEC8_2(add16);
> +DEF_VEC8_2(add32);
> +
> +DEF_VEC8_2(sub8);
> +DEF_VEC8_2(sub16);
> +DEF_VEC8_2(sub32);
> +
> +#undef DEF_VEC8_2
Again GVEC_2 and VEC8_2 don't tell me much.
> diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
> index c41d38a557..f8d07090f8 100644
> --- a/tcg/tcg-runtime.h
> +++ b/tcg/tcg-runtime.h
> @@ -134,3 +134,19 @@ GEN_ATOMIC_HELPERS(xor_fetch)
> GEN_ATOMIC_HELPERS(xchg)
>
> #undef GEN_ATOMIC_HELPERS
> +
> +DEF_HELPER_FLAGS_4(gvec_add8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_add64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_sub8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub16, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub32, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sub64, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
> +DEF_HELPER_FLAGS_4(gvec_and8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_or8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_xor8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_andc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_orc8, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
> new file mode 100644
> index 0000000000..6de49dc07f
> --- /dev/null
> +++ b/tcg/tcg-op-gvec.c
> @@ -0,0 +1,443 @@
> +/*
> + * Generic vector operation expansion
> + *
> + * Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "cpu.h"
> +#include "exec/exec-all.h"
> +#include "tcg.h"
> +#include "tcg-op.h"
> +#include "tcg-op-gvec.h"
> +#include "trace-tcg.h"
> +#include "trace/mem.h"
> +
> +#define REP8(x) ((x) * 0x0101010101010101ull)
> +#define REP16(x) ((x) * 0x0001000100010001ull)
> +
> +#define MAX_INLINE 16
> +
> +static inline void check_size_s(uint32_t opsz, uint32_t clsz)
> +{
> + tcg_debug_assert(opsz % 8 == 0);
> + tcg_debug_assert(clsz % 8 == 0);
> + tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> + tcg_debug_assert(dofs % 8 == 0);
> + tcg_debug_assert(aofs % 8 == 0);
> + tcg_debug_assert(bofs % 8 == 0);
> +}
> +
> +static inline void check_size_l(uint32_t opsz, uint32_t clsz)
> +{
> + tcg_debug_assert(opsz % 16 == 0);
> + tcg_debug_assert(clsz % 16 == 0);
> + tcg_debug_assert(opsz <= clsz);
> +}
> +
> +static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +{
> + tcg_debug_assert(dofs % 16 == 0);
> + tcg_debug_assert(aofs % 16 == 0);
> + tcg_debug_assert(bofs % 16 == 0);
> +}
> +
> +static inline void check_overlap_3(uint32_t d, uint32_t a,
> + uint32_t b, uint32_t s)
> +{
> + tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> + tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> + tcg_debug_assert(a == b || a + s <= b || b + s <= a);
> +}
> +
> +static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
> +{
> + if (clsz > opsz) {
> + TCGv_i64 zero = tcg_const_i64(0);
> + uint32_t i;
> +
> + for (i = opsz; i < clsz; i += 8) {
> + tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i64(zero);
> + }
> +}
> +
> +static TCGv_i32 make_desc(uint32_t opsz, uint32_t clsz)
A comment about the encoding of opdata into the constant probably
wouldn't go amiss. Should we have some inline helpers to extract the
data for the actual implementations?
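E.g. a block comment above make_desc along these lines would cover the
first point (my wording, inferred from the asserts, so double-check):

/* The descriptor packs both sizes into one i32: bits [7:0] hold
 * opsz / 16 - 1 and bits [15:8] hold clsz / 16 - 1, so each can
 * describe 16 to 4080 bytes in 16-byte units.
 */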
> +{
> + tcg_debug_assert(opsz >= 16 && opsz <= 255 * 16 && opsz % 16 == 0);
> + tcg_debug_assert(clsz >= 16 && clsz <= 255 * 16 && clsz % 16 == 0);
> + opsz /= 16;
> + clsz /= 16;
> + opsz -= 1;
> + clsz -= 1;
> + return tcg_const_i32(deposit32(opsz, 8, 8, clsz));
> +}
> +
> +static void expand_3_o(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz,
> + void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32))
Hmm, this duplicates the function pointer declaration from the struct;
maybe these should be typedefs, declared with comments in
tcg-op-gvec.h? Something like the sketch below.
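(The names here are only placeholders:)

typedef void GVecGen3FnI32(TCGv_i32, TCGv_i32, TCGv_i32);
typedef void GVecGen3FnI64(TCGv_i64, TCGv_i64, TCGv_i64);
typedef void GVecGen3FnI64X(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
typedef void GVecGen3FnOol(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);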
> +{
> + TCGv_ptr d = tcg_temp_new_ptr();
> + TCGv_ptr a = tcg_temp_new_ptr();
> + TCGv_ptr b = tcg_temp_new_ptr();
> + TCGv_i32 desc = make_desc(opsz, clsz);
> +
> + tcg_gen_addi_ptr(d, tcg_ctx.tcg_env, dofs);
> + tcg_gen_addi_ptr(a, tcg_ctx.tcg_env, aofs);
> + tcg_gen_addi_ptr(b, tcg_ctx.tcg_env, bofs);
> + fno(d, a, b, desc);
> +
> + tcg_temp_free_ptr(d);
> + tcg_temp_free_ptr(a);
> + tcg_temp_free_ptr(b);
> + tcg_temp_free_i32(desc);
> +}
> +
> +static void expand_3x4(uint32_t dofs, uint32_t aofs,
> + uint32_t bofs, uint32_t opsz,
> + void (*fni)(TCGv_i32, TCGv_i32, TCGv_i32))
Ditto typedef?
> +{
> + TCGv_i32 t0 = tcg_temp_new_i32();
> + uint32_t i;
> +
> + if (aofs == bofs) {
> + for (i = 0; i < opsz; i += 4) {
> + tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> + fni(t0, t0, t0);
> + tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + } else {
> + TCGv_i32 t1 = tcg_temp_new_i32();
> + for (i = 0; i < opsz; i += 4) {
> + tcg_gen_ld_i32(t0, tcg_ctx.tcg_env, aofs + i);
> + tcg_gen_ld_i32(t1, tcg_ctx.tcg_env, bofs + i);
> + fni(t0, t0, t1);
> + tcg_gen_st_i32(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i32(t1);
> + }
> + tcg_temp_free_i32(t0);
> +}
> +
> +static void expand_3x8(uint32_t dofs, uint32_t aofs,
> + uint32_t bofs, uint32_t opsz,
> + void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64))
> +{
> + TCGv_i64 t0 = tcg_temp_new_i64();
> + uint32_t i;
> +
> + if (aofs == bofs) {
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + fni(t0, t0, t0);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + } else {
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> + fni(t0, t0, t1);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i64(t1);
> + }
> + tcg_temp_free_i64(t0);
> +}
> +
> +static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint64_t data,
> + void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
Again, a typedef would help.
I don't quite follow the suffixes of the expanders. I guess _o is for
offset (or out-of-line?), but what is p1? Either we need a mini comment
for each expander or a more obvious suffix scheme, e.g. something like
the sketch below...
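(My attempt at decoding the suffixes from the bodies above; corrections
welcome:)

/* expand_3_o: expand via an out-of-line helper with a size descriptor */
/* expand_3x4: expand inline as a loop of 32-bit load/op/store */
/* expand_3x8: expand inline as a loop of 64-bit load/op/store */
/* expand_3x8p1: as expand_3x8, plus one extra constant operand */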
> +{
> + TCGv_i64 t0 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_const_i64(data);
> + uint32_t i;
> +
> + if (aofs == bofs) {
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + fni(t0, t0, t0, t2);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + } else {
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + for (i = 0; i < opsz; i += 8) {
> + tcg_gen_ld_i64(t0, tcg_ctx.tcg_env, aofs + i);
> + tcg_gen_ld_i64(t1, tcg_ctx.tcg_env, bofs + i);
> + fni(t0, t0, t1, t2);
> + tcg_gen_st_i64(t0, tcg_ctx.tcg_env, dofs + i);
> + }
> + tcg_temp_free_i64(t1);
> + }
> + tcg_temp_free_i64(t0);
> + tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
> +{
> + check_overlap_3(dofs, aofs, bofs, clsz);
> + if (opsz <= MAX_INLINE) {
> + check_size_s(opsz, clsz);
> + check_align_s_3(dofs, aofs, bofs);
> + if (g->fni8) {
> + expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
> + } else if (g->fni4) {
> + expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
> + } else if (g->fni8x) {
> + expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
> + } else {
> + g_assert_not_reached();
> + }
> + expand_clr(dofs, opsz, clsz);
> + } else {
> + check_size_l(opsz, clsz);
> + check_align_l_3(dofs, aofs, bofs);
> + expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
> + }
> +}
> +
> +static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> + TCGv_i64 t3 = tcg_temp_new_i64();
> +
> + tcg_gen_andc_i64(t1, a, m);
> + tcg_gen_andc_i64(t2, b, m);
> + tcg_gen_xor_i64(t3, a, b);
> + tcg_gen_add_i64(d, t1, t2);
> + tcg_gen_and_i64(t3, t3, m);
> + tcg_gen_xor_i64(d, d, t3);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> + tcg_temp_free_i64(t3);
> +}
> +
> +void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP8(0x80),
> + .fni8x = gen_addv_mask,
> + .fno = gen_helper_gvec_add8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP16(0x8000),
> + .fni8x = gen_addv_mask,
> + .fno = gen_helper_gvec_add16,
OK, now I'm confused - we have two functions here, and tcg_gen_gvec_3
expands only one of them depending on which leg is taken for opsz. One
is a mask-based inline expansion and the other an out-of-line add
helper?
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni4 = tcg_gen_add_i32,
> + .fno = gen_helper_gvec_add32,
Ahh, OK, I see now: use the native add_i32 inline for small sizes, and
call the generic helper for larger vectors. Still confused about the
previous mask expander though...
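...actually, having stared at gen_addv_mask a bit longer, I think it is
the usual SWAR trick. In plain C (untested):

uint64_t m = 0x8080808080808080ull;  /* top bit of each byte lane */
uint64_t d = ((a & ~m) + (b & ~m))   /* add with lane top bits cleared */
           ^ ((a ^ b) & m);          /* patch the top bits back in */

Clearing the top bit of every lane before the add guarantees no carry
can cross a lane boundary; the final xor then restores the correct top
bit of each lane (a ^ b ^ carry-in). A comment to that effect in the
code would help.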
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_add_i64,
> + .fno = gen_helper_gvec_add64,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_add8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP8(0x80));
> + gen_addv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> + gen_addv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_add32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> +
> + tcg_gen_andi_i64(t1, a, ~0xffffffffull);
> + tcg_gen_add_i64(t2, a, b);
> + tcg_gen_add_i64(t1, t1, b);
> + tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> +}
> +
> +static void gen_subv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> + TCGv_i64 t3 = tcg_temp_new_i64();
> +
> + tcg_gen_or_i64(t1, a, m);
> + tcg_gen_andc_i64(t2, b, m);
> + tcg_gen_eqv_i64(t3, a, b);
> + tcg_gen_sub_i64(d, t1, t2);
> + tcg_gen_and_i64(t3, t3, m);
> + tcg_gen_xor_i64(d, d, t3);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> + tcg_temp_free_i64(t3);
> +}
> +
> +void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP8(0x80),
> + .fni8x = gen_subv_mask,
> + .fno = gen_helper_gvec_sub8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .extra_value = REP16(0x8000),
> + .fni8x = gen_subv_mask,
> + .fno = gen_helper_gvec_sub16,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni4 = tcg_gen_sub_i32,
> + .fno = gen_helper_gvec_sub32,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_sub_i64,
> + .fno = gen_helper_gvec_sub64,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_vec8_sub8(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP8(0x80));
> + gen_subv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub16(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 m = tcg_const_i64(REP16(0x8000));
> + gen_subv_mask(d, a, b, m);
> + tcg_temp_free_i64(m);
> +}
> +
> +void tcg_gen_vec8_sub32(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> + TCGv_i64 t1 = tcg_temp_new_i64();
> + TCGv_i64 t2 = tcg_temp_new_i64();
> +
> + tcg_gen_andi_i64(t1, b, ~0xffffffffull);
> + tcg_gen_sub_i64(t2, a, b);
> + tcg_gen_sub_i64(t1, a, t1);
> + tcg_gen_deposit_i64(d, t1, t2, 0, 32);
> +
> + tcg_temp_free_i64(t1);
> + tcg_temp_free_i64(t2);
> +}
> +
> +void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_and_i64,
> + .fno = gen_helper_gvec_and8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_or_i64,
> + .fno = gen_helper_gvec_or8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_xor_i64,
> + .fno = gen_helper_gvec_xor8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_andc_i64,
> + .fno = gen_helper_gvec_andc8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> +
> +void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t clsz)
> +{
> + static const GVecGen3 g = {
> + .fni8 = tcg_gen_orc_i64,
> + .fno = gen_helper_gvec_orc8,
> + };
> + tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> +}
> diff --git a/tcg/tcg-runtime-gvec.c b/tcg/tcg-runtime-gvec.c
> new file mode 100644
> index 0000000000..9a37ce07a2
> --- /dev/null
> +++ b/tcg/tcg-runtime-gvec.c
> @@ -0,0 +1,199 @@
> +/*
> + * Generic vectorized operation runtime
> + *
> + * Copyright (c) 2017 Linaro
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/host-utils.h"
> +#include "cpu.h"
> +#include "exec/helper-proto.h"
> +
> +/* Virtually all hosts support 16-byte vectors. Those that don't
> + can emulate them via GCC's generic vector extension.
> +
> + In tcg-op-gvec.c, we asserted that both the size and alignment
> + of the data are multiples of 16. */
> +
> +typedef uint8_t vec8 __attribute__((vector_size(16)));
> +typedef uint16_t vec16 __attribute__((vector_size(16)));
> +typedef uint32_t vec32 __attribute__((vector_size(16)));
> +typedef uint64_t vec64 __attribute__((vector_size(16)));
> +
> +static inline intptr_t extract_opsz(uint32_t desc)
> +{
> + return ((desc & 0xff) + 1) * 16;
> +}
> +
> +static inline intptr_t extract_clsz(uint32_t desc)
> +{
> + return (((desc >> 8) & 0xff) + 1) * 16;
> +}
Ahh, the data helpers. Any reason we don't use extract32() here, given
we used deposit32() at the other end? It should generate the most
efficient code, right? I.e. something like the sketch below.
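(Untested, but should be equivalent to the open-coded mask and shift:)

static inline intptr_t extract_opsz(uint32_t desc)
{
    return (extract32(desc, 0, 8) + 1) * 16;
}

static inline intptr_t extract_clsz(uint32_t desc)
{
    return (extract32(desc, 8, 8) + 1) * 16;
}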
> +
> +static inline void clear_high(void *d, intptr_t opsz, uint32_t desc)
> +{
> + intptr_t clsz = extract_clsz(desc);
> + intptr_t i;
> +
> + if (unlikely(clsz > opsz)) {
> + for (i = opsz; i < clsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = (vec64){ 0 };
> + }
> + }
> +}
> +
> +void HELPER(gvec_add8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec8)) {
> + *(vec8 *)(d + i) = *(vec8 *)(a + i) + *(vec8 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add16)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec16)) {
> + *(vec16 *)(d + i) = *(vec16 *)(a + i) + *(vec16 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add32)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec32)) {
> + *(vec32 *)(d + i) = *(vec32 *)(a + i) + *(vec32 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_add64)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) + *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec8)) {
> + *(vec8 *)(d + i) = *(vec8 *)(a + i) - *(vec8 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub16)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec16)) {
> + *(vec16 *)(d + i) = *(vec16 *)(a + i) - *(vec16 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub32)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec32)) {
> + *(vec32 *)(d + i) = *(vec32 *)(a + i) - *(vec32 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_sub64)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) - *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_and8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) & *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_or8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) | *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_xor8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) ^ *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_andc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) &~ *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
> +
> +void HELPER(gvec_orc8)(void *d, void *a, void *b, uint32_t desc)
> +{
> + intptr_t opsz = extract_opsz(desc);
> + intptr_t i;
> +
> + for (i = 0; i < opsz; i += sizeof(vec64)) {
> + *(vec64 *)(d + i) = *(vec64 *)(a + i) |~ *(vec64 *)(b + i);
> + }
> + clear_high(d, opsz, desc);
> +}
OK I can follow the helpers easily enough. I think the generators just
need to be a little clearer for non-authors to follow ;-)
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
@ 2017-09-07 16:58 ` Alex Bennée
2017-09-10 1:43 ` Richard Henderson
0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 16:58 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> target/arm/translate-a64.c | 137 ++++++++++++++++++++++++++++-----------------
> 1 file changed, 87 insertions(+), 50 deletions(-)
>
> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
> index 2200e25be0..025354f983 100644
> --- a/target/arm/translate-a64.c
> +++ b/target/arm/translate-a64.c
> @@ -21,6 +21,7 @@
> #include "cpu.h"
> #include "exec/exec-all.h"
> #include "tcg-op.h"
> +#include "tcg-op-gvec.h"
> #include "qemu/log.h"
> #include "arm_ldst.h"
> #include "translate.h"
> @@ -82,6 +83,7 @@ typedef void NeonGenTwoDoubleOPFn(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_ptr);
> typedef void NeonGenOneOpFn(TCGv_i64, TCGv_i64);
> typedef void CryptoTwoOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32);
> typedef void CryptoThreeOpEnvFn(TCGv_ptr, TCGv_i32, TCGv_i32, TCGv_i32);
> +typedef void GVecGenTwoFn(uint32_t, uint32_t, uint32_t, uint32_t, uint32_t);
>
> /* initialize TCG globals. */
> void a64_translate_init(void)
> @@ -537,6 +539,21 @@ static inline int vec_reg_offset(DisasContext *s, int regno,
> return offs;
> }
>
> +/* Return the offset into CPUARMState of the "whole" vector register Qn. */
> +static inline int vec_full_reg_offset(DisasContext *s, int regno)
> +{
> + assert_fp_access_checked(s);
> + return offsetof(CPUARMState, vfp.regs[regno * 2]);
> +}
> +
> +/* Return the byte size of the "whole" vector register, VL / 8. */
> +static inline int vec_full_reg_size(DisasContext *s)
> +{
> + /* FIXME SVE: We should put the composite ZCR_EL* value into tb->flags.
> + In the meantime this is just the AdvSIMD length of 128. */
> + return 128 / 8;
> +}
> +
> /* Return the offset into CPUARMState of a slice (from
> * the least significant end) of FP register Qn (ie
> * Dn, Sn, Hn or Bn).
> @@ -9042,11 +9059,38 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
> bool is_q = extract32(insn, 30, 1);
> TCGv_i64 tcg_op1, tcg_op2, tcg_res[2];
> int pass;
> + GVecGenTwoFn *gvec_op;
>
> if (!fp_access_check(s)) {
> return;
> }
>
> + switch (size + 4 * is_u) {
Hmm, I find this switch a little too magical. I can see from the manual
that the encoding abuses size as part of the final opcode, but as
written it reads badly.
> + case 0: /* AND */
> + gvec_op = tcg_gen_gvec_and8;
> + goto do_gvec;
> + case 1: /* BIC */
> + gvec_op = tcg_gen_gvec_andc8;
> + goto do_gvec;
> + case 2: /* ORR */
> + gvec_op = tcg_gen_gvec_or8;
> + goto do_gvec;
> + case 3: /* ORN */
> + gvec_op = tcg_gen_gvec_orc8;
> + goto do_gvec;
> + case 4: /* EOR */
> + gvec_op = tcg_gen_gvec_xor8;
> + goto do_gvec;
> + do_gvec:
> + gvec_op(vec_full_reg_offset(s, rd),
> + vec_full_reg_offset(s, rn),
> + vec_full_reg_offset(s, rm),
> + is_q ? 16 : 8, vec_full_reg_size(s));
> + return;
There's no default case (although I guess we just fall through). What's
wrong with just having a !is_u test with gvec_op = tbl[size] and
skipping all the goto stuff? E.g. the sketch below.
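(Untested sketch, borrowing the table style used later in the patch:)

static GVecGenTwoFn * const tbl[4] = {
    tcg_gen_gvec_and8, tcg_gen_gvec_andc8,
    tcg_gen_gvec_or8, tcg_gen_gvec_orc8,
};
if (!is_u) {
    tbl[size](vec_full_reg_offset(s, rd),
              vec_full_reg_offset(s, rn),
              vec_full_reg_offset(s, rm),
              is_q ? 16 : 8, vec_full_reg_size(s));
    return;
} else if (size == 0) { /* EOR */
    tcg_gen_gvec_xor8(vec_full_reg_offset(s, rd),
                      vec_full_reg_offset(s, rn),
                      vec_full_reg_offset(s, rm),
                      is_q ? 16 : 8, vec_full_reg_size(s));
    return;
}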
> + }
> +
> + /* Note that we've now eliminated all !is_u. */
> +
> tcg_op1 = tcg_temp_new_i64();
> tcg_op2 = tcg_temp_new_i64();
> tcg_res[0] = tcg_temp_new_i64();
> @@ -9056,47 +9100,27 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
> read_vec_element(s, tcg_op1, rn, pass, MO_64);
> read_vec_element(s, tcg_op2, rm, pass, MO_64);
>
> - if (!is_u) {
> - switch (size) {
> - case 0: /* AND */
> - tcg_gen_and_i64(tcg_res[pass], tcg_op1, tcg_op2);
> - break;
> - case 1: /* BIC */
> - tcg_gen_andc_i64(tcg_res[pass], tcg_op1, tcg_op2);
> - break;
> - case 2: /* ORR */
> - tcg_gen_or_i64(tcg_res[pass], tcg_op1, tcg_op2);
> - break;
> - case 3: /* ORN */
> - tcg_gen_orc_i64(tcg_res[pass], tcg_op1, tcg_op2);
> - break;
> - }
> - } else {
> - if (size != 0) {
> - /* B* ops need res loaded to operate on */
> - read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
> - }
> + /* B* ops need res loaded to operate on */
> + read_vec_element(s, tcg_res[pass], rd, pass, MO_64);
>
> - switch (size) {
> - case 0: /* EOR */
> - tcg_gen_xor_i64(tcg_res[pass], tcg_op1, tcg_op2);
> - break;
> - case 1: /* BSL bitwise select */
> - tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
> - tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> - tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
> - break;
> - case 2: /* BIT, bitwise insert if true */
> - tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> - tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
> - tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> - break;
> - case 3: /* BIF, bitwise insert if false */
> - tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> - tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
> - tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> - break;
> - }
> + switch (size) {
> + case 1: /* BSL bitwise select */
> + tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_op2);
> + tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> + tcg_gen_xor_i64(tcg_res[pass], tcg_op2, tcg_op1);
> + break;
> + case 2: /* BIT, bitwise insert if true */
> + tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> + tcg_gen_and_i64(tcg_op1, tcg_op1, tcg_op2);
> + tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> + break;
> + case 3: /* BIF, bitwise insert if false */
> + tcg_gen_xor_i64(tcg_op1, tcg_op1, tcg_res[pass]);
> + tcg_gen_andc_i64(tcg_op1, tcg_op1, tcg_op2);
> + tcg_gen_xor_i64(tcg_res[pass], tcg_res[pass], tcg_op1);
> + break;
> + default:
> + g_assert_not_reached();
> }
> }
>
> @@ -9370,6 +9394,7 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
> int rn = extract32(insn, 5, 5);
> int rd = extract32(insn, 0, 5);
> int pass;
> + GVecGenTwoFn *gvec_op;
>
> switch (opcode) {
> case 0x13: /* MUL, PMUL */
> @@ -9409,6 +9434,28 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
> return;
> }
>
> + switch (opcode) {
> + case 0x10: /* ADD, SUB */
> + {
> + static GVecGenTwoFn * const fns[4][2] = {
> + { tcg_gen_gvec_add8, tcg_gen_gvec_sub8 },
> + { tcg_gen_gvec_add16, tcg_gen_gvec_sub16 },
> + { tcg_gen_gvec_add32, tcg_gen_gvec_sub32 },
> + { tcg_gen_gvec_add64, tcg_gen_gvec_sub64 },
> + };
> + gvec_op = fns[size][u];
> + goto do_gvec;
> + }
> + break;
> +
> + do_gvec:
> + gvec_op(vec_full_reg_offset(s, rd),
> + vec_full_reg_offset(s, rn),
> + vec_full_reg_offset(s, rm),
> + is_q ? 16 : 8, vec_full_reg_size(s));
> + return;
> + }
> +
> if (size == 3) {
> assert(is_q);
> for (pass = 0; pass < 2; pass++) {
> @@ -9581,16 +9628,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
> genfn = fns[size][u];
> break;
> }
> - case 0x10: /* ADD, SUB */
> - {
> - static NeonGenTwoOpFn * const fns[3][2] = {
> - { gen_helper_neon_add_u8, gen_helper_neon_sub_u8 },
> - { gen_helper_neon_add_u16, gen_helper_neon_sub_u16 },
> - { tcg_gen_add_i32, tcg_gen_sub_i32 },
> - };
> - genfn = fns[size][u];
> - break;
> - }
> case 0x11: /* CMTST, CMEQ */
> {
> static NeonGenTwoOpFn * const fns[3][2] = {
Other than the comments on the switch the rest looks good to me.
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors
2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
2017-08-17 23:46 ` Philippe Mathieu-Daudé
@ 2017-09-07 18:18 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 18:18 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Nothing uses or enables them yet.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> tcg/tcg.h | 5 +++++
> tcg/tcg.c | 2 +-
> 2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index dd97095af5..1277caed3d 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -256,6 +256,11 @@ typedef struct TCGPool {
> typedef enum TCGType {
> TCG_TYPE_I32,
> TCG_TYPE_I64,
> +
> + TCG_TYPE_V64,
> + TCG_TYPE_V128,
> + TCG_TYPE_V256,
> +
> TCG_TYPE_COUNT, /* number of different types */
>
> /* An alias for the size of the host register. */
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 787c8ba0f7..ea78d47fad 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -118,7 +118,7 @@ static TCGReg tcg_reg_alloc_new(TCGContext *s, TCGType t)
> static bool tcg_out_ldst_finalize(TCGContext *s);
> #endif
>
> -static TCGRegSet tcg_target_available_regs[2];
> +static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
> static TCGRegSet tcg_target_call_clobber_regs;
>
> #if TCG_TARGET_INSN_UNIT_SIZE == 1
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
2017-08-30 1:34 ` Philippe Mathieu-Daudé
@ 2017-09-07 19:00 ` Alex Bennée
2017-09-07 19:02 ` Richard Henderson
1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 19:00 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Nothing uses or implements them yet.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> tcg/tcg.h | 24 ++++++++++++++++
> 2 files changed, 113 insertions(+)
>
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 956fb1e9f3..9162125fac 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>
> #undef TLADDR_ARGS
> #undef DATA64_ARGS
> +
> +/* Host integer vector operations. */
> +/* These opcodes are required whenever the base vector size is enabled. */
> +
> +DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
> +DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
> +DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
> +
> +DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
> +
> +DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
> +
> +/* These opcodes are optional.
> + All element counts must be supported if any are. */
> +
> +DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
> +DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
> +DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
> +
> +DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
> +DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
> +DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
> +
> +DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
> +DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
> +DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
> +
> +DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
> +
> +DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
> +
> +DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
> +
> #undef IMPL
> #undef IMPL64
> #undef DEF
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 1277caed3d..b9e15da13b 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
> #define TCG_TARGET_HAS_rem_i64 0
> #endif
>
> +#ifndef TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_v64 0
> +#define TCG_TARGET_HAS_andc_v64 0
> +#define TCG_TARGET_HAS_orc_v64 0
> +#define TCG_TARGET_HAS_not_v64 0
> +#define TCG_TARGET_HAS_neg_v64 0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_v128 0
> +#define TCG_TARGET_HAS_andc_v128 0
> +#define TCG_TARGET_HAS_orc_v128 0
> +#define TCG_TARGET_HAS_not_v128 0
> +#define TCG_TARGET_HAS_neg_v128 0
> +#endif
> +
> +#ifndef TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_v256 0
> +#define TCG_TARGET_HAS_andc_v256 0
> +#define TCG_TARGET_HAS_orc_v256 0
> +#define TCG_TARGET_HAS_not_v256 0
> +#define TCG_TARGET_HAS_neg_v256 0
> +#endif
Is it possible to use the DEF expanders to avoid manually defining all
the TCG_TARGET_HAS_op for each vector size?
> +
> /* For 32-bit targets, some sort of unsigned widening multiply is required. */
> #if TCG_TARGET_REG_BITS == 32 \
> && !(defined(TCG_TARGET_HAS_mulu2_i32) \
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported
2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
2017-08-17 23:44 ` Philippe Mathieu-Daudé
@ 2017-09-07 19:02 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-07 19:02 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> tcg/tcg.h | 2 +
> tcg/tcg.c | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 312 insertions(+)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index b9e15da13b..b443143b21 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -962,6 +962,8 @@ do {\
> #define tcg_temp_free_ptr(T) tcg_temp_free_i64(TCGV_PTR_TO_NAT(T))
> #endif
>
> +bool tcg_op_supported(TCGOpcode op);
> +
> void tcg_gen_callN(TCGContext *s, void *func,
> TCGArg ret, int nargs, TCGArg *args);
>
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index ea78d47fad..3c3cdda938 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -751,6 +751,316 @@ int tcg_check_temp_count(void)
> }
> #endif
>
> +/* Return true if OP may appear in the opcode stream.
> + Test the runtime variable that controls each opcode. */
> +bool tcg_op_supported(TCGOpcode op)
> +{
> + switch (op) {
> + case INDEX_op_discard:
> + case INDEX_op_set_label:
> + case INDEX_op_call:
> + case INDEX_op_br:
> + case INDEX_op_mb:
> + case INDEX_op_insn_start:
> + case INDEX_op_exit_tb:
> + case INDEX_op_goto_tb:
> + case INDEX_op_qemu_ld_i32:
> + case INDEX_op_qemu_st_i32:
> + case INDEX_op_qemu_ld_i64:
> + case INDEX_op_qemu_st_i64:
> + return true;
> +
> + case INDEX_op_goto_ptr:
> + return TCG_TARGET_HAS_goto_ptr;
> +
> + case INDEX_op_mov_i32:
> + case INDEX_op_movi_i32:
> + case INDEX_op_setcond_i32:
> + case INDEX_op_brcond_i32:
> + case INDEX_op_ld8u_i32:
> + case INDEX_op_ld8s_i32:
> + case INDEX_op_ld16u_i32:
> + case INDEX_op_ld16s_i32:
> + case INDEX_op_ld_i32:
> + case INDEX_op_st8_i32:
> + case INDEX_op_st16_i32:
> + case INDEX_op_st_i32:
> + case INDEX_op_add_i32:
> + case INDEX_op_sub_i32:
> + case INDEX_op_mul_i32:
> + case INDEX_op_and_i32:
> + case INDEX_op_or_i32:
> + case INDEX_op_xor_i32:
> + case INDEX_op_shl_i32:
> + case INDEX_op_shr_i32:
> + case INDEX_op_sar_i32:
> + return true;
> +
> + case INDEX_op_movcond_i32:
> + return TCG_TARGET_HAS_movcond_i32;
> + case INDEX_op_div_i32:
> + case INDEX_op_divu_i32:
> + return TCG_TARGET_HAS_div_i32;
> + case INDEX_op_rem_i32:
> + case INDEX_op_remu_i32:
> + return TCG_TARGET_HAS_rem_i32;
> + case INDEX_op_div2_i32:
> + case INDEX_op_divu2_i32:
> + return TCG_TARGET_HAS_div2_i32;
> + case INDEX_op_rotl_i32:
> + case INDEX_op_rotr_i32:
> + return TCG_TARGET_HAS_rot_i32;
> + case INDEX_op_deposit_i32:
> + return TCG_TARGET_HAS_deposit_i32;
> + case INDEX_op_extract_i32:
> + return TCG_TARGET_HAS_extract_i32;
> + case INDEX_op_sextract_i32:
> + return TCG_TARGET_HAS_sextract_i32;
> + case INDEX_op_add2_i32:
> + return TCG_TARGET_HAS_add2_i32;
> + case INDEX_op_sub2_i32:
> + return TCG_TARGET_HAS_sub2_i32;
> + case INDEX_op_mulu2_i32:
> + return TCG_TARGET_HAS_mulu2_i32;
> + case INDEX_op_muls2_i32:
> + return TCG_TARGET_HAS_muls2_i32;
> + case INDEX_op_muluh_i32:
> + return TCG_TARGET_HAS_muluh_i32;
> + case INDEX_op_mulsh_i32:
> + return TCG_TARGET_HAS_mulsh_i32;
> + case INDEX_op_ext8s_i32:
> + return TCG_TARGET_HAS_ext8s_i32;
> + case INDEX_op_ext16s_i32:
> + return TCG_TARGET_HAS_ext16s_i32;
> + case INDEX_op_ext8u_i32:
> + return TCG_TARGET_HAS_ext8u_i32;
> + case INDEX_op_ext16u_i32:
> + return TCG_TARGET_HAS_ext16u_i32;
> + case INDEX_op_bswap16_i32:
> + return TCG_TARGET_HAS_bswap16_i32;
> + case INDEX_op_bswap32_i32:
> + return TCG_TARGET_HAS_bswap32_i32;
> + case INDEX_op_not_i32:
> + return TCG_TARGET_HAS_not_i32;
> + case INDEX_op_neg_i32:
> + return TCG_TARGET_HAS_neg_i32;
> + case INDEX_op_andc_i32:
> + return TCG_TARGET_HAS_andc_i32;
> + case INDEX_op_orc_i32:
> + return TCG_TARGET_HAS_orc_i32;
> + case INDEX_op_eqv_i32:
> + return TCG_TARGET_HAS_eqv_i32;
> + case INDEX_op_nand_i32:
> + return TCG_TARGET_HAS_nand_i32;
> + case INDEX_op_nor_i32:
> + return TCG_TARGET_HAS_nor_i32;
> + case INDEX_op_clz_i32:
> + return TCG_TARGET_HAS_clz_i32;
> + case INDEX_op_ctz_i32:
> + return TCG_TARGET_HAS_ctz_i32;
> + case INDEX_op_ctpop_i32:
> + return TCG_TARGET_HAS_ctpop_i32;
> +
> + case INDEX_op_brcond2_i32:
> + case INDEX_op_setcond2_i32:
> + return TCG_TARGET_REG_BITS == 32;
> +
> + case INDEX_op_mov_i64:
> + case INDEX_op_movi_i64:
> + case INDEX_op_setcond_i64:
> + case INDEX_op_brcond_i64:
> + case INDEX_op_ld8u_i64:
> + case INDEX_op_ld8s_i64:
> + case INDEX_op_ld16u_i64:
> + case INDEX_op_ld16s_i64:
> + case INDEX_op_ld32u_i64:
> + case INDEX_op_ld32s_i64:
> + case INDEX_op_ld_i64:
> + case INDEX_op_st8_i64:
> + case INDEX_op_st16_i64:
> + case INDEX_op_st32_i64:
> + case INDEX_op_st_i64:
> + case INDEX_op_add_i64:
> + case INDEX_op_sub_i64:
> + case INDEX_op_mul_i64:
> + case INDEX_op_and_i64:
> + case INDEX_op_or_i64:
> + case INDEX_op_xor_i64:
> + case INDEX_op_shl_i64:
> + case INDEX_op_shr_i64:
> + case INDEX_op_sar_i64:
> + case INDEX_op_ext_i32_i64:
> + case INDEX_op_extu_i32_i64:
> + return TCG_TARGET_REG_BITS == 64;
> +
> + case INDEX_op_movcond_i64:
> + return TCG_TARGET_HAS_movcond_i64;
> + case INDEX_op_div_i64:
> + case INDEX_op_divu_i64:
> + return TCG_TARGET_HAS_div_i64;
> + case INDEX_op_rem_i64:
> + case INDEX_op_remu_i64:
> + return TCG_TARGET_HAS_rem_i64;
> + case INDEX_op_div2_i64:
> + case INDEX_op_divu2_i64:
> + return TCG_TARGET_HAS_div2_i64;
> + case INDEX_op_rotl_i64:
> + case INDEX_op_rotr_i64:
> + return TCG_TARGET_HAS_rot_i64;
> + case INDEX_op_deposit_i64:
> + return TCG_TARGET_HAS_deposit_i64;
> + case INDEX_op_extract_i64:
> + return TCG_TARGET_HAS_extract_i64;
> + case INDEX_op_sextract_i64:
> + return TCG_TARGET_HAS_sextract_i64;
> + case INDEX_op_extrl_i64_i32:
> + return TCG_TARGET_HAS_extrl_i64_i32;
> + case INDEX_op_extrh_i64_i32:
> + return TCG_TARGET_HAS_extrh_i64_i32;
> + case INDEX_op_ext8s_i64:
> + return TCG_TARGET_HAS_ext8s_i64;
> + case INDEX_op_ext16s_i64:
> + return TCG_TARGET_HAS_ext16s_i64;
> + case INDEX_op_ext32s_i64:
> + return TCG_TARGET_HAS_ext32s_i64;
> + case INDEX_op_ext8u_i64:
> + return TCG_TARGET_HAS_ext8u_i64;
> + case INDEX_op_ext16u_i64:
> + return TCG_TARGET_HAS_ext16u_i64;
> + case INDEX_op_ext32u_i64:
> + return TCG_TARGET_HAS_ext32u_i64;
> + case INDEX_op_bswap16_i64:
> + return TCG_TARGET_HAS_bswap16_i64;
> + case INDEX_op_bswap32_i64:
> + return TCG_TARGET_HAS_bswap32_i64;
> + case INDEX_op_bswap64_i64:
> + return TCG_TARGET_HAS_bswap64_i64;
> + case INDEX_op_not_i64:
> + return TCG_TARGET_HAS_not_i64;
> + case INDEX_op_neg_i64:
> + return TCG_TARGET_HAS_neg_i64;
> + case INDEX_op_andc_i64:
> + return TCG_TARGET_HAS_andc_i64;
> + case INDEX_op_orc_i64:
> + return TCG_TARGET_HAS_orc_i64;
> + case INDEX_op_eqv_i64:
> + return TCG_TARGET_HAS_eqv_i64;
> + case INDEX_op_nand_i64:
> + return TCG_TARGET_HAS_nand_i64;
> + case INDEX_op_nor_i64:
> + return TCG_TARGET_HAS_nor_i64;
> + case INDEX_op_clz_i64:
> + return TCG_TARGET_HAS_clz_i64;
> + case INDEX_op_ctz_i64:
> + return TCG_TARGET_HAS_ctz_i64;
> + case INDEX_op_ctpop_i64:
> + return TCG_TARGET_HAS_ctpop_i64;
> + case INDEX_op_add2_i64:
> + return TCG_TARGET_HAS_add2_i64;
> + case INDEX_op_sub2_i64:
> + return TCG_TARGET_HAS_sub2_i64;
> + case INDEX_op_mulu2_i64:
> + return TCG_TARGET_HAS_mulu2_i64;
> + case INDEX_op_muls2_i64:
> + return TCG_TARGET_HAS_muls2_i64;
> + case INDEX_op_muluh_i64:
> + return TCG_TARGET_HAS_muluh_i64;
> + case INDEX_op_mulsh_i64:
> + return TCG_TARGET_HAS_mulsh_i64;
> +
> + case INDEX_op_mov_v64:
> + case INDEX_op_movi_v64:
> + case INDEX_op_ld_v64:
> + case INDEX_op_st_v64:
> + case INDEX_op_and_v64:
> + case INDEX_op_or_v64:
> + case INDEX_op_xor_v64:
> + case INDEX_op_add8_v64:
> + case INDEX_op_add16_v64:
> + case INDEX_op_add32_v64:
> + case INDEX_op_sub8_v64:
> + case INDEX_op_sub16_v64:
> + case INDEX_op_sub32_v64:
> + return TCG_TARGET_HAS_v64;
> +
> + case INDEX_op_mov_v128:
> + case INDEX_op_movi_v128:
> + case INDEX_op_ld_v128:
> + case INDEX_op_st_v128:
> + case INDEX_op_and_v128:
> + case INDEX_op_or_v128:
> + case INDEX_op_xor_v128:
> + case INDEX_op_add8_v128:
> + case INDEX_op_add16_v128:
> + case INDEX_op_add32_v128:
> + case INDEX_op_add64_v128:
> + case INDEX_op_sub8_v128:
> + case INDEX_op_sub16_v128:
> + case INDEX_op_sub32_v128:
> + case INDEX_op_sub64_v128:
> + return TCG_TARGET_HAS_v128;
> +
> + case INDEX_op_mov_v256:
> + case INDEX_op_movi_v256:
> + case INDEX_op_ld_v256:
> + case INDEX_op_st_v256:
> + case INDEX_op_and_v256:
> + case INDEX_op_or_v256:
> + case INDEX_op_xor_v256:
> + case INDEX_op_add8_v256:
> + case INDEX_op_add16_v256:
> + case INDEX_op_add32_v256:
> + case INDEX_op_add64_v256:
> + case INDEX_op_sub8_v256:
> + case INDEX_op_sub16_v256:
> + case INDEX_op_sub32_v256:
> + case INDEX_op_sub64_v256:
> + return TCG_TARGET_HAS_v256;
> +
> + case INDEX_op_not_v64:
> + return TCG_TARGET_HAS_not_v64;
> + case INDEX_op_not_v128:
> + return TCG_TARGET_HAS_not_v128;
> + case INDEX_op_not_v256:
> + return TCG_TARGET_HAS_not_v256;
> +
> + case INDEX_op_andc_v64:
> + return TCG_TARGET_HAS_andc_v64;
> + case INDEX_op_andc_v128:
> + return TCG_TARGET_HAS_andc_v128;
> + case INDEX_op_andc_v256:
> + return TCG_TARGET_HAS_andc_v256;
> +
> + case INDEX_op_orc_v64:
> + return TCG_TARGET_HAS_orc_v64;
> + case INDEX_op_orc_v128:
> + return TCG_TARGET_HAS_orc_v128;
> + case INDEX_op_orc_v256:
> + return TCG_TARGET_HAS_orc_v256;
> +
> + case INDEX_op_neg8_v64:
> + case INDEX_op_neg16_v64:
> + case INDEX_op_neg32_v64:
> + return TCG_TARGET_HAS_neg_v64;
> +
> + case INDEX_op_neg8_v128:
> + case INDEX_op_neg16_v128:
> + case INDEX_op_neg32_v128:
> + case INDEX_op_neg64_v128:
> + return TCG_TARGET_HAS_neg_v128;
> +
> + case INDEX_op_neg8_v256:
> + case INDEX_op_neg16_v256:
> + case INDEX_op_neg32_v256:
> + case INDEX_op_neg64_v256:
> + return TCG_TARGET_HAS_neg_v256;
> +
> + case NB_OPS:
> + break;
> + }
> + g_assert_not_reached();
> +}
> +
> /* Note: we convert the 64 bit args to 32 bit and do some alignment
> and endian swap. Maybe it would be better to do the alignment
> and endian swap in tcg_reg_alloc_call(). */
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
2017-09-07 19:00 ` Alex Bennée
@ 2017-09-07 19:02 ` Richard Henderson
2017-09-08 9:28 ` Alex Bennée
0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-09-07 19:02 UTC (permalink / raw)
To: Alex Bennée; +Cc: qemu-devel, qemu-arm
On 09/07/2017 12:00 PM, Alex Bennée wrote:
>
> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> Nothing uses or implements them yet.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>> tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> tcg/tcg.h | 24 ++++++++++++++++
>> 2 files changed, 113 insertions(+)
>>
>> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
>> index 956fb1e9f3..9162125fac 100644
>> --- a/tcg/tcg-opc.h
>> +++ b/tcg/tcg-opc.h
>> @@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>>
>> #undef TLADDR_ARGS
>> #undef DATA64_ARGS
>> +
>> +/* Host integer vector operations. */
>> +/* These opcodes are required whenever the base vector size is enabled. */
>> +
>> +DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +
>> +DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +
>> +DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>> +
>> +DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>> +
>> +DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>> +
>> +/* These opcodes are optional.
>> + All element counts must be supported if any are. */
>> +
>> +DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
>> +DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
>> +DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
>> +
>> +DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
>> +DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
>> +DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
>> +
>> +DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
>> +DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
>> +DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
>> +
>> +DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
>> +DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
>> +DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
>> +
>> +DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>> +DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>> +DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>> +DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>> +
>> +DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>> +DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>> +DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>> +DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>> +
>> #undef IMPL
>> #undef IMPL64
>> #undef DEF
>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>> index 1277caed3d..b9e15da13b 100644
>> --- a/tcg/tcg.h
>> +++ b/tcg/tcg.h
>> @@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
>> #define TCG_TARGET_HAS_rem_i64 0
>> #endif
>>
>> +#ifndef TCG_TARGET_HAS_v64
>> +#define TCG_TARGET_HAS_v64 0
>> +#define TCG_TARGET_HAS_andc_v64 0
>> +#define TCG_TARGET_HAS_orc_v64 0
>> +#define TCG_TARGET_HAS_not_v64 0
>> +#define TCG_TARGET_HAS_neg_v64 0
>> +#endif
>> +
>> +#ifndef TCG_TARGET_HAS_v128
>> +#define TCG_TARGET_HAS_v128 0
>> +#define TCG_TARGET_HAS_andc_v128 0
>> +#define TCG_TARGET_HAS_orc_v128 0
>> +#define TCG_TARGET_HAS_not_v128 0
>> +#define TCG_TARGET_HAS_neg_v128 0
>> +#endif
>> +
>> +#ifndef TCG_TARGET_HAS_v256
>> +#define TCG_TARGET_HAS_v256 0
>> +#define TCG_TARGET_HAS_andc_v256 0
>> +#define TCG_TARGET_HAS_orc_v256 0
>> +#define TCG_TARGET_HAS_not_v256 0
>> +#define TCG_TARGET_HAS_neg_v256 0
>> +#endif
>
> Is it possible to use the DEF expanders to avoid manually defining all
> the TCG_TARGET_HAS_op for each vector size?
No. The preprocessor doesn't work that way: a macro expansion cannot
emit new #define directives, so each TCG_TARGET_HAS_* needs its own
literal definition.
r~
* Re: [Qemu-devel] [PATCH 4/8] tcg: Add operations for host vectors
2017-09-07 19:02 ` Richard Henderson
@ 2017-09-08 9:28 ` Alex Bennée
0 siblings, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 9:28 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> On 09/07/2017 12:00 PM, Alex Bennée wrote:
>>
>> Richard Henderson <richard.henderson@linaro.org> writes:
>>
>>> Nothing uses or implements them yet.
>>>
>>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>>> ---
>>> tcg/tcg-opc.h | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> tcg/tcg.h | 24 ++++++++++++++++
>>> 2 files changed, 113 insertions(+)
>>>
>>> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
>>> index 956fb1e9f3..9162125fac 100644
>>> --- a/tcg/tcg-opc.h
>>> +++ b/tcg/tcg-opc.h
>>> @@ -206,6 +206,95 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
>>>
>>> #undef TLADDR_ARGS
>>> #undef DATA64_ARGS
>>> +
>>> +/* Host integer vector operations. */
>>> +/* These opcodes are required whenever the base vector size is enabled. */
>>> +
>>> +DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(ld_v256, 1, 1, 1, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(st_v64, 0, 2, 1, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(st_v128, 0, 2, 1, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(st_v256, 0, 2, 1, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(and_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(and_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(and_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(or_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(or_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(or_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(xor_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(xor_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(xor_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(add8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(add16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(add32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +
>>> +DEF(add8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(add16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(add32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(add64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +
>>> +DEF(add8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +DEF(add16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +DEF(add32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +DEF(add64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +DEF(sub8_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(sub16_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +DEF(sub32_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_v64))
>>> +
>>> +DEF(sub8_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(sub16_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(sub32_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +DEF(sub64_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_v128))
>>> +
>>> +DEF(sub8_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +DEF(sub16_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +DEF(sub32_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +DEF(sub64_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_v256))
>>> +
>>> +/* These opcodes are optional.
>>> + All element counts must be supported if any are. */
>>> +
>>> +DEF(not_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v64))
>>> +DEF(not_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v128))
>>> +DEF(not_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_not_v256))
>>> +
>>> +DEF(andc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v64))
>>> +DEF(andc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v128))
>>> +DEF(andc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_andc_v256))
>>> +
>>> +DEF(orc_v64, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v64))
>>> +DEF(orc_v128, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v128))
>>> +DEF(orc_v256, 1, 2, 0, IMPL(TCG_TARGET_HAS_orc_v256))
>>> +
>>> +DEF(neg8_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
>>> +DEF(neg16_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
>>> +DEF(neg32_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v64))
>>> +
>>> +DEF(neg8_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>>> +DEF(neg16_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>>> +DEF(neg32_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>>> +DEF(neg64_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v128))
>>> +
>>> +DEF(neg8_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>>> +DEF(neg16_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>>> +DEF(neg32_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>>> +DEF(neg64_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_neg_v256))
>>> +
>>> #undef IMPL
>>> #undef IMPL64
>>> #undef DEF
>>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>>> index 1277caed3d..b9e15da13b 100644
>>> --- a/tcg/tcg.h
>>> +++ b/tcg/tcg.h
>>> @@ -166,6 +166,30 @@ typedef uint64_t TCGRegSet;
>>> #define TCG_TARGET_HAS_rem_i64 0
>>> #endif
>>>
>>> +#ifndef TCG_TARGET_HAS_v64
>>> +#define TCG_TARGET_HAS_v64 0
>>> +#define TCG_TARGET_HAS_andc_v64 0
>>> +#define TCG_TARGET_HAS_orc_v64 0
>>> +#define TCG_TARGET_HAS_not_v64 0
>>> +#define TCG_TARGET_HAS_neg_v64 0
>>> +#endif
>>> +
>>> +#ifndef TCG_TARGET_HAS_v128
>>> +#define TCG_TARGET_HAS_v128 0
>>> +#define TCG_TARGET_HAS_andc_v128 0
>>> +#define TCG_TARGET_HAS_orc_v128 0
>>> +#define TCG_TARGET_HAS_not_v128 0
>>> +#define TCG_TARGET_HAS_neg_v128 0
>>> +#endif
>>> +
>>> +#ifndef TCG_TARGET_HAS_v256
>>> +#define TCG_TARGET_HAS_v256 0
>>> +#define TCG_TARGET_HAS_andc_v256 0
>>> +#define TCG_TARGET_HAS_orc_v256 0
>>> +#define TCG_TARGET_HAS_not_v256 0
>>> +#define TCG_TARGET_HAS_neg_v256 0
>>> +#endif
>>
>> Is it possible to use the DEF expanders to avoid manually defining all
>> the TCG_TARGET_HAS_op for each vector size?
>
> No. The preprocessor doesn't work that way.
Ahh, I follow now. tcg-target.h defines TCG_TARGET_HAS_foo for each op
it supports, and this boilerplate ensures there is a concrete define
for the targets that don't support them (yet).
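(So, as a hypothetical example, a host backend that only implements
128-bit vectors would just add

#define TCG_TARGET_HAS_v128 1

to its tcg-target.h, and the tcg.h fallbacks supply the zeros for
everything else.)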
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid
2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
2017-08-17 23:45 ` Philippe Mathieu-Daudé
@ 2017-09-08 9:30 ` Alex Bennée
1 sibling, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 9:30 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Add with value 0 so that structure zero initialization can
> indicate that the field is not present.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> tcg/tcg-opc.h | 2 ++
> tcg/tcg.c | 3 +++
> 2 files changed, 5 insertions(+)
>
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index 9162125fac..b1445a4c24 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -26,6 +26,8 @@
> * DEF(name, oargs, iargs, cargs, flags)
> */
>
> +DEF(invalid, 0, 0, 0, TCG_OPF_NOT_PRESENT)
> +
> /* predefined ops */
> DEF(discard, 1, 0, 0, TCG_OPF_NOT_PRESENT)
> DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 3c3cdda938..879b29e81f 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -756,6 +756,9 @@ int tcg_check_temp_count(void)
> bool tcg_op_supported(TCGOpcode op)
> {
> switch (op) {
> + case INDEX_op_invalid:
> + return false;
> +
> case INDEX_op_discard:
> case INDEX_op_set_label:
> case INDEX_op_call:
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops
2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
@ 2017-09-08 9:34 ` Alex Bennée
0 siblings, 0 replies; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 9:34 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
I can see where this is going, but I'll defer the review until v2 with
the extra verbosity in the original expander patch.
> ---
> tcg/tcg-op-gvec.h | 4 +
> tcg/tcg.h | 6 +-
> tcg/tcg-op-gvec.c | 230 +++++++++++++++++++++++++++++++++++++++++++-----------
> tcg/tcg.c | 8 +-
> 4 files changed, 197 insertions(+), 51 deletions(-)
>
> diff --git a/tcg/tcg-op-gvec.h b/tcg/tcg-op-gvec.h
> index 10db3599a5..99f36d208e 100644
> --- a/tcg/tcg-op-gvec.h
> +++ b/tcg/tcg-op-gvec.h
> @@ -40,6 +40,10 @@ typedef struct {
> /* Similarly, but load up a constant and re-use across lanes. */
> void (*fni8x)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64);
> uint64_t extra_value;
> + /* Operations with host vector ops. */
> + TCGOpcode op_v256;
> + TCGOpcode op_v128;
> + TCGOpcode op_v64;
> /* Larger sizes: expand out-of-line helper w/size descriptor. */
> void (*fno)(TCGv_ptr, TCGv_ptr, TCGv_ptr, TCGv_i32);
> } GVecGen3;
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index b443143b21..7f10501d31 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -825,9 +825,11 @@ int tcg_global_mem_new_internal(TCGType, TCGv_ptr, intptr_t, const char *);
> TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name);
> TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
>
> -TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
> -TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
> +int tcg_temp_new_internal(TCGType type, bool temp_local);
> +TCGv_i32 tcg_temp_new_internal_i32(bool temp_local);
> +TCGv_i64 tcg_temp_new_internal_i64(bool temp_local);
>
> +void tcg_temp_free_internal(int arg);
> void tcg_temp_free_i32(TCGv_i32 arg);
> void tcg_temp_free_i64(TCGv_i64 arg);
>
> diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
> index 6de49dc07f..3aca565dc0 100644
> --- a/tcg/tcg-op-gvec.c
> +++ b/tcg/tcg-op-gvec.c
> @@ -30,54 +30,73 @@
> #define REP8(x) ((x) * 0x0101010101010101ull)
> #define REP16(x) ((x) * 0x0001000100010001ull)
>
> -#define MAX_INLINE 16
> +#define MAX_UNROLL 4
>
> -static inline void check_size_s(uint32_t opsz, uint32_t clsz)
> +static inline void check_size_align(uint32_t opsz, uint32_t clsz, uint32_t ofs)
> {
> - tcg_debug_assert(opsz % 8 == 0);
> - tcg_debug_assert(clsz % 8 == 0);
> + uint32_t align = clsz > 16 || opsz >= 16 ? 15 : 7;
> + tcg_debug_assert(opsz > 0);
> tcg_debug_assert(opsz <= clsz);
> + tcg_debug_assert((opsz & align) == 0);
> + tcg_debug_assert((clsz & align) == 0);
> + tcg_debug_assert((ofs & align) == 0);
> }
>
> -static inline void check_align_s_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +static inline void check_overlap_3(uint32_t d, uint32_t a,
> + uint32_t b, uint32_t s)
> {
> - tcg_debug_assert(dofs % 8 == 0);
> - tcg_debug_assert(aofs % 8 == 0);
> - tcg_debug_assert(bofs % 8 == 0);
> + tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> + tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> + tcg_debug_assert(a == b || a + s <= b || b + s <= a);
> }
>
> -static inline void check_size_l(uint32_t opsz, uint32_t clsz)
> +static inline bool check_size_impl(uint32_t opsz, uint32_t lnsz)
> {
> - tcg_debug_assert(opsz % 16 == 0);
> - tcg_debug_assert(clsz % 16 == 0);
> - tcg_debug_assert(opsz <= clsz);
> + uint32_t lnct = opsz / lnsz;
> + return lnct >= 1 && lnct <= MAX_UNROLL;
> }
>
> -static inline void check_align_l_3(uint32_t dofs, uint32_t aofs, uint32_t bofs)
> +static void expand_clr_v(uint32_t dofs, uint32_t clsz, uint32_t lnsz,
> + TCGType type, TCGOpcode opc_mv, TCGOpcode opc_st)
> {
> - tcg_debug_assert(dofs % 16 == 0);
> - tcg_debug_assert(aofs % 16 == 0);
> - tcg_debug_assert(bofs % 16 == 0);
> -}
> + TCGArg t0 = tcg_temp_new_internal(type, 0);
> + TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
> + uint32_t i;
>
> -static inline void check_overlap_3(uint32_t d, uint32_t a,
> - uint32_t b, uint32_t s)
> -{
> - tcg_debug_assert(d == a || d + s <= a || a + s <= d);
> - tcg_debug_assert(d == b || d + s <= b || b + s <= d);
> - tcg_debug_assert(a == b || a + s <= b || b + s <= a);
> + tcg_gen_op2(&tcg_ctx, opc_mv, t0, 0);
> + for (i = 0; i < clsz; i += lnsz) {
> + tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
> + }
> + tcg_temp_free_internal(t0);
> }
>
> -static void expand_clr(uint32_t dofs, uint32_t opsz, uint32_t clsz)
> +static void expand_clr(uint32_t dofs, uint32_t clsz)
> {
> - if (clsz > opsz) {
> - TCGv_i64 zero = tcg_const_i64(0);
> - uint32_t i;
> + if (clsz >= 32 && TCG_TARGET_HAS_v256) {
> + uint32_t done = QEMU_ALIGN_DOWN(clsz, 32);
> + expand_clr_v(dofs, done, 32, TCG_TYPE_V256,
> + INDEX_op_movi_v256, INDEX_op_st_v256);
> + dofs += done;
> + clsz -= done;
> + }
>
> - for (i = opsz; i < clsz; i += 8) {
> - tcg_gen_st_i64(zero, tcg_ctx.tcg_env, dofs + i);
> - }
> - tcg_temp_free_i64(zero);
> + if (clsz >= 16 && TCG_TARGET_HAS_v128) {
> + uint32_t done = QEMU_ALIGN_DOWN(clsz, 16);
> + expand_clr_v(dofs, done, 16, TCG_TYPE_V128,
> + INDEX_op_movi_v128, INDEX_op_st_v128);
> + dofs += done;
> + clsz -= done;
> + }
> +
> + if (TCG_TARGET_REG_BITS == 64) {
> + expand_clr_v(dofs, clsz, 8, TCG_TYPE_I64,
> + INDEX_op_movi_i64, INDEX_op_st_i64);
> + } else if (TCG_TARGET_HAS_v64) {
> + expand_clr_v(dofs, clsz, 8, TCG_TYPE_V64,
> + INDEX_op_movi_v64, INDEX_op_st_v64);
> + } else {
> + expand_clr_v(dofs, clsz, 4, TCG_TYPE_I32,
> + INDEX_op_movi_i32, INDEX_op_st_i32);
> }
> }
>
> @@ -164,6 +183,7 @@ static void expand_3x8(uint32_t dofs, uint32_t aofs,
> tcg_temp_free_i64(t0);
> }
>
> +/* FIXME: add CSE for constants and we can eliminate this. */
> static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> uint32_t opsz, uint64_t data,
> void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64, TCGv_i64))
> @@ -192,28 +212,111 @@ static void expand_3x8p1(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> tcg_temp_free_i64(t2);
> }
>
> +static void expand_3_v(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> + uint32_t opsz, uint32_t lnsz, TCGType type,
> + TCGOpcode opc_op, TCGOpcode opc_ld, TCGOpcode opc_st)
> +{
> + TCGArg t0 = tcg_temp_new_internal(type, 0);
> + TCGArg env = GET_TCGV_PTR(tcg_ctx.tcg_env);
> + uint32_t i;
> +
> + if (aofs == bofs) {
> + for (i = 0; i < opsz; i += lnsz) {
> + tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
> + tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t0);
> + tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
> + }
> + } else {
> + TCGArg t1 = tcg_temp_new_internal(type, 0);
> + for (i = 0; i < opsz; i += lnsz) {
> + tcg_gen_op3(&tcg_ctx, opc_ld, t0, env, aofs + i);
> + tcg_gen_op3(&tcg_ctx, opc_ld, t1, env, bofs + i);
> + tcg_gen_op3(&tcg_ctx, opc_op, t0, t0, t1);
> + tcg_gen_op3(&tcg_ctx, opc_st, t0, env, dofs + i);
> + }
> + tcg_temp_free_internal(t1);
> + }
> + tcg_temp_free_internal(t0);
> +}
> +
> void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> uint32_t opsz, uint32_t clsz, const GVecGen3 *g)
> {
> + check_size_align(opsz, clsz, dofs | aofs | bofs);
> check_overlap_3(dofs, aofs, bofs, clsz);
> - if (opsz <= MAX_INLINE) {
> - check_size_s(opsz, clsz);
> - check_align_s_3(dofs, aofs, bofs);
> - if (g->fni8) {
> - expand_3x8(dofs, aofs, bofs, opsz, g->fni8);
> - } else if (g->fni4) {
> - expand_3x4(dofs, aofs, bofs, opsz, g->fni4);
> +
> + if (opsz > MAX_UNROLL * 32 || clsz > MAX_UNROLL * 32) {
> + goto do_ool;
> + }
> +
> + /* Recall that ARM SVE allows vector sizes that are not a power of 2.
> + Expand with successively smaller host vector sizes. The intent is
> + that e.g. opsz == 80 would be expanded with 2x32 + 1x16. */
> + /* ??? For clsz > opsz, the host may be able to use an op-sized
> + operation, zeroing the balance of the register. We can then
> + use a cl-sized store to implement the clearing without an extra
> + store operation. This is true for aarch64 and x86_64 hosts. */
> +
> + if (check_size_impl(opsz, 32) && tcg_op_supported(g->op_v256)) {
> + uint32_t done = QEMU_ALIGN_DOWN(opsz, 32);
> + expand_3_v(dofs, aofs, bofs, done, 32, TCG_TYPE_V256,
> + g->op_v256, INDEX_op_ld_v256, INDEX_op_st_v256);
> + dofs += done;
> + aofs += done;
> + bofs += done;
> + opsz -= done;
> + clsz -= done;
> + }
> +
> + if (check_size_impl(opsz, 16) && tcg_op_supported(g->op_v128)) {
> + uint32_t done = QEMU_ALIGN_DOWN(opsz, 16);
> + expand_3_v(dofs, aofs, bofs, done, 16, TCG_TYPE_V128,
> + g->op_v128, INDEX_op_ld_v128, INDEX_op_st_v128);
> + dofs += done;
> + aofs += done;
> + bofs += done;
> + opsz -= done;
> + clsz -= done;
> + }
> +
> + if (check_size_impl(opsz, 8)) {
> + uint32_t done = QEMU_ALIGN_DOWN(opsz, 8);
> + if (tcg_op_supported(g->op_v64)) {
> + expand_3_v(dofs, aofs, bofs, done, 8, TCG_TYPE_V64,
> + g->op_v64, INDEX_op_ld_v64, INDEX_op_st_v64);
> + } else if (g->fni8) {
> + expand_3x8(dofs, aofs, bofs, done, g->fni8);
> } else if (g->fni8x) {
> - expand_3x8p1(dofs, aofs, bofs, opsz, g->extra_value, g->fni8x);
> + expand_3x8p1(dofs, aofs, bofs, done, g->extra_value, g->fni8x);
> } else {
> - g_assert_not_reached();
> + done = 0;
> }
> - expand_clr(dofs, opsz, clsz);
> - } else {
> - check_size_l(opsz, clsz);
> - check_align_l_3(dofs, aofs, bofs);
> - expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
> + dofs += done;
> + aofs += done;
> + bofs += done;
> + opsz -= done;
> + clsz -= done;
> }
> +
> + if (check_size_impl(opsz, 4)) {
> + uint32_t done = QEMU_ALIGN_DOWN(opsz, 4);
> + expand_3x4(dofs, aofs, bofs, done, g->fni4);
> + dofs += done;
> + aofs += done;
> + bofs += done;
> + opsz -= done;
> + clsz -= done;
> + }
> +
> + if (opsz == 0) {
> + if (clsz != 0) {
> + expand_clr(dofs, clsz);
> + }
> + return;
> + }
> +
> + do_ool:
> + expand_3_o(dofs, aofs, bofs, opsz, clsz, g->fno);
> }
>
> static void gen_addv_mask(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b, TCGv_i64 m)
> @@ -240,6 +343,9 @@ void tcg_gen_gvec_add8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> static const GVecGen3 g = {
> .extra_value = REP8(0x80),
> .fni8x = gen_addv_mask,
> + .op_v256 = INDEX_op_add8_v256,
> + .op_v128 = INDEX_op_add8_v128,
> + .op_v64 = INDEX_op_add8_v64,
> .fno = gen_helper_gvec_add8,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -251,6 +357,9 @@ void tcg_gen_gvec_add16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> static const GVecGen3 g = {
> .extra_value = REP16(0x8000),
> .fni8x = gen_addv_mask,
> + .op_v256 = INDEX_op_add16_v256,
> + .op_v128 = INDEX_op_add16_v128,
> + .op_v64 = INDEX_op_add16_v64,
> .fno = gen_helper_gvec_add16,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -261,6 +370,9 @@ void tcg_gen_gvec_add32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni4 = tcg_gen_add_i32,
> + .op_v256 = INDEX_op_add32_v256,
> + .op_v128 = INDEX_op_add32_v128,
> + .op_v64 = INDEX_op_add32_v64,
> .fno = gen_helper_gvec_add32,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -271,6 +383,8 @@ void tcg_gen_gvec_add64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni8 = tcg_gen_add_i64,
> + .op_v256 = INDEX_op_add64_v256,
> + .op_v128 = INDEX_op_add64_v128,
> .fno = gen_helper_gvec_add64,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -328,6 +442,9 @@ void tcg_gen_gvec_sub8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> static const GVecGen3 g = {
> .extra_value = REP8(0x80),
> .fni8x = gen_subv_mask,
> + .op_v256 = INDEX_op_sub8_v256,
> + .op_v128 = INDEX_op_sub8_v128,
> + .op_v64 = INDEX_op_sub8_v64,
> .fno = gen_helper_gvec_sub8,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -339,6 +456,9 @@ void tcg_gen_gvec_sub16(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> static const GVecGen3 g = {
> .extra_value = REP16(0x8000),
> .fni8x = gen_subv_mask,
> + .op_v256 = INDEX_op_sub16_v256,
> + .op_v128 = INDEX_op_sub16_v128,
> + .op_v64 = INDEX_op_sub16_v64,
> .fno = gen_helper_gvec_sub16,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -349,6 +469,9 @@ void tcg_gen_gvec_sub32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni4 = tcg_gen_sub_i32,
> + .op_v256 = INDEX_op_sub32_v256,
> + .op_v128 = INDEX_op_sub32_v128,
> + .op_v64 = INDEX_op_sub32_v64,
> .fno = gen_helper_gvec_sub32,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -359,6 +482,8 @@ void tcg_gen_gvec_sub64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni8 = tcg_gen_sub_i64,
> + .op_v256 = INDEX_op_sub64_v256,
> + .op_v128 = INDEX_op_sub64_v128,
> .fno = gen_helper_gvec_sub64,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -397,6 +522,9 @@ void tcg_gen_gvec_and8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni8 = tcg_gen_and_i64,
> + .op_v256 = INDEX_op_and_v256,
> + .op_v128 = INDEX_op_and_v128,
> + .op_v64 = INDEX_op_and_v64,
> .fno = gen_helper_gvec_and8,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -407,6 +535,9 @@ void tcg_gen_gvec_or8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni8 = tcg_gen_or_i64,
> + .op_v256 = INDEX_op_or_v256,
> + .op_v128 = INDEX_op_or_v128,
> + .op_v64 = INDEX_op_or_v64,
> .fno = gen_helper_gvec_or8,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -417,6 +548,9 @@ void tcg_gen_gvec_xor8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni8 = tcg_gen_xor_i64,
> + .op_v256 = INDEX_op_xor_v256,
> + .op_v128 = INDEX_op_xor_v128,
> + .op_v64 = INDEX_op_xor_v64,
> .fno = gen_helper_gvec_xor8,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -427,6 +561,9 @@ void tcg_gen_gvec_andc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni8 = tcg_gen_andc_i64,
> + .op_v256 = INDEX_op_andc_v256,
> + .op_v128 = INDEX_op_andc_v128,
> + .op_v64 = INDEX_op_andc_v64,
> .fno = gen_helper_gvec_andc8,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> @@ -437,6 +574,9 @@ void tcg_gen_gvec_orc8(uint32_t dofs, uint32_t aofs, uint32_t bofs,
> {
> static const GVecGen3 g = {
> .fni8 = tcg_gen_orc_i64,
> + .op_v256 = INDEX_op_orc_v256,
> + .op_v128 = INDEX_op_orc_v128,
> + .op_v64 = INDEX_op_orc_v64,
> .fno = gen_helper_gvec_orc8,
> };
> tcg_gen_gvec_3(dofs, aofs, bofs, opsz, clsz, &g);
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 879b29e81f..86eb4214b0 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -604,7 +604,7 @@ int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
> return temp_idx(s, ts);
> }
>
> -static int tcg_temp_new_internal(TCGType type, int temp_local)
> +int tcg_temp_new_internal(TCGType type, bool temp_local)
> {
> TCGContext *s = &tcg_ctx;
> TCGTemp *ts;
> @@ -650,7 +650,7 @@ static int tcg_temp_new_internal(TCGType type, int temp_local)
> return idx;
> }
>
> -TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
> +TCGv_i32 tcg_temp_new_internal_i32(bool temp_local)
> {
> int idx;
>
> @@ -658,7 +658,7 @@ TCGv_i32 tcg_temp_new_internal_i32(int temp_local)
> return MAKE_TCGV_I32(idx);
> }
>
> -TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
> +TCGv_i64 tcg_temp_new_internal_i64(bool temp_local)
> {
> int idx;
>
> @@ -666,7 +666,7 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
> return MAKE_TCGV_I64(idx);
> }
>
> -static void tcg_temp_free_internal(int idx)
> +void tcg_temp_free_internal(int idx)
> {
> TCGContext *s = &tcg_ctx;
> TCGTemp *ts;
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
2017-08-22 13:15 ` Alex Bennée
@ 2017-09-08 10:13 ` Alex Bennée
2017-09-08 13:10 ` Alex Bennée
1 sibling, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 10:13 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> tcg/i386/tcg-target.h | 46 +++++-
> tcg/tcg-opc.h | 12 +-
> tcg/i386/tcg-target.inc.c | 382 ++++++++++++++++++++++++++++++++++++++++++----
> 3 files changed, 399 insertions(+), 41 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
> index e512648c95..147f82062b 100644
> --- a/tcg/i386/tcg-target.h
> +++ b/tcg/i386/tcg-target.h
> @@ -30,11 +30,10 @@
>
> #ifdef __x86_64__
> # define TCG_TARGET_REG_BITS 64
> -# define TCG_TARGET_NB_REGS 16
> #else
> # define TCG_TARGET_REG_BITS 32
> -# define TCG_TARGET_NB_REGS 8
> #endif
> +# define TCG_TARGET_NB_REGS 24
>
> typedef enum {
> TCG_REG_EAX = 0,
> @@ -56,6 +55,19 @@ typedef enum {
> TCG_REG_R13,
> TCG_REG_R14,
> TCG_REG_R15,
> +
> + /* SSE registers; 64-bit has access to 8 more, but we won't
> + need more than a few and using only the first 8 minimizes
> + the need for a rex prefix on the sse instructions. */
> + TCG_REG_XMM0,
> + TCG_REG_XMM1,
> + TCG_REG_XMM2,
> + TCG_REG_XMM3,
> + TCG_REG_XMM4,
> + TCG_REG_XMM5,
> + TCG_REG_XMM6,
> + TCG_REG_XMM7,
> +
> TCG_REG_RAX = TCG_REG_EAX,
> TCG_REG_RCX = TCG_REG_ECX,
> TCG_REG_RDX = TCG_REG_EDX,
> @@ -79,6 +91,17 @@ extern bool have_bmi1;
> extern bool have_bmi2;
> extern bool have_popcnt;
>
> +#ifdef __SSE2__
> +#define have_sse2 true
> +#else
> +extern bool have_sse2;
> +#endif
> +#ifdef __AVX2__
> +#define have_avx2 true
> +#else
> +extern bool have_avx2;
> +#endif
> +
> /* optional instructions */
> #define TCG_TARGET_HAS_div2_i32 1
> #define TCG_TARGET_HAS_rot_i32 1
> @@ -147,6 +170,25 @@ extern bool have_popcnt;
> #define TCG_TARGET_HAS_mulsh_i64 0
> #endif
>
> +#define TCG_TARGET_HAS_v64 have_sse2
> +#define TCG_TARGET_HAS_v128 have_sse2
> +#define TCG_TARGET_HAS_v256 have_avx2
> +
> +#define TCG_TARGET_HAS_andc_v64 TCG_TARGET_HAS_v64
> +#define TCG_TARGET_HAS_orc_v64 0
> +#define TCG_TARGET_HAS_not_v64 0
> +#define TCG_TARGET_HAS_neg_v64 0
> +
> +#define TCG_TARGET_HAS_andc_v128 TCG_TARGET_HAS_v128
> +#define TCG_TARGET_HAS_orc_v128 0
> +#define TCG_TARGET_HAS_not_v128 0
> +#define TCG_TARGET_HAS_neg_v128 0
> +
> +#define TCG_TARGET_HAS_andc_v256 TCG_TARGET_HAS_v256
> +#define TCG_TARGET_HAS_orc_v256 0
> +#define TCG_TARGET_HAS_not_v256 0
> +#define TCG_TARGET_HAS_neg_v256 0
> +
> #define TCG_TARGET_deposit_i32_valid(ofs, len) \
> (have_bmi2 || \
> ((ofs) == 0 && (len) == 8) || \
> diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
> index b1445a4c24..b84cd584fb 100644
> --- a/tcg/tcg-opc.h
> +++ b/tcg/tcg-opc.h
> @@ -212,13 +212,13 @@ DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
> /* Host integer vector operations. */
> /* These opcodes are required whenever the base vector size is enabled. */
>
> -DEF(mov_v64, 1, 1, 0, IMPL(TCG_TARGET_HAS_v64))
> -DEF(mov_v128, 1, 1, 0, IMPL(TCG_TARGET_HAS_v128))
> -DEF(mov_v256, 1, 1, 0, IMPL(TCG_TARGET_HAS_v256))
> +DEF(mov_v64, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v128, 1, 1, 0, TCG_OPF_NOT_PRESENT)
> +DEF(mov_v256, 1, 1, 0, TCG_OPF_NOT_PRESENT)
>
> -DEF(movi_v64, 1, 0, 1, IMPL(TCG_TARGET_HAS_v64))
> -DEF(movi_v128, 1, 0, 1, IMPL(TCG_TARGET_HAS_v128))
> -DEF(movi_v256, 1, 0, 1, IMPL(TCG_TARGET_HAS_v256))
> +DEF(movi_v64, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v128, 1, 0, 1, TCG_OPF_NOT_PRESENT)
> +DEF(movi_v256, 1, 0, 1, TCG_OPF_NOT_PRESENT)
I don't follow; isn't the point of IMPL(TCG_TARGET_HAS_foo) to allow the
definition when the backend adds #define TCG_TARGET_HAS_foo 1?
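My (possibly wrong) reading of the two forms is:

  IMPL(X)             /* TCG_OPF_NOT_PRESENT when X is known 0, else 0 */
  TCG_OPF_NOT_PRESENT /* opcode exists in the enum but is never selectable */

so I would have expected mov/movi to stay behind IMPL() like the ld/st
ops just below.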
>
> DEF(ld_v64, 1, 1, 1, IMPL(TCG_TARGET_HAS_v64))
> DEF(ld_v128, 1, 1, 1, IMPL(TCG_TARGET_HAS_v128))
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index aeefb72aa0..0e01b54aa0 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -31,7 +31,9 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
> "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15",
> #else
> "%eax", "%ecx", "%edx", "%ebx", "%esp", "%ebp", "%esi", "%edi",
> + NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
> #endif
> + "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "%xmm7",
> };
> #endif
>
> @@ -61,6 +63,14 @@ static const int tcg_target_reg_alloc_order[] = {
> TCG_REG_EDX,
> TCG_REG_EAX,
> #endif
> + TCG_REG_XMM0,
> + TCG_REG_XMM1,
> + TCG_REG_XMM2,
> + TCG_REG_XMM3,
> + TCG_REG_XMM4,
> + TCG_REG_XMM5,
> + TCG_REG_XMM6,
> + TCG_REG_XMM7,
> };
>
> static const int tcg_target_call_iarg_regs[] = {
> @@ -94,7 +104,7 @@ static const int tcg_target_call_oarg_regs[] = {
> #define TCG_CT_CONST_I32 0x400
> #define TCG_CT_CONST_WSZ 0x800
>
> -/* Registers used with L constraint, which are the first argument
> +/* Registers used with L constraint, which are the first argument
> registers on x86_64, and two random call clobbered registers on
> i386. */
> #if TCG_TARGET_REG_BITS == 64
> @@ -127,6 +137,16 @@ bool have_bmi1;
> bool have_bmi2;
> bool have_popcnt;
>
> +#ifndef have_sse2
> +bool have_sse2;
> +#endif
> +#ifdef have_avx2
> +#define have_avx1 have_avx2
> +#else
> +static bool have_avx1;
> +bool have_avx2;
> +#endif
> +
> #ifdef CONFIG_CPUID_H
> static bool have_movbe;
> static bool have_lzcnt;
> @@ -215,6 +235,10 @@ static const char *target_parse_constraint(TCGArgConstraint *ct,
> /* With TZCNT/LZCNT, we can have operand-size as an input. */
> ct->ct |= TCG_CT_CONST_WSZ;
> break;
> + case 'x':
> + ct->ct |= TCG_CT_REG;
> + tcg_regset_set32(ct->u.regs, 0, 0xff0000);
> + break;
The documentation on constraints in the README is fairly minimal, and we
keep adding target-specific ones, so perhaps a single-line comment here
for clarity?
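Something like, say:

  case 'x': /* SSE/AVX vector registers, %xmm0-%xmm7 */

(the exact wording is just a suggestion, of course).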
>
> /* qemu_ld/st address constraint */
> case 'L':
> @@ -292,6 +316,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
> #endif
> #define P_SIMDF3 0x20000 /* 0xf3 opcode prefix */
> #define P_SIMDF2 0x40000 /* 0xf2 opcode prefix */
> +#define P_VEXL 0x80000 /* Set VEX.L = 1 */
>
> #define OPC_ARITH_EvIz (0x81)
> #define OPC_ARITH_EvIb (0x83)
> @@ -324,13 +349,31 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
> #define OPC_MOVL_Iv (0xb8)
> #define OPC_MOVBE_GyMy (0xf0 | P_EXT38)
> #define OPC_MOVBE_MyGy (0xf1 | P_EXT38)
> +#define OPC_MOVDQA_GyMy (0x6f | P_EXT | P_DATA16)
> +#define OPC_MOVDQA_MyGy (0x7f | P_EXT | P_DATA16)
> +#define OPC_MOVDQU_GyMy (0x6f | P_EXT | P_SIMDF3)
> +#define OPC_MOVDQU_MyGy (0x7f | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_GyMy (0x7e | P_EXT | P_SIMDF3)
> +#define OPC_MOVQ_MyGy (0xd6 | P_EXT | P_DATA16)
> #define OPC_MOVSBL (0xbe | P_EXT)
> #define OPC_MOVSWL (0xbf | P_EXT)
> #define OPC_MOVSLQ (0x63 | P_REXW)
> #define OPC_MOVZBL (0xb6 | P_EXT)
> #define OPC_MOVZWL (0xb7 | P_EXT)
> +#define OPC_PADDB (0xfc | P_EXT | P_DATA16)
> +#define OPC_PADDW (0xfd | P_EXT | P_DATA16)
> +#define OPC_PADDD (0xfe | P_EXT | P_DATA16)
> +#define OPC_PADDQ (0xd4 | P_EXT | P_DATA16)
> +#define OPC_PAND (0xdb | P_EXT | P_DATA16)
> +#define OPC_PANDN (0xdf | P_EXT | P_DATA16)
> #define OPC_PDEP (0xf5 | P_EXT38 | P_SIMDF2)
> #define OPC_PEXT (0xf5 | P_EXT38 | P_SIMDF3)
> +#define OPC_POR (0xeb | P_EXT | P_DATA16)
> +#define OPC_PSUBB (0xf8 | P_EXT | P_DATA16)
> +#define OPC_PSUBW (0xf9 | P_EXT | P_DATA16)
> +#define OPC_PSUBD (0xfa | P_EXT | P_DATA16)
> +#define OPC_PSUBQ (0xfb | P_EXT | P_DATA16)
> +#define OPC_PXOR (0xef | P_EXT | P_DATA16)
> #define OPC_POP_r32 (0x58)
> #define OPC_POPCNT (0xb8 | P_EXT | P_SIMDF3)
> #define OPC_PUSH_r32 (0x50)
> @@ -500,7 +543,8 @@ static void tcg_out_modrm(TCGContext *s, int opc, int r, int rm)
> tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
> }
>
> -static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
> +static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v,
> + int rm, int index)
> {
> int tmp;
>
> @@ -515,14 +559,16 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
> } else if (opc & P_EXT) {
> tmp = 1;
> } else {
> - tcg_abort();
> + g_assert_not_reached();
> }
> - tmp |= 0x40; /* VEX.X */
> tmp |= (r & 8 ? 0 : 0x80); /* VEX.R */
> + tmp |= (index & 8 ? 0 : 0x40); /* VEX.X */
> tmp |= (rm & 8 ? 0 : 0x20); /* VEX.B */
> tcg_out8(s, tmp);
>
> tmp = (opc & P_REXW ? 0x80 : 0); /* VEX.W */
> + tmp |= (opc & P_VEXL ? 0x04 : 0); /* VEX.L */
> +
> /* VEX.pp */
> if (opc & P_DATA16) {
> tmp |= 1; /* 0x66 */
> @@ -538,7 +584,7 @@ static void tcg_out_vex_pfx_opc(TCGContext *s, int opc, int r, int v, int rm)
>
> static void tcg_out_vex_modrm(TCGContext *s, int opc, int r, int v, int rm)
> {
> - tcg_out_vex_pfx_opc(s, opc, r, v, rm);
> + tcg_out_vex_pfx_opc(s, opc, r, v, rm, 0);
> tcg_out8(s, 0xc0 | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
> }
>
> @@ -565,7 +611,7 @@ static void tcg_out_opc_pool_imm(TCGContext *s, int opc, int r,
> static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
> tcg_target_ulong data)
> {
> - tcg_out_vex_pfx_opc(s, opc, r, v, 0);
> + tcg_out_vex_pfx_opc(s, opc, r, v, 0, 0);
> tcg_out_sfx_pool_imm(s, r, data);
> }
>
> @@ -574,8 +620,8 @@ static void tcg_out_vex_pool_imm(TCGContext *s, int opc, int r, int v,
> mode for absolute addresses, ~RM is the size of the immediate operand
> that will follow the instruction. */
>
> -static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> - int index, int shift, intptr_t offset)
> +static void tcg_out_sib_offset(TCGContext *s, int r, int rm, int index,
> + int shift, intptr_t offset)
> {
> int mod, len;
>
> @@ -586,7 +632,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> intptr_t pc = (intptr_t)s->code_ptr + 5 + ~rm;
> intptr_t disp = offset - pc;
> if (disp == (int32_t)disp) {
> - tcg_out_opc(s, opc, r, 0, 0);
> tcg_out8(s, (LOWREGMASK(r) << 3) | 5);
> tcg_out32(s, disp);
> return;
> @@ -596,7 +641,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> use of the MODRM+SIB encoding and is therefore larger than
> rip-relative addressing. */
> if (offset == (int32_t)offset) {
> - tcg_out_opc(s, opc, r, 0, 0);
> tcg_out8(s, (LOWREGMASK(r) << 3) | 4);
> tcg_out8(s, (4 << 3) | 5);
> tcg_out32(s, offset);
> @@ -604,10 +648,9 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> }
>
> /* ??? The memory isn't directly addressable. */
> - tcg_abort();
> + g_assert_not_reached();
> } else {
> /* Absolute address. */
> - tcg_out_opc(s, opc, r, 0, 0);
> tcg_out8(s, (r << 3) | 5);
> tcg_out32(s, offset);
> return;
> @@ -630,7 +673,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> that would be used for %esp is the escape to the two byte form. */
> if (index < 0 && LOWREGMASK(rm) != TCG_REG_ESP) {
> /* Single byte MODRM format. */
> - tcg_out_opc(s, opc, r, rm, 0);
> tcg_out8(s, mod | (LOWREGMASK(r) << 3) | LOWREGMASK(rm));
> } else {
> /* Two byte MODRM+SIB format. */
> @@ -644,7 +686,6 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> tcg_debug_assert(index != TCG_REG_ESP);
> }
>
> - tcg_out_opc(s, opc, r, rm, index);
> tcg_out8(s, mod | (LOWREGMASK(r) << 3) | 4);
> tcg_out8(s, (shift << 6) | (LOWREGMASK(index) << 3) | LOWREGMASK(rm));
> }
> @@ -656,6 +697,21 @@ static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> }
> }
>
> +static void tcg_out_modrm_sib_offset(TCGContext *s, int opc, int r, int rm,
> + int index, int shift, intptr_t offset)
> +{
> + tcg_out_opc(s, opc, r, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> + tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
> +static void tcg_out_vex_modrm_sib_offset(TCGContext *s, int opc, int r, int v,
> + int rm, int index, int shift,
> + intptr_t offset)
> +{
> + tcg_out_vex_pfx_opc(s, opc, r, v, rm < 0 ? 0 : rm, index < 0 ? 0 : index);
> + tcg_out_sib_offset(s, r, rm, index, shift, offset);
> +}
> +
> /* A simplification of the above with no index or shift. */
> static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
> int rm, intptr_t offset)
> @@ -663,6 +719,31 @@ static inline void tcg_out_modrm_offset(TCGContext *s, int opc, int r,
> tcg_out_modrm_sib_offset(s, opc, r, rm, -1, 0, offset);
> }
>
> +static inline void tcg_out_vex_modrm_offset(TCGContext *s, int opc, int r,
> + int v, int rm, intptr_t offset)
> +{
> + tcg_out_vex_modrm_sib_offset(s, opc, r, v, rm, -1, 0, offset);
> +}
> +
> +static void tcg_out_maybe_vex_modrm(TCGContext *s, int opc, int r, int rm)
> +{
> + if (have_avx1) {
> + tcg_out_vex_modrm(s, opc, r, 0, rm);
> + } else {
> + tcg_out_modrm(s, opc, r, rm);
> + }
> +}
> +
> +static void tcg_out_maybe_vex_modrm_offset(TCGContext *s, int opc, int r,
> + int rm, intptr_t offset)
> +{
> + if (have_avx1) {
> + tcg_out_vex_modrm_offset(s, opc, r, 0, rm, offset);
> + } else {
> + tcg_out_modrm_offset(s, opc, r, rm, offset);
> + }
> +}
> +
> /* Generate dest op= src. Uses the same ARITH_* codes as tgen_arithi. */
> static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
> {
> @@ -673,12 +754,32 @@ static inline void tgen_arithr(TCGContext *s, int subop, int dest, int src)
> tcg_out_modrm(s, OPC_ARITH_GvEv + (subop << 3) + ext, dest, src);
> }
>
> -static inline void tcg_out_mov(TCGContext *s, TCGType type,
> - TCGReg ret, TCGReg arg)
> +static void tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
> {
> if (arg != ret) {
> - int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> - tcg_out_modrm(s, opc, ret, arg);
> + int opc = 0;
> +
> + switch (type) {
> + case TCG_TYPE_I64:
> + opc = P_REXW;
> + /* fallthru */
> + case TCG_TYPE_I32:
> + opc |= OPC_MOVL_GvEv;
> + tcg_out_modrm(s, opc, ret, arg);
> + break;
> +
> + case TCG_TYPE_V256:
> + opc = P_VEXL;
> + /* fallthru */
> + case TCG_TYPE_V128:
> + case TCG_TYPE_V64:
> + opc |= OPC_MOVDQA_GyMy;
> + tcg_out_maybe_vex_modrm(s, opc, ret, arg);
> + break;
> +
> + default:
> + g_assert_not_reached();
> + }
> }
> }
>
> @@ -687,6 +788,27 @@ static void tcg_out_movi(TCGContext *s, TCGType type,
> {
> tcg_target_long diff;
>
> + switch (type) {
> + case TCG_TYPE_I32:
> + case TCG_TYPE_I64:
> + break;
> +
> + case TCG_TYPE_V64:
> + case TCG_TYPE_V128:
> + case TCG_TYPE_V256:
> + /* ??? Revisit this as the implementation progresses. */
> + tcg_debug_assert(arg == 0);
> + if (have_avx1) {
> + tcg_out_vex_modrm(s, OPC_PXOR, ret, ret, ret);
> + } else {
> + tcg_out_modrm(s, OPC_PXOR, ret, ret);
> + }
> + return;
> +
> + default:
> + g_assert_not_reached();
> + }
> +
> if (arg == 0) {
> tgen_arithr(s, ARITH_XOR, ret, ret);
> return;
> @@ -750,18 +872,54 @@ static inline void tcg_out_pop(TCGContext *s, int reg)
> tcg_out_opc(s, OPC_POP_r32 + LOWREGMASK(reg), 0, reg, 0);
> }
>
> -static inline void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> - TCGReg arg1, intptr_t arg2)
> +static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg ret,
> + TCGReg arg1, intptr_t arg2)
> {
> - int opc = OPC_MOVL_GvEv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> - tcg_out_modrm_offset(s, opc, ret, arg1, arg2);
> + switch (type) {
> + case TCG_TYPE_I64:
> + tcg_out_modrm_offset(s, OPC_MOVL_GvEv | P_REXW, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_I32:
> + tcg_out_modrm_offset(s, OPC_MOVL_GvEv, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_V64:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_GyMy, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_V128:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_GyMy, ret, arg1, arg2);
> + break;
> + case TCG_TYPE_V256:
> + tcg_out_vex_modrm_offset(s, OPC_MOVDQU_GyMy | P_VEXL,
> + ret, 0, arg1, arg2);
> + break;
> + default:
> + g_assert_not_reached();
> + }
> }
>
> -static inline void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> - TCGReg arg1, intptr_t arg2)
> +static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
> + TCGReg arg1, intptr_t arg2)
> {
> - int opc = OPC_MOVL_EvGv + (type == TCG_TYPE_I64 ? P_REXW : 0);
> - tcg_out_modrm_offset(s, opc, arg, arg1, arg2);
> + switch (type) {
> + case TCG_TYPE_I64:
> + tcg_out_modrm_offset(s, OPC_MOVL_EvGv | P_REXW, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_I32:
> + tcg_out_modrm_offset(s, OPC_MOVL_EvGv, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_V64:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVQ_MyGy, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_V128:
> + tcg_out_maybe_vex_modrm_offset(s, OPC_MOVDQU_MyGy, arg, arg1, arg2);
> + break;
> + case TCG_TYPE_V256:
> + tcg_out_vex_modrm_offset(s, OPC_MOVDQU_MyGy | P_VEXL,
> + arg, 0, arg1, arg2);
> + break;
> + default:
> + g_assert_not_reached();
> + }
> }
>
> static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
> @@ -773,6 +931,8 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
> return false;
> }
> rexw = P_REXW;
> + } else if (type != TCG_TYPE_I32) {
> + return false;
> }
> tcg_out_modrm_offset(s, OPC_MOVL_EvIz | rexw, 0, base, ofs);
> tcg_out32(s, val);
> @@ -1914,6 +2074,15 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
> case glue(glue(INDEX_op_, x), _i32)
> #endif
>
> +#define OP_128_256(x) \
> + case glue(glue(INDEX_op_, x), _v256): \
> + rexw = P_VEXL; /* FALLTHRU */ \
> + case glue(glue(INDEX_op_, x), _v128)
> +
> +#define OP_64_128_256(x) \
> + OP_128_256(x): \
> + case glue(glue(INDEX_op_, x), _v64)
> +
> /* Hoist the loads of the most common arguments. */
> a0 = args[0];
> a1 = args[1];
> @@ -2379,19 +2548,94 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
> }
> break;
>
> + OP_64_128_256(add8):
> + c = OPC_PADDB;
> + goto gen_simd;
> + OP_64_128_256(add16):
> + c = OPC_PADDW;
> + goto gen_simd;
> + OP_64_128_256(add32):
> + c = OPC_PADDD;
> + goto gen_simd;
> + OP_128_256(add64):
> + c = OPC_PADDQ;
> + goto gen_simd;
> + OP_64_128_256(sub8):
> + c = OPC_PSUBB;
> + goto gen_simd;
> + OP_64_128_256(sub16):
> + c = OPC_PSUBW;
> + goto gen_simd;
> + OP_64_128_256(sub32):
> + c = OPC_PSUBD;
> + goto gen_simd;
> + OP_128_256(sub64):
> + c = OPC_PSUBQ;
> + goto gen_simd;
> + OP_64_128_256(and):
> + c = OPC_PAND;
> + goto gen_simd;
> + OP_64_128_256(andc):
> + c = OPC_PANDN;
> + goto gen_simd;
> + OP_64_128_256(or):
> + c = OPC_POR;
> + goto gen_simd;
> + OP_64_128_256(xor):
> + c = OPC_PXOR;
> + gen_simd:
> + if (have_avx1) {
> + tcg_out_vex_modrm(s, c, a0, a1, a2);
> + } else {
> + tcg_out_modrm(s, c, a0, a2);
> + }
> + break;
> +
> + case INDEX_op_ld_v64:
> + c = TCG_TYPE_V64;
> + goto gen_simd_ld;
> + case INDEX_op_ld_v128:
> + c = TCG_TYPE_V128;
> + goto gen_simd_ld;
> + case INDEX_op_ld_v256:
> + c = TCG_TYPE_V256;
> + gen_simd_ld:
> + tcg_out_ld(s, c, a0, a1, a2);
> + break;
> +
> + case INDEX_op_st_v64:
> + c = TCG_TYPE_V64;
> + goto gen_simd_st;
> + case INDEX_op_st_v128:
> + c = TCG_TYPE_V128;
> + goto gen_simd_st;
> + case INDEX_op_st_v256:
> + c = TCG_TYPE_V256;
> + gen_simd_st:
> + tcg_out_st(s, c, a0, a1, a2);
> + break;
> +
> case INDEX_op_mb:
> tcg_out_mb(s, a0);
> break;
> case INDEX_op_mov_i32: /* Always emitted via tcg_out_mov. */
> case INDEX_op_mov_i64:
> + case INDEX_op_mov_v64:
> + case INDEX_op_mov_v128:
> + case INDEX_op_mov_v256:
> case INDEX_op_movi_i32: /* Always emitted via tcg_out_movi. */
> case INDEX_op_movi_i64:
> + case INDEX_op_movi_v64:
> + case INDEX_op_movi_v128:
> + case INDEX_op_movi_v256:
> case INDEX_op_call: /* Always emitted via tcg_out_call. */
> default:
> tcg_abort();
> }
>
> #undef OP_32_64
> +#undef OP_128_256
> +#undef OP_64_128_256
> }
>
> static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> @@ -2417,6 +2661,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> = { .args_ct_str = { "r", "r", "L", "L" } };
> static const TCGTargetOpDef L_L_L_L
> = { .args_ct_str = { "L", "L", "L", "L" } };
> + static const TCGTargetOpDef x_0_x = { .args_ct_str = { "x", "0", "x" } };
> + static const TCGTargetOpDef x_x_x = { .args_ct_str = { "x", "x", "x" } };
> + static const TCGTargetOpDef x_r = { .args_ct_str = { "x", "r" } };
>
> switch (op) {
> case INDEX_op_goto_ptr:
> @@ -2620,6 +2867,52 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
> return &s2;
> }
>
> + case INDEX_op_ld_v64:
> + case INDEX_op_ld_v128:
> + case INDEX_op_ld_v256:
> + case INDEX_op_st_v64:
> + case INDEX_op_st_v128:
> + case INDEX_op_st_v256:
> + return &x_r;
> +
> + case INDEX_op_add8_v64:
> + case INDEX_op_add8_v128:
> + case INDEX_op_add16_v64:
> + case INDEX_op_add16_v128:
> + case INDEX_op_add32_v64:
> + case INDEX_op_add32_v128:
> + case INDEX_op_add64_v128:
> + case INDEX_op_sub8_v64:
> + case INDEX_op_sub8_v128:
> + case INDEX_op_sub16_v64:
> + case INDEX_op_sub16_v128:
> + case INDEX_op_sub32_v64:
> + case INDEX_op_sub32_v128:
> + case INDEX_op_sub64_v128:
> + case INDEX_op_and_v64:
> + case INDEX_op_and_v128:
> + case INDEX_op_andc_v64:
> + case INDEX_op_andc_v128:
> + case INDEX_op_or_v64:
> + case INDEX_op_or_v128:
> + case INDEX_op_xor_v64:
> + case INDEX_op_xor_v128:
> + return have_avx1 ? &x_x_x : &x_0_x;
> +
> + case INDEX_op_add8_v256:
> + case INDEX_op_add16_v256:
> + case INDEX_op_add32_v256:
> + case INDEX_op_add64_v256:
> + case INDEX_op_sub8_v256:
> + case INDEX_op_sub16_v256:
> + case INDEX_op_sub32_v256:
> + case INDEX_op_sub64_v256:
> + case INDEX_op_and_v256:
> + case INDEX_op_andc_v256:
> + case INDEX_op_or_v256:
> + case INDEX_op_xor_v256:
> + return &x_x_x;
> +
> default:
> break;
> }
> @@ -2725,9 +3018,16 @@ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
> static void tcg_target_init(TCGContext *s)
> {
> #ifdef CONFIG_CPUID_H
> - unsigned a, b, c, d;
> + unsigned a, b, c, d, b7 = 0;
> int max = __get_cpuid_max(0, 0);
>
> + if (max >= 7) {
> + /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs. */
> + __cpuid_count(7, 0, a, b7, c, d);
> + have_bmi1 = (b7 & bit_BMI) != 0;
> + have_bmi2 = (b7 & bit_BMI2) != 0;
> + }
> +
> if (max >= 1) {
> __cpuid(1, a, b, c, d);
> #ifndef have_cmov
> @@ -2736,17 +3036,26 @@ static void tcg_target_init(TCGContext *s)
> available, we'll use a small forward branch. */
> have_cmov = (d & bit_CMOV) != 0;
> #endif
> +#ifndef have_sse2
> + have_sse2 = (d & bit_SSE2) != 0;
> +#endif
> /* MOVBE is only available on Intel Atom and Haswell CPUs, so we
> need to probe for it. */
> have_movbe = (c & bit_MOVBE) != 0;
> have_popcnt = (c & bit_POPCNT) != 0;
> - }
>
> - if (max >= 7) {
> - /* BMI1 is available on AMD Piledriver and Intel Haswell CPUs. */
> - __cpuid_count(7, 0, a, b, c, d);
> - have_bmi1 = (b & bit_BMI) != 0;
> - have_bmi2 = (b & bit_BMI2) != 0;
> +#ifndef have_avx2
> + /* There are a number of things we must check before we can be
> + sure of not hitting invalid opcode. */
> + if (c & bit_OSXSAVE) {
> + unsigned xcrl, xcrh;
> + asm ("xgetbv" : "=a" (xcrl), "=d" (xcrh) : "c" (0));
> + if ((xcrl & 6) == 6) {
> + have_avx1 = (c & bit_AVX) != 0;
> + have_avx2 = (b7 & bit_AVX2) != 0;
> + }
> + }
> +#endif
> }
>
> max = __get_cpuid_max(0x80000000, 0);
> @@ -2763,6 +3072,13 @@ static void tcg_target_init(TCGContext *s)
> } else {
> tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_I32], 0, 0xff);
> }
> + if (have_sse2) {
> + tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V64], 0, 0xff0000);
> + tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V128], 0, 0xff0000);
> + }
> + if (have_avx2) {
> + tcg_regset_set32(tcg_target_available_regs[TCG_TYPE_V256], 0, 0xff0000);
> + }
>
> tcg_regset_clear(tcg_target_call_clobber_regs);
> tcg_regset_set_reg(tcg_target_call_clobber_regs, TCG_REG_EAX);
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-09-08 10:13 ` Alex Bennée
@ 2017-09-08 13:10 ` Alex Bennée
2017-09-10 2:44 ` Richard Henderson
0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 13:10 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Alex Bennée <alex.bennee@linaro.org> writes:
> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
<snip>
Also this commit breaks RISU:
qemu-aarch64 build/aarch64-linux-gnu/risu
testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
-t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace
Gives:
mismatch detail (master : apprentice):
V29 : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018
The insn is:
37c: 6f56a29d umull2 v29.4s, v20.8h, v6.h[1]
Which is odd because I didn't think we'd touched that.
You can find my bundle of testcases with trace files at:
http://people.linaro.org/~alex.bennee/testcases/arm64.risu/aarch64-patterns-v8dot0.tar.xz
Which is used in our master RISU tracking job:
https://validation.linaro.org/results/query/~alex.bennee/master-aarch64-risu-results
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
` (7 preceding siblings ...)
2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
@ 2017-09-08 13:49 ` Alex Bennée
2017-09-08 16:05 ` Richard Henderson
8 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-08 13:49 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> When Alex and I started talking about this topic, this is the direction
> I was thinking. The primary difference from Alex's version is that the
> interface on the target/cpu/ side uses offsets and not a faux temp. The
> secondary difference is that, for smaller vector sizes at least, I will
> expand to inline host vector operations. The use of explicit offsets
> aids that.
<snip>
OK, I think this is a lot more complete than my pass. I'm done with my
review for now; I look forward to the next version. It looks like most
of the pre-requisites are merged now?
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion
2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
@ 2017-09-08 16:05 ` Richard Henderson
0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-08 16:05 UTC (permalink / raw)
To: Alex Bennée; +Cc: qemu-devel, qemu-arm
On 09/08/2017 06:49 AM, Alex Bennée wrote:
>
> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> When Alex and I started talking about this topic, this is the direction
>> I was thinking. The primary difference from Alex's version is that the
>> interface on the target/cpu/ side uses offsets and not a faux temp. The
>> secondary difference is that, for smaller vector sizes at least, I will
>> expand to inline host vector operations. The use of explicit offsets
>> aids that.
> <snip>
>
> OK, I think this is a lot more complete than my pass. I'm done with my
> review for now; I look forward to the next version. It looks like most
> of the pre-requisites are merged now?
Yep, all lead-up patches are now in. Thanks for the review.
As I work on the next version, I'll do aarch64 host as well for comparison.
r~
* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
2017-09-07 16:58 ` Alex Bennée
@ 2017-09-10 1:43 ` Richard Henderson
2017-09-11 9:12 ` Alex Bennée
0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-09-10 1:43 UTC (permalink / raw)
To: Alex Bennée; +Cc: qemu-devel, qemu-arm
On 09/07/2017 09:58 AM, Alex Bennée wrote:
>> + switch (size + 4 * is_u) {
>
> Hmm I find this switch a little too magical. I mean I can see that the
> encoding abuses size for the final opcode when I look at the manual but
> it reads badly.
>
>> + case 0: /* AND */
>> + gvec_op = tcg_gen_gvec_and8;
>> + goto do_gvec;
>> + case 1: /* BIC */
>> + gvec_op = tcg_gen_gvec_andc8;
>> + goto do_gvec;
>> + case 2: /* ORR */
>> + gvec_op = tcg_gen_gvec_or8;
>> + goto do_gvec;
>> + case 3: /* ORN */
>> + gvec_op = tcg_gen_gvec_orc8;
>> + goto do_gvec;
>> + case 4: /* EOR */
>> + gvec_op = tcg_gen_gvec_xor8;
>> + goto do_gvec;
>> + do_gvec:
>> + gvec_op(vec_full_reg_offset(s, rd),
>> + vec_full_reg_offset(s, rn),
>> + vec_full_reg_offset(s, rm),
>> + is_q ? 16 : 8, vec_full_reg_size(s));
>> + return;
>
> No default case (although I guess we just fall through). What's wrong
> with just having a !is_u test with gvec_op = tbl[size] and skipping all
> the goto stuff?
Because that would still leave EOR out in the woods.
I do think this is the cleanest way to filter out these 5 operations.
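For reference, the block of the encoding being carved up here is (U, size):

  U=0: size 00 AND, 01 BIC, 10 ORR, 11 ORN
  U=1: size 00 EOR, 01 BSL, 10 BIT, 11 BIF

so size + 4 * is_u maps the five plain logical ops onto 0..4, and
BSL/BIT/BIF (5..7) fall through to the existing expansion.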
r~
* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-09-08 13:10 ` Alex Bennée
@ 2017-09-10 2:44 ` Richard Henderson
2017-09-11 9:07 ` Alex Bennée
0 siblings, 1 reply; 36+ messages in thread
From: Richard Henderson @ 2017-09-10 2:44 UTC (permalink / raw)
To: Alex Bennée; +Cc: qemu-devel, qemu-arm
On 09/08/2017 06:10 AM, Alex Bennée wrote:
> Also this commit breaks RISU:
>
> qemu-aarch64 build/aarch64-linux-gnu/risu
> testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
> -t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace
>
> Gives:
>
> mismatch detail (master : apprentice):
> V29 : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018
>
> The insn is:
>
> 37c: 6f56a29d umull2 v29.4s, v20.8h, v6.h[1]
>
> Which is odd because I didn't think we'd touched that.
Indeed we didn't. Still, I'll check it out next week.
r~
* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-09-10 2:44 ` Richard Henderson
@ 2017-09-11 9:07 ` Alex Bennée
2017-09-12 13:52 ` Richard Henderson
0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-11 9:07 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> On 09/08/2017 06:10 AM, Alex Bennée wrote:
>> Also this commit breaks RISU:
>>
>> qemu-aarch64 build/aarch64-linux-gnu/risu
>> testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
>> -t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace
>>
>> Gives:
>>
>> mismatch detail (master : apprentice):
>> V29 : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018
>>
>> The insn is:
>>
>> 37c: 6f56a29d umull2 v29.4s, v20.8h, v6.h[1]
>>
>> Which is odd because I didn't think we'd touched that.
>
> Indeed we didn't. Still, I'll check it out next week.
OK, it would help if I had objdumped the right file:
36c: 0e781fdd bic v29.8b, v30.8b, v24.8b
370: 00005af0 .inst 0x00005af0 ; undefined
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
2017-09-10 1:43 ` Richard Henderson
@ 2017-09-11 9:12 ` Alex Bennée
2017-09-11 18:09 ` Richard Henderson
0 siblings, 1 reply; 36+ messages in thread
From: Alex Bennée @ 2017-09-11 9:12 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel, qemu-arm
Richard Henderson <richard.henderson@linaro.org> writes:
> On 09/07/2017 09:58 AM, Alex Bennée wrote:
>>> + switch (size + 4 * is_u) {
>>
>> Hmm I find this switch a little too magical. I mean I can see that the
>> encoding abuses size for the final opcode when I look at the manual but
>> it reads badly.
>>
>>> + case 0: /* AND */
>>> + gvec_op = tcg_gen_gvec_and8;
>>> + goto do_gvec;
>>> + case 1: /* BIC */
>>> + gvec_op = tcg_gen_gvec_andc8;
>>> + goto do_gvec;
>>> + case 2: /* ORR */
>>> + gvec_op = tcg_gen_gvec_or8;
>>> + goto do_gvec;
>>> + case 3: /* ORN */
>>> + gvec_op = tcg_gen_gvec_orc8;
>>> + goto do_gvec;
>>> + case 4: /* EOR */
>>> + gvec_op = tcg_gen_gvec_xor8;
>>> + goto do_gvec;
>>> + do_gvec:
>>> + gvec_op(vec_full_reg_offset(s, rd),
>>> + vec_full_reg_offset(s, rn),
>>> + vec_full_reg_offset(s, rm),
>>> + is_q ? 16 : 8, vec_full_reg_size(s));
>>> + return;
>>
>> No default case (although I guess we just fall through). What's wrong
>> with just having a !is_u test with gvec_op = tbl[size] and skipping all
>> the goto stuff?
>
> Because that would still leave EOR out in the woods.
> I do think this is the cleanest way to filter out these 5 operations.
Is this going to look better if the other operations in this branch of
the decode are converted as well?
--
Alex Bennée
* Re: [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic
2017-09-11 9:12 ` Alex Bennée
@ 2017-09-11 18:09 ` Richard Henderson
0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-11 18:09 UTC (permalink / raw)
To: Alex Bennée; +Cc: qemu-devel, qemu-arm
On 09/11/2017 02:12 AM, Alex Bennée wrote:
>
> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> On 09/07/2017 09:58 AM, Alex Bennée wrote:
>>>> + switch (size + 4 * is_u) {
>>>
>>> Hmm I find this switch a little too magical. I mean I can see that the
>>> encoding abuses size for the final opcode when I look at the manual but
>>> it reads badly.
>>>
>>>> + case 0: /* AND */
>>>> + gvec_op = tcg_gen_gvec_and8;
>>>> + goto do_gvec;
>>>> + case 1: /* BIC */
>>>> + gvec_op = tcg_gen_gvec_andc8;
>>>> + goto do_gvec;
>>>> + case 2: /* ORR */
>>>> + gvec_op = tcg_gen_gvec_or8;
>>>> + goto do_gvec;
>>>> + case 3: /* ORN */
>>>> + gvec_op = tcg_gen_gvec_orc8;
>>>> + goto do_gvec;
>>>> + case 4: /* EOR */
>>>> + gvec_op = tcg_gen_gvec_xor8;
>>>> + goto do_gvec;
>>>> + do_gvec:
>>>> + gvec_op(vec_full_reg_offset(s, rd),
>>>> + vec_full_reg_offset(s, rn),
>>>> + vec_full_reg_offset(s, rm),
>>>> + is_q ? 16 : 8, vec_full_reg_size(s));
>>>> + return;
>>>
>>> No default case (although I guess we just fall through). What's wrong
>>> with just having a !is_u test with gvec_op = tbl[size] and skipping all
>>> the goto stuff?
>>
>> Because that would still leave EOR out in the woods.
>> I do think this is the cleanest way to filter out these 5 operations.
>
> Is this going to look better if the other operations in this branch of
> the decode are converted as well?
It might do. I'll have to think about those some more.
Indeed, perhaps that's exactly what I ought to do. Those are complex logical
operations which certainly will not get their own opcodes, but are "simple" in
that they can be implemented in terms of vector xor+and, so there's no reason
we couldn't expand those inline for all vector-supporting hosts.
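If I remember the ARM pseudocode correctly, it already gives them in
exactly that xor+and form:

  BSL:  Vd = Vm ^ ((Vm ^ Vn) & Vd)
  BIT:  Vd = Vd ^ ((Vd ^ Vn) & Vm)
  BIF:  Vd = Vd ^ ((Vd ^ Vn) & ~Vm)

i.e. two xors plus an and (or andc) each, which every vector host can do.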
r~
* Re: [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations
2017-09-11 9:07 ` Alex Bennée
@ 2017-09-12 13:52 ` Richard Henderson
0 siblings, 0 replies; 36+ messages in thread
From: Richard Henderson @ 2017-09-12 13:52 UTC (permalink / raw)
To: Alex Bennée; +Cc: qemu-devel, qemu-arm
On 09/11/2017 02:07 AM, Alex Bennée wrote:
>
> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> On 09/08/2017 06:10 AM, Alex Bennée wrote:
>>> Also this commit breaks RISU:
>>>
>>> qemu-aarch64 build/aarch64-linux-gnu/risu
>>> testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin \
>>> -t testcases.aarch64/insn_ANDSi_RES8_ANDS_RES_ANDv_ASRV__INC.risu.bin.trace
>>>
>>> Gives:
>>>
>>> mismatch detail (master : apprentice):
>>> V29 : 000000000000000005388083c1444242 vs 00000000000000002a000e0416a30018
>>>
>>> The insn is:
>>>
>>> 37c: 6f56a29d umull2 v29.4s, v20.8h, v6.h[1]
>>>
>>> Which is odd because I didn't think we'd touched that.
>>
>> Indeed we didn't. Still, I'll check it out next week.
>
> OK it would help if I had objdumped the right file:
>
> 36c: 0e781fdd bic v29.8b, v30.8b, v24.8b
> 370: 00005af0 .inst 0x00005af0 ; undefined
Thanks. The SSE pandn operand order is ... surprising. Even though I
know that, I still managed to get it wrong.
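For anyone else who trips over this (from memory, so double-check against
the manual):

  pandn  %xmm_s, %xmm_d            # SSE: d = ~d & s
  vpandn %xmm_b, %xmm_a, %xmm_d    # AVX, AT&T order: d = ~a & b

while TCG's andc wants d = a & ~b, so the complemented operand must go in
the first source slot, not the second.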
Fixed for v2.
r~
Thread overview: 36+ messages
2017-08-17 23:01 [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 1/8] tcg: Add generic vector infrastructure and ops for add/sub/logic Richard Henderson
2017-08-30 1:31 ` Philippe Mathieu-Daudé
2017-09-01 20:38 ` Richard Henderson
2017-09-07 16:34 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 2/8] target/arm: Use generic vector infrastructure for aa64 add/sub/logic Richard Henderson
2017-09-07 16:58 ` Alex Bennée
2017-09-10 1:43 ` Richard Henderson
2017-09-11 9:12 ` Alex Bennée
2017-09-11 18:09 ` Richard Henderson
2017-08-17 23:01 ` [Qemu-devel] [PATCH 3/8] tcg: Add types for host vectors Richard Henderson
2017-08-17 23:46 ` Philippe Mathieu-Daudé
2017-09-07 18:18 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 4/8] tcg: Add operations " Richard Henderson
2017-08-30 1:34 ` Philippe Mathieu-Daudé
2017-09-07 19:00 ` Alex Bennée
2017-09-07 19:02 ` Richard Henderson
2017-09-08 9:28 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 5/8] tcg: Add tcg_op_supported Richard Henderson
2017-08-17 23:44 ` Philippe Mathieu-Daudé
2017-09-07 19:02 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 6/8] tcg: Add INDEX_op_invalid Richard Henderson
2017-08-17 23:45 ` Philippe Mathieu-Daudé
2017-09-08 9:30 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 7/8] tcg: Expand target vector ops with host vector ops Richard Henderson
2017-09-08 9:34 ` Alex Bennée
2017-08-17 23:01 ` [Qemu-devel] [PATCH 8/8] tcg/i386: Add vector operations Richard Henderson
2017-08-22 13:15 ` Alex Bennée
2017-08-23 19:02 ` Richard Henderson
2017-09-08 10:13 ` Alex Bennée
2017-09-08 13:10 ` Alex Bennée
2017-09-10 2:44 ` Richard Henderson
2017-09-11 9:07 ` Alex Bennée
2017-09-12 13:52 ` Richard Henderson
2017-09-08 13:49 ` [Qemu-devel] [PATCH 0/8] TCG vectorization and example conversion Alex Bennée
2017-09-08 16:05 ` Richard Henderson