* [Qemu-devel] [PATCH 0/2] target/arm: make use of new gvec expanders
@ 2019-05-18 19:19 Richard Henderson
  2019-05-18 19:19 ` [Qemu-devel] [PATCH 1/2] target/arm: Vectorize USHL and SSHL Richard Henderson
  2019-05-18 19:19 ` [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel Richard Henderson
  0 siblings, 2 replies; 12+ messages in thread
From: Richard Henderson @ 2019-05-18 19:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

Based-on: <20190518190157.21255-1-richard.henderson@linaro.org>
Aka "tcg: misc gvec improvements".

We've added (or are adding) generic support for variable vector shifts
and bitsel.  This trivially replaces the implementations of BSL, BIT,
and BIF.  It enables a reasonable implementation of {U,S}SHL.


r~


Richard Henderson (2):
  target/arm: Vectorize USHL and SSHL
  target/arm: Use tcg_gen_gvec_bitsel

 target/arm/helper.h        |  15 +-
 target/arm/translate-a64.h |   2 +
 target/arm/translate.h     |   9 +-
 target/arm/neon_helper.c   |  33 ----
 target/arm/translate-a64.c |  33 ++--
 target/arm/translate.c     | 362 ++++++++++++++++++++++++++++---------
 target/arm/vec_helper.c    | 176 ++++++++++++++++++
 7 files changed, 486 insertions(+), 144 deletions(-)

-- 
2.17.1




* [Qemu-devel] [PATCH 1/2] target/arm: Vectorize USHL and SSHL
  2019-05-18 19:19 [Qemu-devel] [PATCH 0/2] target/arm: make use of new gvec expanders Richard Henderson
@ 2019-05-18 19:19 ` Richard Henderson
  2019-05-23 12:44   ` Peter Maydell
  2019-05-18 19:19 ` [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel Richard Henderson
  1 sibling, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2019-05-18 19:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

These instructions shift left or right depending on the sign
of the input, and 7 bits are significant to the shift.  This
requires several masks and selects in addition to the actual
shifts to form the complete answer.

That said, the operation is still a small improvement even for
two 64-bit elements -- 13 vector operations instead of 2 * 7
integer operations.
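
As a per-element reference (illustrative only, not part of the patch;
the out-of-line helpers below are the authoritative version), the
operation being vectorized looks like this for one 32-bit lane of USHL:

    #include <stdint.h>

    /* The shift count is the signed low 8 bits of the second operand;
     * a negative count shifts right, and shifts by 32 or more in
     * either direction produce zero for the unsigned variant.
     */
    static uint32_t ushl32_ref(uint32_t n, uint32_t m)
    {
        int8_t sh = (int8_t)m;
        if (sh >= 0) {
            return sh < 32 ? n << sh : 0;
        } else {
            return sh > -32 ? n >> -sh : 0;
        }
    }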

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/helper.h        |  15 +-
 target/arm/translate.h     |   6 +
 target/arm/neon_helper.c   |  33 -----
 target/arm/translate-a64.c |  18 +--
 target/arm/translate.c     | 284 +++++++++++++++++++++++++++++++++++--
 target/arm/vec_helper.c    | 176 +++++++++++++++++++++++
 6 files changed, 466 insertions(+), 66 deletions(-)

diff --git a/target/arm/helper.h b/target/arm/helper.h
index 132aa1682e..ac2d8fb407 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -297,14 +297,8 @@ DEF_HELPER_2(neon_abd_s16, i32, i32, i32)
 DEF_HELPER_2(neon_abd_u32, i32, i32, i32)
 DEF_HELPER_2(neon_abd_s32, i32, i32, i32)
 
-DEF_HELPER_2(neon_shl_u8, i32, i32, i32)
-DEF_HELPER_2(neon_shl_s8, i32, i32, i32)
 DEF_HELPER_2(neon_shl_u16, i32, i32, i32)
 DEF_HELPER_2(neon_shl_s16, i32, i32, i32)
-DEF_HELPER_2(neon_shl_u32, i32, i32, i32)
-DEF_HELPER_2(neon_shl_s32, i32, i32, i32)
-DEF_HELPER_2(neon_shl_u64, i64, i64, i64)
-DEF_HELPER_2(neon_shl_s64, i64, i64, i64)
 DEF_HELPER_2(neon_rshl_u8, i32, i32, i32)
 DEF_HELPER_2(neon_rshl_s8, i32, i32, i32)
 DEF_HELPER_2(neon_rshl_u16, i32, i32, i32)
@@ -691,6 +685,15 @@ DEF_HELPER_FLAGS_2(frint64_s, TCG_CALL_NO_RWG, f32, f32, ptr)
 DEF_HELPER_FLAGS_2(frint32_d, TCG_CALL_NO_RWG, f64, f64, ptr)
 DEF_HELPER_FLAGS_2(frint64_d, TCG_CALL_NO_RWG, f64, f64, ptr)
 
+DEF_HELPER_FLAGS_4(gvec_sshl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sshl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sshl_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sshl_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_ushl_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_ushl_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #include "helper-sve.h"
diff --git a/target/arm/translate.h b/target/arm/translate.h
index c2348def0d..f357b767cb 100644
--- a/target/arm/translate.h
+++ b/target/arm/translate.h
@@ -244,6 +244,8 @@ extern const GVecGen3 bif_op;
 extern const GVecGen3 mla_op[4];
 extern const GVecGen3 mls_op[4];
 extern const GVecGen3 cmtst_op[4];
+extern const GVecGen3 sshl_op[4];
+extern const GVecGen3 ushl_op[4];
 extern const GVecGen2i ssra_op[4];
 extern const GVecGen2i usra_op[4];
 extern const GVecGen2i sri_op[4];
@@ -253,6 +255,10 @@ extern const GVecGen4 sqadd_op[4];
 extern const GVecGen4 uqsub_op[4];
 extern const GVecGen4 sqsub_op[4];
 void gen_cmtst_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
+void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b);
+void gen_sshl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b);
+void gen_ushl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
+void gen_sshl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
 
 /*
  * Forward to the isar_feature_* tests given a DisasContext pointer.
diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
index 4259056723..c581ffb7d3 100644
--- a/target/arm/neon_helper.c
+++ b/target/arm/neon_helper.c
@@ -615,24 +615,9 @@ NEON_VOP(abd_u32, neon_u32, 1)
     } else { \
         dest = src1 << tmp; \
     }} while (0)
-NEON_VOP(shl_u8, neon_u8, 4)
 NEON_VOP(shl_u16, neon_u16, 2)
-NEON_VOP(shl_u32, neon_u32, 1)
 #undef NEON_FN
 
-uint64_t HELPER(neon_shl_u64)(uint64_t val, uint64_t shiftop)
-{
-    int8_t shift = (int8_t)shiftop;
-    if (shift >= 64 || shift <= -64) {
-        val = 0;
-    } else if (shift < 0) {
-        val >>= -shift;
-    } else {
-        val <<= shift;
-    }
-    return val;
-}
-
 #define NEON_FN(dest, src1, src2) do { \
     int8_t tmp; \
     tmp = (int8_t)src2; \
@@ -645,27 +630,9 @@ uint64_t HELPER(neon_shl_u64)(uint64_t val, uint64_t shiftop)
     } else { \
         dest = src1 << tmp; \
     }} while (0)
-NEON_VOP(shl_s8, neon_s8, 4)
 NEON_VOP(shl_s16, neon_s16, 2)
-NEON_VOP(shl_s32, neon_s32, 1)
 #undef NEON_FN
 
-uint64_t HELPER(neon_shl_s64)(uint64_t valop, uint64_t shiftop)
-{
-    int8_t shift = (int8_t)shiftop;
-    int64_t val = valop;
-    if (shift >= 64) {
-        val = 0;
-    } else if (shift <= -64) {
-        val >>= 63;
-    } else if (shift < 0) {
-        val >>= -shift;
-    } else {
-        val <<= shift;
-    }
-    return val;
-}
-
 #define NEON_FN(dest, src1, src2) do { \
     int8_t tmp; \
     tmp = (int8_t)src2; \
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index b7c5a928b4..2c280243a9 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -8845,9 +8845,9 @@ static void handle_3same_64(DisasContext *s, int opcode, bool u,
         break;
     case 0x8: /* SSHL, USHL */
         if (u) {
-            gen_helper_neon_shl_u64(tcg_rd, tcg_rn, tcg_rm);
+            gen_ushl_i64(tcg_rd, tcg_rn, tcg_rm);
         } else {
-            gen_helper_neon_shl_s64(tcg_rd, tcg_rn, tcg_rm);
+            gen_sshl_i64(tcg_rd, tcg_rn, tcg_rm);
         }
         break;
     case 0x9: /* SQSHL, UQSHL */
@@ -11242,6 +11242,10 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
                        is_q ? 16 : 8, vec_full_reg_size(s),
                        (u ? uqsub_op : sqsub_op) + size);
         return;
+    case 0x08: /* SSHL, USHL */
+        gen_gvec_op3(s, is_q, rd, rn, rm,
+                     u ? &ushl_op[size] : &sshl_op[size]);
+        return;
     case 0x0c: /* SMAX, UMAX */
         if (u) {
             gen_gvec_fn3(s, is_q, rd, rn, rm, tcg_gen_gvec_umax, size);
@@ -11357,16 +11361,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
                 genfn = fns[size][u];
                 break;
             }
-            case 0x8: /* SSHL, USHL */
-            {
-                static NeonGenTwoOpFn * const fns[3][2] = {
-                    { gen_helper_neon_shl_s8, gen_helper_neon_shl_u8 },
-                    { gen_helper_neon_shl_s16, gen_helper_neon_shl_u16 },
-                    { gen_helper_neon_shl_s32, gen_helper_neon_shl_u32 },
-                };
-                genfn = fns[size][u];
-                break;
-            }
             case 0x9: /* SQSHL, UQSHL */
             {
                 static NeonGenTwoOpEnvFn * const fns[3][2] = {
diff --git a/target/arm/translate.c b/target/arm/translate.c
index dd053c80d6..49dfcdc90d 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -5374,13 +5374,13 @@ static inline void gen_neon_shift_narrow(int size, TCGv_i32 var, TCGv_i32 shift,
         if (u) {
             switch (size) {
             case 1: gen_helper_neon_shl_u16(var, var, shift); break;
-            case 2: gen_helper_neon_shl_u32(var, var, shift); break;
+            case 2: gen_ushl_i32(var, var, shift); break;
             default: abort();
             }
         } else {
             switch (size) {
             case 1: gen_helper_neon_shl_s16(var, var, shift); break;
-            case 2: gen_helper_neon_shl_s32(var, var, shift); break;
+            case 2: gen_sshl_i32(var, var, shift); break;
             default: abort();
             }
         }
@@ -6259,6 +6259,266 @@ const GVecGen3 cmtst_op[4] = {
       .vece = MO_64 },
 };
 
+void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
+{
+    TCGv_i32 lval = tcg_temp_new_i32();
+    TCGv_i32 rval = tcg_temp_new_i32();
+    TCGv_i32 lsh = tcg_temp_new_i32();
+    TCGv_i32 rsh = tcg_temp_new_i32();
+    TCGv_i32 zero = tcg_const_i32(0);
+    TCGv_i32 max = tcg_const_i32(32);
+
+    /*
+     * Perform possibly out of range shifts, trusting that the operation
+     * does not trap.  Discard unused results after the fact.
+     */
+    tcg_gen_ext8s_i32(lsh, b);
+    tcg_gen_neg_i32(rsh, lsh);
+    tcg_gen_shl_i32(lval, a, lsh);
+    tcg_gen_shr_i32(rval, a, rsh);
+    tcg_gen_movcond_i32(TCG_COND_LTU, d, lsh, max, lval, zero);
+    tcg_gen_movcond_i32(TCG_COND_LTU, d, rsh, max, rval, d);
+
+    tcg_temp_free_i32(lval);
+    tcg_temp_free_i32(rval);
+    tcg_temp_free_i32(lsh);
+    tcg_temp_free_i32(rsh);
+    tcg_temp_free_i32(zero);
+    tcg_temp_free_i32(max);
+}
+
+void gen_ushl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 lval = tcg_temp_new_i64();
+    TCGv_i64 rval = tcg_temp_new_i64();
+    TCGv_i64 lsh = tcg_temp_new_i64();
+    TCGv_i64 rsh = tcg_temp_new_i64();
+    TCGv_i64 zero = tcg_const_i64(0);
+    TCGv_i64 max = tcg_const_i64(64);
+
+    /*
+     * Perform possibly out of range shifts, trusting that the operation
+     * does not trap.  Discard unused results after the fact.
+     */
+    tcg_gen_ext8s_i64(lsh, b);
+    tcg_gen_neg_i64(rsh, lsh);
+    tcg_gen_shl_i64(lval, a, lsh);
+    tcg_gen_shr_i64(rval, a, rsh);
+    tcg_gen_movcond_i64(TCG_COND_LTU, d, lsh, max, lval, zero);
+    tcg_gen_movcond_i64(TCG_COND_LTU, d, rsh, max, rval, d);
+
+    tcg_temp_free_i64(lval);
+    tcg_temp_free_i64(rval);
+    tcg_temp_free_i64(lsh);
+    tcg_temp_free_i64(rsh);
+    tcg_temp_free_i64(zero);
+    tcg_temp_free_i64(max);
+}
+
+static void gen_ushl_vec(unsigned vece, TCGv_vec d, TCGv_vec a, TCGv_vec b)
+{
+    TCGv_vec lval = tcg_temp_new_vec_matching(d);
+    TCGv_vec rval = tcg_temp_new_vec_matching(d);
+    TCGv_vec lsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec rsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec msk, max;
+
+    /*
+     * Since we don't have a sign-extend vector primitive, negate and mask.
+     * With the out-of-range check below, we'll select the correct answer.
+     */
+    tcg_gen_neg_vec(vece, rsh, b);
+    if (vece == MO_8) {
+        tcg_gen_mov_vec(lsh, b);
+    } else {
+        msk = tcg_temp_new_vec_matching(d);
+        tcg_gen_dupi_vec(vece, msk, 0xff);
+        tcg_gen_and_vec(vece, lsh, b, msk);
+        tcg_gen_and_vec(vece, rsh, rsh, msk);
+        tcg_temp_free_vec(msk);
+    }
+
+    /*
+     * Perform possibly out of range shifts, trusting that the operation
+     * does not trap.  Discard unused results after the fact.
+     */
+    tcg_gen_shlv_vec(vece, lval, a, lsh);
+    tcg_gen_shrv_vec(vece, rval, a, rsh);
+
+    max = tcg_temp_new_vec_matching(d);
+    tcg_gen_dupi_vec(vece, max, 8 << vece);
+    tcg_gen_cmp_vec(TCG_COND_LTU, vece, lsh, lsh, max);
+    tcg_gen_cmp_vec(TCG_COND_LTU, vece, rsh, rsh, max);
+    tcg_temp_free_vec(max);
+
+    tcg_gen_and_vec(vece, lval, lval, lsh);
+    tcg_gen_and_vec(vece, rval, rval, rsh);
+    tcg_gen_or_vec(vece, d, lval, rval);
+
+    tcg_temp_free_vec(lval);
+    tcg_temp_free_vec(rval);
+    tcg_temp_free_vec(lsh);
+    tcg_temp_free_vec(rsh);
+}
+
+static const TCGOpcode ushl_list[] = {
+    INDEX_op_neg_vec, INDEX_op_shlv_vec,
+    INDEX_op_shrv_vec, INDEX_op_cmp_vec, 0
+};
+
+const GVecGen3 ushl_op[4] = {
+    { .fniv = gen_ushl_vec,
+      .fno = gen_helper_gvec_ushl_b,
+      .opt_opc = ushl_list,
+      .vece = MO_8 },
+    { .fniv = gen_ushl_vec,
+      .fno = gen_helper_gvec_ushl_h,
+      .opt_opc = ushl_list,
+      .vece = MO_16 },
+    { .fni4 = gen_ushl_i32,
+      .fniv = gen_ushl_vec,
+      .fno = gen_helper_gvec_ushl_s,
+      .opt_opc = ushl_list,
+      .vece = MO_32 },
+    { .fni8 = gen_ushl_i64,
+      .fniv = gen_ushl_vec,
+      .fno = gen_helper_gvec_ushl_d,
+      .opt_opc = ushl_list,
+      .vece = MO_64 },
+};
+
+void gen_sshl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
+{
+    TCGv_i32 lval = tcg_temp_new_i32();
+    TCGv_i32 rval = tcg_temp_new_i32();
+    TCGv_i32 lsh = tcg_temp_new_i32();
+    TCGv_i32 rsh = tcg_temp_new_i32();
+    TCGv_i32 zero = tcg_const_i32(0);
+    TCGv_i32 max = tcg_const_i32(31);
+
+    /*
+     * Perform possibly out of range shifts, trusting that the operation
+     * does not trap.  Discard unused results after the fact.
+     */
+    tcg_gen_ext8s_i32(lsh, b);
+    tcg_gen_neg_i32(rsh, lsh);
+    tcg_gen_shl_i32(lval, a, lsh);
+    tcg_gen_umin_i32(rsh, rsh, max);
+    tcg_gen_sar_i32(rval, a, rsh);
+    tcg_gen_movcond_i32(TCG_COND_LEU, lval, lsh, max, lval, zero);
+    tcg_gen_movcond_i32(TCG_COND_LT, d, lsh, zero, rval, lval);
+
+    tcg_temp_free_i32(lval);
+    tcg_temp_free_i32(rval);
+    tcg_temp_free_i32(lsh);
+    tcg_temp_free_i32(rsh);
+    tcg_temp_free_i32(zero);
+    tcg_temp_free_i32(max);
+}
+
+void gen_sshl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 lval = tcg_temp_new_i64();
+    TCGv_i64 rval = tcg_temp_new_i64();
+    TCGv_i64 lsh = tcg_temp_new_i64();
+    TCGv_i64 rsh = tcg_temp_new_i64();
+    TCGv_i64 zero = tcg_const_i64(0);
+    TCGv_i64 max = tcg_const_i64(63);
+
+    /*
+     * Perform possibly out of range shifts, trusting that the operation
+     * does not trap.  Discard unused results after the fact.
+     */
+    tcg_gen_ext8s_i64(lsh, b);
+    tcg_gen_neg_i64(rsh, lsh);
+    tcg_gen_shl_i64(lval, a, lsh);
+    tcg_gen_umin_i64(rsh, rsh, max);
+    tcg_gen_sar_i64(rval, a, rsh);
+    tcg_gen_movcond_i64(TCG_COND_LEU, lval, lsh, max, lval, zero);
+    tcg_gen_movcond_i64(TCG_COND_LT, d, lsh, zero, rval, lval);
+
+    tcg_temp_free_i64(lval);
+    tcg_temp_free_i64(rval);
+    tcg_temp_free_i64(lsh);
+    tcg_temp_free_i64(rsh);
+    tcg_temp_free_i64(zero);
+    tcg_temp_free_i64(max);
+}
+
+static void gen_sshl_vec(unsigned vece, TCGv_vec d, TCGv_vec a, TCGv_vec b)
+{
+    TCGv_vec lval = tcg_temp_new_vec_matching(d);
+    TCGv_vec rval = tcg_temp_new_vec_matching(d);
+    TCGv_vec lsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec rsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec tmp = tcg_temp_new_vec_matching(d);
+
+    /*
+     * Since we don't have a sign-extend vector primitive, negate and mask.
+     * With the out-of-range check below, we'll select the correct answer.
+     */
+    tcg_gen_neg_vec(vece, rsh, b);
+    if (vece == MO_8) {
+        tcg_gen_mov_vec(lsh, b);
+    } else {
+        tcg_gen_dupi_vec(vece, tmp, 0xff);
+        tcg_gen_and_vec(vece, lsh, b, tmp);
+        tcg_gen_and_vec(vece, rsh, rsh, tmp);
+    }
+
+    /* Bound rsh so out of bound right shift gets -1.  */
+    tcg_gen_dupi_vec(vece, tmp, (8 << vece) - 1);
+    tcg_gen_umin_vec(vece, rsh, rsh, tmp);
+    tcg_gen_cmp_vec(TCG_COND_GT, vece, tmp, lsh, tmp);
+
+    tcg_gen_shlv_vec(vece, lval, a, lsh);
+    tcg_gen_sarv_vec(vece, rval, a, rsh);
+
+    /* Select in-bound left shift.  */
+    tcg_gen_andc_vec(vece, lval, lval, tmp);
+
+    /* Select between left and right shift.  */
+    if (vece == MO_8) {
+        tcg_gen_dupi_vec(vece, tmp, 0);
+        tcg_gen_cmpsel_vec(TCG_COND_LT, vece, d, lsh, tmp, rval, lval);
+    } else {
+        tcg_gen_dupi_vec(vece, tmp, 0x80);
+        tcg_gen_cmpsel_vec(TCG_COND_LT, vece, d, lsh, tmp, lval, rval);
+    }
+
+    tcg_temp_free_vec(lval);
+    tcg_temp_free_vec(rval);
+    tcg_temp_free_vec(lsh);
+    tcg_temp_free_vec(rsh);
+    tcg_temp_free_vec(tmp);
+}
+
+static const TCGOpcode sshl_list[] = {
+    INDEX_op_neg_vec, INDEX_op_umin_vec, INDEX_op_shlv_vec,
+    INDEX_op_sarv_vec, INDEX_op_cmp_vec, INDEX_op_cmpsel_vec, 0
+};
+
+const GVecGen3 sshl_op[4] = {
+    { .fniv = gen_sshl_vec,
+      .fno = gen_helper_gvec_sshl_b,
+      .opt_opc = sshl_list,
+      .vece = MO_8 },
+    { .fniv = gen_sshl_vec,
+      .fno = gen_helper_gvec_sshl_h,
+      .opt_opc = sshl_list,
+      .vece = MO_16 },
+    { .fni4 = gen_sshl_i32,
+      .fniv = gen_sshl_vec,
+      .fno = gen_helper_gvec_sshl_s,
+      .opt_opc = sshl_list,
+      .vece = MO_32 },
+    { .fni8 = gen_sshl_i64,
+      .fniv = gen_sshl_vec,
+      .fno = gen_helper_gvec_sshl_d,
+      .opt_opc = sshl_list,
+      .vece = MO_64 },
+};
+
 static void gen_uqadd_vec(unsigned vece, TCGv_vec t, TCGv_vec sat,
                           TCGv_vec a, TCGv_vec b)
 {
@@ -6662,6 +6922,11 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                                   vec_size, vec_size);
             }
             return 0;
+
+        case NEON_3R_VSHL:
+            tcg_gen_gvec_3(rd_ofs, rn_ofs, rm_ofs, vec_size, vec_size,
+                           u ? &ushl_op[size] : &sshl_op[size]);
+            return 0;
         }
 
         if (size == 3) {
@@ -6670,13 +6935,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                 neon_load_reg64(cpu_V0, rn + pass);
                 neon_load_reg64(cpu_V1, rm + pass);
                 switch (op) {
-                case NEON_3R_VSHL:
-                    if (u) {
-                        gen_helper_neon_shl_u64(cpu_V0, cpu_V1, cpu_V0);
-                    } else {
-                        gen_helper_neon_shl_s64(cpu_V0, cpu_V1, cpu_V0);
-                    }
-                    break;
                 case NEON_3R_VQSHL:
                     if (u) {
                         gen_helper_neon_qshl_u64(cpu_V0, cpu_env,
@@ -6711,7 +6969,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
         }
         pairwise = 0;
         switch (op) {
-        case NEON_3R_VSHL:
         case NEON_3R_VQSHL:
         case NEON_3R_VRSHL:
         case NEON_3R_VQRSHL:
@@ -6791,9 +7048,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
         case NEON_3R_VHSUB:
             GEN_NEON_INTEGER_OP(hsub);
             break;
-        case NEON_3R_VSHL:
-            GEN_NEON_INTEGER_OP(shl);
-            break;
         case NEON_3R_VQSHL:
             GEN_NEON_INTEGER_OP_ENV(qshl);
             break;
@@ -7202,9 +7456,9 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                             }
                         } else {
                             if (input_unsigned) {
-                                gen_helper_neon_shl_u64(cpu_V0, in, tmp64);
+                                gen_ushl_i64(cpu_V0, in, tmp64);
                             } else {
-                                gen_helper_neon_shl_s64(cpu_V0, in, tmp64);
+                                gen_sshl_i64(cpu_V0, in, tmp64);
                             }
                         }
                         tmp = tcg_temp_new_i32();
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index dedef62403..9f8eee5611 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -1046,3 +1046,179 @@ void HELPER(gvec_fmlal_idx_a64)(void *vd, void *vn, void *vm,
     do_fmlal_idx(vd, vn, vm, &env->vfp.fp_status, desc,
                  get_flush_inputs_to_zero(&env->vfp.fp_status_f16));
 }
+
+void HELPER(gvec_sshl_b)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    int8_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz; ++i) {
+        int8_t mm = m[i];
+        int8_t nn = n[i];
+        int8_t res = 0;
+        if (mm >= 0) {
+            if (mm < 8) {
+                res = nn << mm;
+            }
+        } else {
+            res = nn >> (mm > -8 ? -mm : 7);
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_sshl_h)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    int16_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 2; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        int16_t nn = n[i];
+        int16_t res = 0;
+        if (mm >= 0) {
+            if (mm < 16) {
+                res = nn << mm;
+            }
+        } else {
+            res = nn >> (mm > -16 ? -mm : 15);
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_sshl_s)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    int32_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 4; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        int32_t nn = n[i];
+        int32_t res = 0;
+        if (mm >= 0) {
+            if (mm < 32) {
+                res = nn << mm;
+            }
+        } else {
+            res = nn >> (mm > -32 ? -mm : 31);
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_sshl_d)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    int64_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 8; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        int64_t nn = n[i];
+        int64_t res = 0;
+        if (mm >= 0) {
+            if (mm < 64) {
+                res = nn << mm;
+            }
+        } else {
+            res = nn >> (mm > -64 ? -mm : 63);
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_ushl_b)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    uint8_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz; ++i) {
+        int8_t mm = m[i];
+        uint8_t nn = n[i];
+        uint8_t res = 0;
+        if (mm >= 0) {
+            if (mm < 8) {
+                res = nn << mm;
+            }
+        } else {
+            if (mm > -8) {
+                res = nn >> -mm;
+            }
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_ushl_h)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    uint16_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 2; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        uint16_t nn = n[i];
+        uint16_t res = 0;
+        if (mm >= 0) {
+            if (mm < 16) {
+                res = nn << mm;
+            }
+        } else {
+            if (mm > -16) {
+                res = nn >> -mm;
+            }
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_ushl_s)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    uint32_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 4; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        uint32_t nn = n[i];
+        uint32_t res = 0;
+        if (mm >= 0) {
+            if (mm < 32) {
+                res = nn << mm;
+            }
+        } else {
+            if (mm > -32) {
+                res = nn >> -mm;
+            }
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_ushl_d)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    uint64_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 8; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        uint64_t nn = n[i];
+        uint64_t res = 0;
+        if (mm >= 0) {
+            if (mm < 64) {
+                res = nn << mm;
+            }
+        } else {
+            if (mm > -64) {
+                res = nn >> -mm;
+            }
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
-- 
2.17.1




* [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel
  2019-05-18 19:19 [Qemu-devel] [PATCH 0/2] target/arm: make use of new gvec expanders Richard Henderson
  2019-05-18 19:19 ` [Qemu-devel] [PATCH 1/2] target/arm: Vectorize USHL and SSHL Richard Henderson
@ 2019-05-18 19:19 ` Richard Henderson
  2019-05-23 12:46   ` Peter Maydell
  1 sibling, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2019-05-18 19:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

This replaces 3 target-specific implementations for BIT, BIF, and BSL.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/translate-a64.h |  2 +
 target/arm/translate.h     |  3 --
 target/arm/translate-a64.c | 15 ++++++--
 target/arm/translate.c     | 78 +++-----------------------------------
 4 files changed, 20 insertions(+), 78 deletions(-)

diff --git a/target/arm/translate-a64.h b/target/arm/translate-a64.h
index 63d958cf50..9569bc5963 100644
--- a/target/arm/translate-a64.h
+++ b/target/arm/translate-a64.h
@@ -122,5 +122,7 @@ typedef void GVecGen2iFn(unsigned, uint32_t, uint32_t, int64_t,
                          uint32_t, uint32_t);
 typedef void GVecGen3Fn(unsigned, uint32_t, uint32_t,
                         uint32_t, uint32_t, uint32_t);
+typedef void GVecGen4Fn(unsigned, uint32_t, uint32_t, uint32_t,
+                        uint32_t, uint32_t, uint32_t);
 
 #endif /* TARGET_ARM_TRANSLATE_A64_H */
diff --git a/target/arm/translate.h b/target/arm/translate.h
index f357b767cb..01ae454dcf 100644
--- a/target/arm/translate.h
+++ b/target/arm/translate.h
@@ -238,9 +238,6 @@ static inline void gen_ss_advance(DisasContext *s)
 }
 
 /* Vector operations shared between ARM and AArch64.  */
-extern const GVecGen3 bsl_op;
-extern const GVecGen3 bit_op;
-extern const GVecGen3 bif_op;
 extern const GVecGen3 mla_op[4];
 extern const GVecGen3 mls_op[4];
 extern const GVecGen3 cmtst_op[4];
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2c280243a9..955ab63ff8 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -704,6 +704,15 @@ static void gen_gvec_fn3(DisasContext *s, bool is_q, int rd, int rn, int rm,
             vec_full_reg_offset(s, rm), is_q ? 16 : 8, vec_full_reg_size(s));
 }
 
+/* Expand a 4-operand AdvSIMD vector operation using an expander function.  */
+static void gen_gvec_fn4(DisasContext *s, bool is_q, int rd, int rn, int rm,
+                         int rx, GVecGen4Fn *gvec_fn, int vece)
+{
+    gvec_fn(vece, vec_full_reg_offset(s, rd), vec_full_reg_offset(s, rn),
+            vec_full_reg_offset(s, rm), vec_full_reg_offset(s, rx),
+            is_q ? 16 : 8, vec_full_reg_size(s));
+}
+
 /* Expand a 2-operand + immediate AdvSIMD vector operation using
  * an op descriptor.
  */
@@ -10916,13 +10925,13 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
         return;
 
     case 5: /* BSL bitwise select */
-        gen_gvec_op3(s, is_q, rd, rn, rm, &bsl_op);
+        gen_gvec_fn4(s, is_q, rd, rd, rn, rm, tcg_gen_gvec_bitsel, 0);
         return;
     case 6: /* BIT, bitwise insert if true */
-        gen_gvec_op3(s, is_q, rd, rn, rm, &bit_op);
+        gen_gvec_fn4(s, is_q, rd, rm, rn, rd, tcg_gen_gvec_bitsel, 0);
         return;
     case 7: /* BIF, bitwise insert if false */
-        gen_gvec_op3(s, is_q, rd, rn, rm, &bif_op);
+        gen_gvec_fn4(s, is_q, rd, rm, rd, rn, tcg_gen_gvec_bitsel, 0);
         return;
 
     default:
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 49dfcdc90d..3abcae3a50 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -5755,72 +5755,6 @@ static int do_v81_helper(DisasContext *s, gen_helper_gvec_3_ptr *fn,
     return 1;
 }
 
-/*
- * Expanders for VBitOps_VBIF, VBIT, VBSL.
- */
-static void gen_bsl_i64(TCGv_i64 rd, TCGv_i64 rn, TCGv_i64 rm)
-{
-    tcg_gen_xor_i64(rn, rn, rm);
-    tcg_gen_and_i64(rn, rn, rd);
-    tcg_gen_xor_i64(rd, rm, rn);
-}
-
-static void gen_bit_i64(TCGv_i64 rd, TCGv_i64 rn, TCGv_i64 rm)
-{
-    tcg_gen_xor_i64(rn, rn, rd);
-    tcg_gen_and_i64(rn, rn, rm);
-    tcg_gen_xor_i64(rd, rd, rn);
-}
-
-static void gen_bif_i64(TCGv_i64 rd, TCGv_i64 rn, TCGv_i64 rm)
-{
-    tcg_gen_xor_i64(rn, rn, rd);
-    tcg_gen_andc_i64(rn, rn, rm);
-    tcg_gen_xor_i64(rd, rd, rn);
-}
-
-static void gen_bsl_vec(unsigned vece, TCGv_vec rd, TCGv_vec rn, TCGv_vec rm)
-{
-    tcg_gen_xor_vec(vece, rn, rn, rm);
-    tcg_gen_and_vec(vece, rn, rn, rd);
-    tcg_gen_xor_vec(vece, rd, rm, rn);
-}
-
-static void gen_bit_vec(unsigned vece, TCGv_vec rd, TCGv_vec rn, TCGv_vec rm)
-{
-    tcg_gen_xor_vec(vece, rn, rn, rd);
-    tcg_gen_and_vec(vece, rn, rn, rm);
-    tcg_gen_xor_vec(vece, rd, rd, rn);
-}
-
-static void gen_bif_vec(unsigned vece, TCGv_vec rd, TCGv_vec rn, TCGv_vec rm)
-{
-    tcg_gen_xor_vec(vece, rn, rn, rd);
-    tcg_gen_andc_vec(vece, rn, rn, rm);
-    tcg_gen_xor_vec(vece, rd, rd, rn);
-}
-
-const GVecGen3 bsl_op = {
-    .fni8 = gen_bsl_i64,
-    .fniv = gen_bsl_vec,
-    .prefer_i64 = TCG_TARGET_REG_BITS == 64,
-    .load_dest = true
-};
-
-const GVecGen3 bit_op = {
-    .fni8 = gen_bit_i64,
-    .fniv = gen_bit_vec,
-    .prefer_i64 = TCG_TARGET_REG_BITS == 64,
-    .load_dest = true
-};
-
-const GVecGen3 bif_op = {
-    .fni8 = gen_bif_i64,
-    .fniv = gen_bif_vec,
-    .prefer_i64 = TCG_TARGET_REG_BITS == 64,
-    .load_dest = true
-};
-
 static void gen_ssra8_i64(TCGv_i64 d, TCGv_i64 a, int64_t shift)
 {
     tcg_gen_vec_sar8i_i64(a, a, shift);
@@ -6830,16 +6764,16 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                                  vec_size, vec_size);
                 break;
             case 5: /* VBSL */
-                tcg_gen_gvec_3(rd_ofs, rn_ofs, rm_ofs,
-                               vec_size, vec_size, &bsl_op);
+                tcg_gen_gvec_bitsel(MO_8, rd_ofs, rd_ofs, rn_ofs, rm_ofs,
+                                    vec_size, vec_size);
                 break;
             case 6: /* VBIT */
-                tcg_gen_gvec_3(rd_ofs, rn_ofs, rm_ofs,
-                               vec_size, vec_size, &bit_op);
+                tcg_gen_gvec_bitsel(MO_8, rd_ofs, rm_ofs, rn_ofs, rd_ofs,
+                                    vec_size, vec_size);
                 break;
             case 7: /* VBIF */
-                tcg_gen_gvec_3(rd_ofs, rn_ofs, rm_ofs,
-                               vec_size, vec_size, &bif_op);
+                tcg_gen_gvec_bitsel(MO_8, rd_ofs, rm_ofs, rd_ofs, rn_ofs,
+                                    vec_size, vec_size);
                 break;
             }
             return 0;
-- 
2.17.1




* Re: [Qemu-devel] [PATCH 1/2] target/arm: Vectorize USHL and SSHL
  2019-05-18 19:19 ` [Qemu-devel] [PATCH 1/2] target/arm: Vectorize USHL and SSHL Richard Henderson
@ 2019-05-23 12:44   ` Peter Maydell
  2019-05-23 13:03     ` Peter Maydell
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2019-05-23 12:44 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Sat, 18 May 2019 at 20:19, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> These instructions shift left or right depending on the sign
> of the input, and 7 bits are significant to the shift.  This
> requires several masks and selects in addition to the actual
> shifts to form the complete answer.
>
> That said, the operation is still a small improvement even for
> two 64-bit elements -- 13 vector operations instead of 2 * 7
> integer operations.
>
> +void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
> +{
> +    TCGv_i32 lval = tcg_temp_new_i32();
> +    TCGv_i32 rval = tcg_temp_new_i32();
> +    TCGv_i32 lsh = tcg_temp_new_i32();
> +    TCGv_i32 rsh = tcg_temp_new_i32();
> +    TCGv_i32 zero = tcg_const_i32(0);
> +    TCGv_i32 max = tcg_const_i32(32);
> +
> +    /*
> +     * Perform possibly out of range shifts, trusting that the operation
> +     * does not trap.  Discard unused results after the fact.
> +     */

This comment reads to me like we're relying on a guarantee
that TCG doesn't make, but in fact the readme says it does:
out-of-range shifts are "unspecified behavior" which may give
bogus results but won't crash. Perhaps phrasing the comment
as "relying on the TCG guarantee that these are only
'unspecified behavior' and not 'undefined behavior' and so
won't crash" would be clearer ?

> +    tcg_gen_ext8s_i32(lsh, b);
> +    tcg_gen_neg_i32(rsh, lsh);
> +    tcg_gen_shl_i32(lval, a, lsh);
> +    tcg_gen_shr_i32(rval, a, rsh);

thanks
-- PMM



* Re: [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel
  2019-05-18 19:19 ` [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel Richard Henderson
@ 2019-05-23 12:46   ` Peter Maydell
  2019-05-23 13:02     ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2019-05-23 12:46 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Sat, 18 May 2019 at 20:19, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> This replaces 3 target-specific implementations for BIT, BIF, and BSL.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> @@ -10916,13 +10925,13 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
>          return;
>
>      case 5: /* BSL bitwise select */
> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bsl_op);
> +        gen_gvec_fn4(s, is_q, rd, rd, rn, rm, tcg_gen_gvec_bitsel, 0);
>          return;
>      case 6: /* BIT, bitwise insert if true */
> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bit_op);
> +        gen_gvec_fn4(s, is_q, rd, rm, rn, rd, tcg_gen_gvec_bitsel, 0);
>          return;
>      case 7: /* BIF, bitwise insert if false */
> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bif_op);
> +        gen_gvec_fn4(s, is_q, rd, rm, rd, rn, tcg_gen_gvec_bitsel, 0);
>          return;

We were previously doing different operations for these three
different instructions. Now we seem to always be doing the same
thing but with randomly reshuffled register arguments. How
does this work ?


thanks
-- PMM



* Re: [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel
  2019-05-23 12:46   ` Peter Maydell
@ 2019-05-23 13:02     ` Richard Henderson
  2019-05-23 13:08       ` Peter Maydell
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2019-05-23 13:02 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 5/23/19 8:46 AM, Peter Maydell wrote:
> On Sat, 18 May 2019 at 20:19, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> This replaces 3 target-specific implementations for BIT, BIF, and BSL.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> @@ -10916,13 +10925,13 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
>>          return;
>>
>>      case 5: /* BSL bitwise select */
>> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bsl_op);
>> +        gen_gvec_fn4(s, is_q, rd, rd, rn, rm, tcg_gen_gvec_bitsel, 0);
>>          return;
>>      case 6: /* BIT, bitwise insert if true */
>> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bit_op);
>> +        gen_gvec_fn4(s, is_q, rd, rm, rn, rd, tcg_gen_gvec_bitsel, 0);
>>          return;
>>      case 7: /* BIF, bitwise insert if false */
>> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bif_op);
>> +        gen_gvec_fn4(s, is_q, rd, rm, rd, rn, tcg_gen_gvec_bitsel, 0);
>>          return;
> 
> We were previously doing different operations for these three
> different instructions. Now we seem to always be doing the same
> thing but with randomly reshuffled register arguments. How
> does this work ?

Because the three different instructions perform the same operation with
reshuffled register arguments.

I'm not sure why we were using different operations in the beginning.  Possibly
because those formulations do not require a temporary?  Possibly that's how
they're written in the ARM ARM pseudocode?
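
Concretely, a per-bit sketch of the identity (operand names follow the
old i64 expanders that this patch removes):

    /* With bitsel(sel, x, y) = (x & sel) | (y & ~sel):
     *   BSL: rd = rm ^ ((rn ^ rm) & rd)
     *           = (rn & rd) | (rm & ~rd) = bitsel(rd, rn, rm)
     *   BIT: rd = rd ^ ((rn ^ rd) & rm)
     *           = (rn & rm) | (rd & ~rm) = bitsel(rm, rn, rd)
     *   BIF: rd = rd ^ ((rn ^ rd) & ~rm)
     *           = (rd & rm) | (rn & ~rm) = bitsel(rm, rd, rn)
     */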


r~



* Re: [Qemu-devel] [PATCH 1/2] target/arm: Vectorize USHL and SSHL
  2019-05-23 12:44   ` Peter Maydell
@ 2019-05-23 13:03     ` Peter Maydell
  2019-06-03 17:23       ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2019-05-23 13:03 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Thu, 23 May 2019 at 13:44, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Sat, 18 May 2019 at 20:19, Richard Henderson
> <richard.henderson@linaro.org> wrote:
> >
> > These instructions shift left or right depending on the sign
> > of the input, and 7 bits are significant to the shift.  This
> > requires several masks and selects in addition to the actual
> > shifts to form the complete answer.
> >
> > That said, the operation is still a small improvement even for
> > two 64-bit elements -- 13 vector operations instead of 2 * 7
> > integer operations.
> >
> > +void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
> > +{
> > +    TCGv_i32 lval = tcg_temp_new_i32();
> > +    TCGv_i32 rval = tcg_temp_new_i32();
> > +    TCGv_i32 lsh = tcg_temp_new_i32();
> > +    TCGv_i32 rsh = tcg_temp_new_i32();
> > +    TCGv_i32 zero = tcg_const_i32(0);
> > +    TCGv_i32 max = tcg_const_i32(32);
> > +
> > +    /*
> > +     * Perform possibly out of range shifts, trusting that the operation
> > +     * does not trap.  Discard unused results after the fact.
> > +     */
>
> This comment reads to me like we're relying on a guarantee
> that TCG doesn't make, but in fact the readme says it does:
> out-of-range shifts are "unspecified behavior" which may give
> bogus results but won't crash. Perhaps phrasing the comment
> as "relying on the TCG guarantee that these are only
> 'unspecified behavior' and not 'undefined behavior' and so
> won't crash" would be clearer ?

I had a look through the rest of the patch, but there is
too much code here and I don't have enough context to
figure out how all the various new gvec helpers are
called and what jobs they are doing compared to the
actual instruction operation. Maybe I'll have another try later.

Why can we get rid of the 8-bit, 32-bit and 64-bit old shift
helpers, but not 16-bit ?

thanks
-- PMM



* Re: [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel
  2019-05-23 13:02     ` Richard Henderson
@ 2019-05-23 13:08       ` Peter Maydell
  2019-05-23 13:16         ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2019-05-23 13:08 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Thu, 23 May 2019 at 14:02, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/23/19 8:46 AM, Peter Maydell wrote:
> > On Sat, 18 May 2019 at 20:19, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> This replaces 3 target-specific implementations for BIT, BIF, and BSL.
> >>
> >> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> >> @@ -10916,13 +10925,13 @@ static void disas_simd_3same_logic(DisasContext *s, uint32_t insn)
> >>          return;
> >>
> >>      case 5: /* BSL bitwise select */
> >> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bsl_op);
> >> +        gen_gvec_fn4(s, is_q, rd, rd, rn, rm, tcg_gen_gvec_bitsel, 0);
> >>          return;
> >>      case 6: /* BIT, bitwise insert if true */
> >> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bit_op);
> >> +        gen_gvec_fn4(s, is_q, rd, rm, rn, rd, tcg_gen_gvec_bitsel, 0);
> >>          return;
> >>      case 7: /* BIF, bitwise insert if false */
> >> -        gen_gvec_op3(s, is_q, rd, rn, rm, &bif_op);
> >> +        gen_gvec_fn4(s, is_q, rd, rm, rd, rn, tcg_gen_gvec_bitsel, 0);
> >>          return;
> >
> > We were previously doing different operations for these three
> > different instructions. Now we seem to always be doing the same
> > thing but with randomly reshuffled register arguments. How
> > does this work ?
>
> Because the three different instructions perform the same operation with
> reshuffled register arguments.

Ah, so they do. Next question, how do I find out what the
order of arguments in the above code means so I can compare
it against the pseudocode expression we're implementing?

thanks
-- PMM



* Re: [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel
  2019-05-23 13:08       ` Peter Maydell
@ 2019-05-23 13:16         ` Richard Henderson
  2019-05-23 13:30           ` Peter Maydell
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2019-05-23 13:16 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 5/23/19 9:08 AM, Peter Maydell wrote:
>> Because the three different instructions perform the same operation with
>> reshuffled register arguments.
> 
> Ah, so they do. Next question, how do I find out what the
> order of arguments in the above code means so I can compare
> it against the pseudocode expression we're implementing?

From tcg/README:

* bitsel_vec v0, v1, v2, v3

  Bitwise select, v0 = (v2 & v1) | (v3 & ~v1), across the entire vector.

The "selector" is second, the first input operand.
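
As a sketch, the mapping to the gen_gvec_fn4 calls in the patch
(destination first, then the three inputs in bitsel order):

    /* bitsel(sel, x, y) = (x & sel) | (y & ~sel)
     * BSL: d = (n & d) | (m & ~d)  ->  bitsel(d, n, m)  ->  rd, rd, rn, rm
     * BIT: d = (n & m) | (d & ~m)  ->  bitsel(m, n, d)  ->  rd, rm, rn, rd
     * BIF: d = (d & m) | (n & ~m)  ->  bitsel(m, d, n)  ->  rd, rm, rd, rn
     */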


r~



* Re: [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel
  2019-05-23 13:16         ` Richard Henderson
@ 2019-05-23 13:30           ` Peter Maydell
  2019-06-03 18:15             ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2019-05-23 13:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Thu, 23 May 2019 at 14:16, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/23/19 9:08 AM, Peter Maydell wrote:
> >> Because the three different instructions perform the same operation with
> >> reshuffled register arguments.
> >
> > Ah, so they do. Next question, how do I find out what the
> > order of arguments in the above code means so I can compare
> > it against the pseudocode expression we're implementing?
>
> >From tcg/README:
>
> * bitsel_vec v0, v1, v2, v3
>
>   Bitwise select, v0 = (v2 & v1) | (v3 & ~v1), across the entire vector.
>
> The "selector" is second, the first input operand.

Oh, this series is based on another patchset.

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM



* Re: [Qemu-devel] [PATCH 1/2] target/arm: Vectorize USHL and SSHL
  2019-05-23 13:03     ` Peter Maydell
@ 2019-06-03 17:23       ` Richard Henderson
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Henderson @ 2019-06-03 17:23 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 5/23/19 8:03 AM, Peter Maydell wrote:
> On Thu, 23 May 2019 at 13:44, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Sat, 18 May 2019 at 20:19, Richard Henderson
>> <richard.henderson@linaro.org> wrote:
>>>
>>> These instructions shift left or right depending on the sign
>>> of the input, and 7 bits are significant to the shift.  This
>>> requires several masks and selects in addition to the actual
>>> shifts to form the complete answer.
>>>
>>> That said, the operation is still a small improvement even for
>>> two 64-bit elements -- 13 vector operations instead of 2 * 7
>>> integer operations.
>>>
>>> +void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
>>> +{
>>> +    TCGv_i32 lval = tcg_temp_new_i32();
>>> +    TCGv_i32 rval = tcg_temp_new_i32();
>>> +    TCGv_i32 lsh = tcg_temp_new_i32();
>>> +    TCGv_i32 rsh = tcg_temp_new_i32();
>>> +    TCGv_i32 zero = tcg_const_i32(0);
>>> +    TCGv_i32 max = tcg_const_i32(32);
>>> +
>>> +    /*
>>> +     * Perform possibly out of range shifts, trusting that the operation
>>> +     * does not trap.  Discard unused results after the fact.
>>> +     */
>>
>> This comment reads to me like we're relying on a guarantee
>> that TCG doesn't make, but in fact the readme says it does:
>> out-of-range shifts are "unspecified behavior" which may give
>> bogus results but won't crash. Perhaps phrasing the comment
>> as "relying on the TCG guarantee that these are only
>> 'unspecified behavior' and not 'undefined behavior' and so
>> won't crash" would be clearer ?

I've adjusted the comment along these lines.

> I had a look through the rest of the patch, but there is
> too much code here and I don't have enough context to
> figure out how all the various new gvec helpers are
> called and what jobs they are doing compared to the
> actual instruction operation. Maybe I'll have another try later.

If the host supports vectors, then the .fniv expander will be called.
Otherwise, .fni4 or .fni8 will be used to expand with 32-bit or 64-bit
integers.  Finally, the .fno expander calls an out-of-line helper.
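
For example, the MO_32 entry added in patch 1 (a sketch with the field
roles spelled out; the actual dispatch lives in tcg/tcg-op-gvec.c):

    { .fni4 = gen_ushl_i32,            /* expansion using i32 integer ops */
      .fniv = gen_ushl_vec,            /* expansion using host vector ops */
      .fno  = gen_helper_gvec_ushl_s,  /* out-of-line helper fallback */
      .opt_opc = ushl_list,            /* vector opcodes .fniv requires */
      .vece = MO_32 },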

Not strictly kosher perhaps, but in some places we Know that MAX_UNROLL is set
to 4 in tcg/tcg-op-gvec.c, and that since AdvSIMD uses 128-bit vectors, 4 *
32-bit will always be expanded inline, and so omit the out-of-line helper.

I'm not sure why I didn't do this here, since this is not one of the cases in
which we could share helpers with SVE.  (SVE dropped the right-shift as
negative shift count thing.)

> Why can we get rid of the 8-bit, 32-bit and 64-bit old shift
> helpers, but not 16-bit ?

It's still used by gen_neon_shift_narrow.


r~



* Re: [Qemu-devel] [PATCH 2/2] target/arm: Use tcg_gen_gvec_bitsel
  2019-05-23 13:30           ` Peter Maydell
@ 2019-06-03 18:15             ` Richard Henderson
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Henderson @ 2019-06-03 18:15 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 5/23/19 8:30 AM, Peter Maydell wrote:
> On Thu, 23 May 2019 at 14:16, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 5/23/19 9:08 AM, Peter Maydell wrote:
>>>> Because the three different instructions perform the same operation with
>>>> reshuffled register arguments.
>>>
>>> Ah, so they do. Next question, how do I find out what the
>>> order of arguments in the above code means so I can compare
>>> it against the pseudocode expression we're implementing?
>>
>> From tcg/README:
>>
>> * bitsel_vec v0, v1, v2, v3
>>
>>   Bitwise select, v0 = (v2 & v1) | (v3 & ~v1), across the entire vector.
>>
>> The "selector" is second, the first input operand.
> 
> Oh, this series is based on another patchset.
> 
> Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

The prerequisite for this patch is now in master.


r~



