* [PATCH 0/4] target/arm vector improvements
@ 2019-10-17  4:42 Richard Henderson
  2019-10-17  4:42 ` [PATCH 1/4] target/arm: Vectorize USHL and SSHL Richard Henderson
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Richard Henderson @ 2019-10-17  4:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

The first patch has been seen before.

  https://patchwork.ozlabs.org/patch/1115039/

It had a bug that I didn't fix right away and then forgot about.
It's fixed now; I had mixed up the operand ordering for aarch32.

The next 3 are something that I noticed while doing other stuff.

In particular, pmull is used heavily during https transfers.
While cloning a repository, the old code peaks at 27% of the
total runtime, as measured by perf top.  The new code does not
quite reach 3% when repeating the same clone.
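
(For reference, the measurement is nothing fancier than perf top
pointed at the qemu process while the clone runs in the guest,
along the lines of -- assuming a system-mode guest:

    perf top -p $(pidof qemu-system-aarch64)

and reading off the hottest helper symbols.)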

In addition, the new helper functions are in the form that
will be required for the implementation of SVE2.

The comment in patch 2 about ARMv8.4-DIT is perhaps a stretch,
but re-reading the pmull instruction description in the current
ARM ARM brought it to mind.

Since TCG is officially not in the security domain, it's
probably not a bug to claim to support DIT without actually
doing anything to ensure that the algorithms used are in fact
timing-independent of the data.

On the other hand, I expect the bit distribution of data
going through these sorts of hashing algorithms to approach
50% 1's and 0's, so I also don't think we gain anything on
average by terminating the loop early.
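
For concreteness, the two loop shapes for a single 8-bit lane
look like this (a sketch only, with made-up names; the helpers
in patch 2 process eight such lanes per uint64_t):

    /* Early exit: the iteration count depends on the bits of a. */
    static uint8_t pmul8_early(uint8_t a, uint8_t b)
    {
        uint8_t r = 0;

        while (a) {
            if (a & 1) {
                r ^= b;
            }
            a >>= 1;
            b <<= 1;
        }
        return r;
    }

    /* Counted: always 8 iterations, whatever the data. */
    static uint8_t pmul8_counted(uint8_t a, uint8_t b)
    {
        uint8_t r = 0;
        int i;

        for (i = 0; i < 8; ++i) {
            uint8_t mask = -(a & 1);    /* 0x00 or 0xff */

            r ^= b & mask;
            a >>= 1;
            b <<= 1;
        }
        return r;
    }

Even the counted form, of course, is only as data-independent
as the host's shift and logical instructions.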

Thoughts on DIT specifically?


r~


Richard Henderson (4):
  target/arm: Vectorize USHL and SSHL
  target/arm: Convert PMUL.8 to gvec
  target/arm: Convert PMULL.64 to gvec
  target/arm: Convert PMULL.8 to gvec

 target/arm/helper-sve.h    |   2 +
 target/arm/helper.h        |  21 ++-
 target/arm/translate.h     |   6 +
 target/arm/neon_helper.c   | 117 -------------
 target/arm/translate-a64.c |  83 ++++-----
 target/arm/translate.c     | 350 ++++++++++++++++++++++++++++++++-----
 target/arm/vec_helper.c    | 211 ++++++++++++++++++++++
 7 files changed, 562 insertions(+), 228 deletions(-)

-- 
2.17.1




* [PATCH 1/4] target/arm: Vectorize USHL and SSHL
  2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
@ 2019-10-17  4:42 ` Richard Henderson
  2019-10-17 16:01   ` Alex Bennée
  2019-10-17  4:42 ` [PATCH 2/4] target/arm: Convert PMUL.8 to gvec Richard Henderson
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2019-10-17  4:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

These instructions shift left or right depending on the sign
of the input, and 7 bits are significant to the shift.  This
requires several masks and selects in addition to the actual
shifts to form the complete answer.

That said, the operation is still a small improvement even for
two 64-bit elements -- 13 vector operations instead of 2 * 7
integer operations.
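
For reference, a single unsigned 8-bit lane behaves as below
(a sketch with a made-up name; the signed case differs in that
it uses an arithmetic right shift and a negative out-of-range
shift reduces to the sign bit, i.e. the right-shift count
saturates at 7):

    static uint8_t ushl8(uint8_t n, uint8_t m)
    {
        int8_t sh = (int8_t)m;   /* shift count: low byte, as signed */

        if (sh >= 8 || sh <= -8) {
            return 0;
        }
        return sh < 0 ? n >> -sh : n << sh;
    }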

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
v2: Fix operand ordering for aa32 VSHL.
---
 target/arm/helper.h        |  11 +-
 target/arm/translate.h     |   6 +
 target/arm/neon_helper.c   |  33 ----
 target/arm/translate-a64.c |  18 +--
 target/arm/translate.c     | 301 +++++++++++++++++++++++++++++++++++--
 target/arm/vec_helper.c    |  88 +++++++++++
 6 files changed, 391 insertions(+), 66 deletions(-)

diff --git a/target/arm/helper.h b/target/arm/helper.h
index 1fb2cb5a77..fc0d594a14 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -296,14 +296,8 @@ DEF_HELPER_2(neon_abd_s16, i32, i32, i32)
 DEF_HELPER_2(neon_abd_u32, i32, i32, i32)
 DEF_HELPER_2(neon_abd_s32, i32, i32, i32)
 
-DEF_HELPER_2(neon_shl_u8, i32, i32, i32)
-DEF_HELPER_2(neon_shl_s8, i32, i32, i32)
 DEF_HELPER_2(neon_shl_u16, i32, i32, i32)
 DEF_HELPER_2(neon_shl_s16, i32, i32, i32)
-DEF_HELPER_2(neon_shl_u32, i32, i32, i32)
-DEF_HELPER_2(neon_shl_s32, i32, i32, i32)
-DEF_HELPER_2(neon_shl_u64, i64, i64, i64)
-DEF_HELPER_2(neon_shl_s64, i64, i64, i64)
 DEF_HELPER_2(neon_rshl_u8, i32, i32, i32)
 DEF_HELPER_2(neon_rshl_s8, i32, i32, i32)
 DEF_HELPER_2(neon_rshl_u16, i32, i32, i32)
@@ -690,6 +684,11 @@ DEF_HELPER_FLAGS_2(frint64_s, TCG_CALL_NO_RWG, f32, f32, ptr)
 DEF_HELPER_FLAGS_2(frint32_d, TCG_CALL_NO_RWG, f64, f64, ptr)
 DEF_HELPER_FLAGS_2(frint64_d, TCG_CALL_NO_RWG, f64, f64, ptr)
 
+DEF_HELPER_FLAGS_4(gvec_sshl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_sshl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #include "helper-sve.h"
diff --git a/target/arm/translate.h b/target/arm/translate.h
index dd24f91f26..0c4e6e4bbd 100644
--- a/target/arm/translate.h
+++ b/target/arm/translate.h
@@ -274,6 +274,8 @@ uint64_t vfp_expand_imm(int size, uint8_t imm8);
 extern const GVecGen3 mla_op[4];
 extern const GVecGen3 mls_op[4];
 extern const GVecGen3 cmtst_op[4];
+extern const GVecGen3 sshl_op[4];
+extern const GVecGen3 ushl_op[4];
 extern const GVecGen2i ssra_op[4];
 extern const GVecGen2i usra_op[4];
 extern const GVecGen2i sri_op[4];
@@ -283,6 +285,10 @@ extern const GVecGen4 sqadd_op[4];
 extern const GVecGen4 uqsub_op[4];
 extern const GVecGen4 sqsub_op[4];
 void gen_cmtst_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
+void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b);
+void gen_sshl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b);
+void gen_ushl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
+void gen_sshl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
 
 /*
  * Forward to the isar_feature_* tests given a DisasContext pointer.
diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
index 4259056723..c581ffb7d3 100644
--- a/target/arm/neon_helper.c
+++ b/target/arm/neon_helper.c
@@ -615,24 +615,9 @@ NEON_VOP(abd_u32, neon_u32, 1)
     } else { \
         dest = src1 << tmp; \
     }} while (0)
-NEON_VOP(shl_u8, neon_u8, 4)
 NEON_VOP(shl_u16, neon_u16, 2)
-NEON_VOP(shl_u32, neon_u32, 1)
 #undef NEON_FN
 
-uint64_t HELPER(neon_shl_u64)(uint64_t val, uint64_t shiftop)
-{
-    int8_t shift = (int8_t)shiftop;
-    if (shift >= 64 || shift <= -64) {
-        val = 0;
-    } else if (shift < 0) {
-        val >>= -shift;
-    } else {
-        val <<= shift;
-    }
-    return val;
-}
-
 #define NEON_FN(dest, src1, src2) do { \
     int8_t tmp; \
     tmp = (int8_t)src2; \
@@ -645,27 +630,9 @@ uint64_t HELPER(neon_shl_u64)(uint64_t val, uint64_t shiftop)
     } else { \
         dest = src1 << tmp; \
     }} while (0)
-NEON_VOP(shl_s8, neon_s8, 4)
 NEON_VOP(shl_s16, neon_s16, 2)
-NEON_VOP(shl_s32, neon_s32, 1)
 #undef NEON_FN
 
-uint64_t HELPER(neon_shl_s64)(uint64_t valop, uint64_t shiftop)
-{
-    int8_t shift = (int8_t)shiftop;
-    int64_t val = valop;
-    if (shift >= 64) {
-        val = 0;
-    } else if (shift <= -64) {
-        val >>= 63;
-    } else if (shift < 0) {
-        val >>= -shift;
-    } else {
-        val <<= shift;
-    }
-    return val;
-}
-
 #define NEON_FN(dest, src1, src2) do { \
     int8_t tmp; \
     tmp = (int8_t)src2; \
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2d6cd09634..255a168df6 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -8685,9 +8685,9 @@ static void handle_3same_64(DisasContext *s, int opcode, bool u,
         break;
     case 0x8: /* SSHL, USHL */
         if (u) {
-            gen_helper_neon_shl_u64(tcg_rd, tcg_rn, tcg_rm);
+            gen_ushl_i64(tcg_rd, tcg_rn, tcg_rm);
         } else {
-            gen_helper_neon_shl_s64(tcg_rd, tcg_rn, tcg_rm);
+            gen_sshl_i64(tcg_rd, tcg_rn, tcg_rm);
         }
         break;
     case 0x9: /* SQSHL, UQSHL */
@@ -11082,6 +11082,10 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
                        is_q ? 16 : 8, vec_full_reg_size(s),
                        (u ? uqsub_op : sqsub_op) + size);
         return;
+    case 0x08: /* SSHL, USHL */
+        gen_gvec_op3(s, is_q, rd, rn, rm,
+                     u ? &ushl_op[size] : &sshl_op[size]);
+        return;
     case 0x0c: /* SMAX, UMAX */
         if (u) {
             gen_gvec_fn3(s, is_q, rd, rn, rm, tcg_gen_gvec_umax, size);
@@ -11197,16 +11201,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
                 genfn = fns[size][u];
                 break;
             }
-            case 0x8: /* SSHL, USHL */
-            {
-                static NeonGenTwoOpFn * const fns[3][2] = {
-                    { gen_helper_neon_shl_s8, gen_helper_neon_shl_u8 },
-                    { gen_helper_neon_shl_s16, gen_helper_neon_shl_u16 },
-                    { gen_helper_neon_shl_s32, gen_helper_neon_shl_u32 },
-                };
-                genfn = fns[size][u];
-                break;
-            }
             case 0x9: /* SQSHL, UQSHL */
             {
                 static NeonGenTwoOpEnvFn * const fns[3][2] = {
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 698c594e8c..598bb1cc00 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -3580,13 +3580,13 @@ static inline void gen_neon_shift_narrow(int size, TCGv_i32 var, TCGv_i32 shift,
         if (u) {
             switch (size) {
             case 1: gen_helper_neon_shl_u16(var, var, shift); break;
-            case 2: gen_helper_neon_shl_u32(var, var, shift); break;
+            case 2: gen_ushl_i32(var, var, shift); break;
             default: abort();
             }
         } else {
             switch (size) {
             case 1: gen_helper_neon_shl_s16(var, var, shift); break;
-            case 2: gen_helper_neon_shl_s32(var, var, shift); break;
+            case 2: gen_sshl_i32(var, var, shift); break;
             default: abort();
             }
         }
@@ -4389,6 +4389,282 @@ const GVecGen3 cmtst_op[4] = {
       .vece = MO_64 },
 };
 
+void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
+{
+    TCGv_i32 lval = tcg_temp_new_i32();
+    TCGv_i32 rval = tcg_temp_new_i32();
+    TCGv_i32 lsh = tcg_temp_new_i32();
+    TCGv_i32 rsh = tcg_temp_new_i32();
+    TCGv_i32 zero = tcg_const_i32(0);
+    TCGv_i32 max = tcg_const_i32(32);
+
+    /*
+     * Rely on the TCG guarantee that out of range shifts produce
+     * unspecified results, not undefined behaviour (i.e. no trap).
+     * Discard out-of-range results after the fact.
+     */
+    tcg_gen_ext8s_i32(lsh, b);
+    tcg_gen_neg_i32(rsh, lsh);
+    tcg_gen_shl_i32(lval, a, lsh);
+    tcg_gen_shr_i32(rval, a, rsh);
+    tcg_gen_movcond_i32(TCG_COND_LTU, d, lsh, max, lval, zero);
+    tcg_gen_movcond_i32(TCG_COND_LTU, d, rsh, max, rval, d);
+
+    tcg_temp_free_i32(lval);
+    tcg_temp_free_i32(rval);
+    tcg_temp_free_i32(lsh);
+    tcg_temp_free_i32(rsh);
+    tcg_temp_free_i32(zero);
+    tcg_temp_free_i32(max);
+}
+
+void gen_ushl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 lval = tcg_temp_new_i64();
+    TCGv_i64 rval = tcg_temp_new_i64();
+    TCGv_i64 lsh = tcg_temp_new_i64();
+    TCGv_i64 rsh = tcg_temp_new_i64();
+    TCGv_i64 zero = tcg_const_i64(0);
+    TCGv_i64 max = tcg_const_i64(64);
+
+    /*
+     * Rely on the TCG guarantee that out of range shifts produce
+     * unspecified results, not undefined behaviour (i.e. no trap).
+     * Discard out-of-range results after the fact.
+     */
+    tcg_gen_ext8s_i64(lsh, b);
+    tcg_gen_neg_i64(rsh, lsh);
+    tcg_gen_shl_i64(lval, a, lsh);
+    tcg_gen_shr_i64(rval, a, rsh);
+    tcg_gen_movcond_i64(TCG_COND_LTU, d, lsh, max, lval, zero);
+    tcg_gen_movcond_i64(TCG_COND_LTU, d, rsh, max, rval, d);
+
+    tcg_temp_free_i64(lval);
+    tcg_temp_free_i64(rval);
+    tcg_temp_free_i64(lsh);
+    tcg_temp_free_i64(rsh);
+    tcg_temp_free_i64(zero);
+    tcg_temp_free_i64(max);
+}
+
+static void gen_ushl_vec(unsigned vece, TCGv_vec d, TCGv_vec a, TCGv_vec b)
+{
+    TCGv_vec lval = tcg_temp_new_vec_matching(d);
+    TCGv_vec rval = tcg_temp_new_vec_matching(d);
+    TCGv_vec lsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec rsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec msk, max;
+
+    /*
+     * Rely on the TCG guarantee that out of range shifts produce
+     * unspecified results, not undefined behaviour (i.e. no trap).
+     * Discard out-of-range results after the fact.
+     */
+    tcg_gen_neg_vec(vece, rsh, b);
+    if (vece == MO_8) {
+        tcg_gen_mov_vec(lsh, b);
+    } else {
+        msk = tcg_temp_new_vec_matching(d);
+        tcg_gen_dupi_vec(vece, msk, 0xff);
+        tcg_gen_and_vec(vece, lsh, b, msk);
+        tcg_gen_and_vec(vece, rsh, rsh, msk);
+        tcg_temp_free_vec(msk);
+    }
+
+    /*
+     * Perform possibly out of range shifts, trusting that the operation
+     * does not trap.  Discard unused results after the fact.
+     */
+    tcg_gen_shlv_vec(vece, lval, a, lsh);
+    tcg_gen_shrv_vec(vece, rval, a, rsh);
+
+    max = tcg_temp_new_vec_matching(d);
+    tcg_gen_dupi_vec(vece, max, 8 << vece);
+
+    /*
+     * The choice of LT (signed) and GEU (unsigned) is biased toward
+     * the instructions of the x86_64 host.  For MO_8, the whole byte
+     * is significant so we must use an unsigned compare; otherwise we
+     * have already masked to a byte and so a signed compare works.
+     * Other tcg hosts have a full set of comparisons and do not care.
+     */
+    if (vece == MO_8) {
+        tcg_gen_cmp_vec(TCG_COND_GEU, vece, lsh, lsh, max);
+        tcg_gen_cmp_vec(TCG_COND_GEU, vece, rsh, rsh, max);
+        tcg_gen_andc_vec(vece, lval, lval, lsh);
+        tcg_gen_andc_vec(vece, rval, rval, rsh);
+    } else {
+        tcg_gen_cmp_vec(TCG_COND_LT, vece, lsh, lsh, max);
+        tcg_gen_cmp_vec(TCG_COND_LT, vece, rsh, rsh, max);
+        tcg_gen_and_vec(vece, lval, lval, lsh);
+        tcg_gen_and_vec(vece, rval, rval, rsh);
+    }
+    tcg_gen_or_vec(vece, d, lval, rval);
+
+    tcg_temp_free_vec(max);
+    tcg_temp_free_vec(lval);
+    tcg_temp_free_vec(rval);
+    tcg_temp_free_vec(lsh);
+    tcg_temp_free_vec(rsh);
+}
+
+static const TCGOpcode ushl_list[] = {
+    INDEX_op_neg_vec, INDEX_op_shlv_vec,
+    INDEX_op_shrv_vec, INDEX_op_cmp_vec, 0
+};
+
+const GVecGen3 ushl_op[4] = {
+    { .fniv = gen_ushl_vec,
+      .fno = gen_helper_gvec_ushl_b,
+      .opt_opc = ushl_list,
+      .vece = MO_8 },
+    { .fniv = gen_ushl_vec,
+      .fno = gen_helper_gvec_ushl_h,
+      .opt_opc = ushl_list,
+      .vece = MO_16 },
+    { .fni4 = gen_ushl_i32,
+      .fniv = gen_ushl_vec,
+      .opt_opc = ushl_list,
+      .vece = MO_32 },
+    { .fni8 = gen_ushl_i64,
+      .fniv = gen_ushl_vec,
+      .opt_opc = ushl_list,
+      .vece = MO_64 },
+};
+
+void gen_sshl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
+{
+    TCGv_i32 lval = tcg_temp_new_i32();
+    TCGv_i32 rval = tcg_temp_new_i32();
+    TCGv_i32 lsh = tcg_temp_new_i32();
+    TCGv_i32 rsh = tcg_temp_new_i32();
+    TCGv_i32 zero = tcg_const_i32(0);
+    TCGv_i32 max = tcg_const_i32(31);
+
+    /*
+     * Rely on the TCG guarantee that out of range shifts produce
+     * unspecified results, not undefined behaviour (i.e. no trap).
+     * Discard out-of-range results after the fact.
+     */
+    tcg_gen_ext8s_i32(lsh, b);
+    tcg_gen_neg_i32(rsh, lsh);
+    tcg_gen_shl_i32(lval, a, lsh);
+    tcg_gen_umin_i32(rsh, rsh, max);
+    tcg_gen_sar_i32(rval, a, rsh);
+    tcg_gen_movcond_i32(TCG_COND_LEU, lval, lsh, max, lval, zero);
+    tcg_gen_movcond_i32(TCG_COND_LT, d, lsh, zero, rval, lval);
+
+    tcg_temp_free_i32(lval);
+    tcg_temp_free_i32(rval);
+    tcg_temp_free_i32(lsh);
+    tcg_temp_free_i32(rsh);
+    tcg_temp_free_i32(zero);
+    tcg_temp_free_i32(max);
+}
+
+void gen_sshl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
+{
+    TCGv_i64 lval = tcg_temp_new_i64();
+    TCGv_i64 rval = tcg_temp_new_i64();
+    TCGv_i64 lsh = tcg_temp_new_i64();
+    TCGv_i64 rsh = tcg_temp_new_i64();
+    TCGv_i64 zero = tcg_const_i64(0);
+    TCGv_i64 max = tcg_const_i64(63);
+
+    /*
+     * Rely on the TCG guarantee that out of range shifts produce
+     * unspecified results, not undefined behaviour (i.e. no trap).
+     * Discard out-of-range results after the fact.
+     */
+    tcg_gen_ext8s_i64(lsh, b);
+    tcg_gen_neg_i64(rsh, lsh);
+    tcg_gen_shl_i64(lval, a, lsh);
+    tcg_gen_umin_i64(rsh, rsh, max);
+    tcg_gen_sar_i64(rval, a, rsh);
+    tcg_gen_movcond_i64(TCG_COND_LEU, lval, lsh, max, lval, zero);
+    tcg_gen_movcond_i64(TCG_COND_LT, d, lsh, zero, rval, lval);
+
+    tcg_temp_free_i64(lval);
+    tcg_temp_free_i64(rval);
+    tcg_temp_free_i64(lsh);
+    tcg_temp_free_i64(rsh);
+    tcg_temp_free_i64(zero);
+    tcg_temp_free_i64(max);
+}
+
+static void gen_sshl_vec(unsigned vece, TCGv_vec d, TCGv_vec a, TCGv_vec b)
+{
+    TCGv_vec lval = tcg_temp_new_vec_matching(d);
+    TCGv_vec rval = tcg_temp_new_vec_matching(d);
+    TCGv_vec lsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec rsh = tcg_temp_new_vec_matching(d);
+    TCGv_vec tmp = tcg_temp_new_vec_matching(d);
+
+    /*
+     * Rely on the TCG guarantee that out of range shifts produce
+     * unspecified results, not undefined behaviour (i.e. no trap).
+     * Discard out-of-range results after the fact.
+     */
+    tcg_gen_neg_vec(vece, rsh, b);
+    if (vece == MO_8) {
+        tcg_gen_mov_vec(lsh, b);
+    } else {
+        tcg_gen_dupi_vec(vece, tmp, 0xff);
+        tcg_gen_and_vec(vece, lsh, b, tmp);
+        tcg_gen_and_vec(vece, rsh, rsh, tmp);
+    }
+
+    /* Bound rsh so out of bound right shift gets -1.  */
+    tcg_gen_dupi_vec(vece, tmp, (8 << vece) - 1);
+    tcg_gen_umin_vec(vece, rsh, rsh, tmp);
+    tcg_gen_cmp_vec(TCG_COND_GT, vece, tmp, lsh, tmp);
+
+    tcg_gen_shlv_vec(vece, lval, a, lsh);
+    tcg_gen_sarv_vec(vece, rval, a, rsh);
+
+    /* Select in-bound left shift.  */
+    tcg_gen_andc_vec(vece, lval, lval, tmp);
+
+    /* Select between left and right shift.  */
+    if (vece == MO_8) {
+        tcg_gen_dupi_vec(vece, tmp, 0);
+        tcg_gen_cmpsel_vec(TCG_COND_LT, vece, d, lsh, tmp, rval, lval);
+    } else {
+        tcg_gen_dupi_vec(vece, tmp, 0x80);
+        tcg_gen_cmpsel_vec(TCG_COND_LT, vece, d, lsh, tmp, lval, rval);
+    }
+
+    tcg_temp_free_vec(lval);
+    tcg_temp_free_vec(rval);
+    tcg_temp_free_vec(lsh);
+    tcg_temp_free_vec(rsh);
+    tcg_temp_free_vec(tmp);
+}
+
+static const TCGOpcode sshl_list[] = {
+    INDEX_op_neg_vec, INDEX_op_umin_vec, INDEX_op_shlv_vec,
+    INDEX_op_sarv_vec, INDEX_op_cmp_vec, INDEX_op_cmpsel_vec, 0
+};
+
+const GVecGen3 sshl_op[4] = {
+    { .fniv = gen_sshl_vec,
+      .fno = gen_helper_gvec_sshl_b,
+      .opt_opc = sshl_list,
+      .vece = MO_8 },
+    { .fniv = gen_sshl_vec,
+      .fno = gen_helper_gvec_sshl_h,
+      .opt_opc = sshl_list,
+      .vece = MO_16 },
+    { .fni4 = gen_sshl_i32,
+      .fniv = gen_sshl_vec,
+      .opt_opc = sshl_list,
+      .vece = MO_32 },
+    { .fni8 = gen_sshl_i64,
+      .fniv = gen_sshl_vec,
+      .opt_opc = sshl_list,
+      .vece = MO_64 },
+};
+
 static void gen_uqadd_vec(unsigned vece, TCGv_vec t, TCGv_vec sat,
                           TCGv_vec a, TCGv_vec b)
 {
@@ -4792,6 +5068,12 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                                   vec_size, vec_size);
             }
             return 0;
+
+        case NEON_3R_VSHL:
+            /* Note the operation is vshl vd,vm,vn */
+            tcg_gen_gvec_3(rd_ofs, rm_ofs, rn_ofs, vec_size, vec_size,
+                           u ? &ushl_op[size] : &sshl_op[size]);
+            return 0;
         }
 
         if (size == 3) {
@@ -4800,13 +5082,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                 neon_load_reg64(cpu_V0, rn + pass);
                 neon_load_reg64(cpu_V1, rm + pass);
                 switch (op) {
-                case NEON_3R_VSHL:
-                    if (u) {
-                        gen_helper_neon_shl_u64(cpu_V0, cpu_V1, cpu_V0);
-                    } else {
-                        gen_helper_neon_shl_s64(cpu_V0, cpu_V1, cpu_V0);
-                    }
-                    break;
                 case NEON_3R_VQSHL:
                     if (u) {
                         gen_helper_neon_qshl_u64(cpu_V0, cpu_env,
@@ -4841,7 +5116,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
         }
         pairwise = 0;
         switch (op) {
-        case NEON_3R_VSHL:
         case NEON_3R_VQSHL:
         case NEON_3R_VRSHL:
         case NEON_3R_VQRSHL:
@@ -4921,9 +5195,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
         case NEON_3R_VHSUB:
             GEN_NEON_INTEGER_OP(hsub);
             break;
-        case NEON_3R_VSHL:
-            GEN_NEON_INTEGER_OP(shl);
-            break;
         case NEON_3R_VQSHL:
             GEN_NEON_INTEGER_OP_ENV(qshl);
             break;
@@ -5332,9 +5603,9 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                             }
                         } else {
                             if (input_unsigned) {
-                                gen_helper_neon_shl_u64(cpu_V0, in, tmp64);
+                                gen_ushl_i64(cpu_V0, in, tmp64);
                             } else {
-                                gen_helper_neon_shl_s64(cpu_V0, in, tmp64);
+                                gen_sshl_i64(cpu_V0, in, tmp64);
                             }
                         }
                         tmp = tcg_temp_new_i32();
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index dedef62403..fcb3663903 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -1046,3 +1046,91 @@ void HELPER(gvec_fmlal_idx_a64)(void *vd, void *vn, void *vm,
     do_fmlal_idx(vd, vn, vm, &env->vfp.fp_status, desc,
                  get_flush_inputs_to_zero(&env->vfp.fp_status_f16));
 }
+
+void HELPER(gvec_sshl_b)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    int8_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz; ++i) {
+        int8_t mm = m[i];
+        int8_t nn = n[i];
+        int8_t res = 0;
+        if (mm >= 0) {
+            if (mm < 8) {
+                res = nn << mm;
+            }
+        } else {
+            res = nn >> (mm > -8 ? -mm : 7);
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_sshl_h)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    int16_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 2; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        int16_t nn = n[i];
+        int16_t res = 0;
+        if (mm >= 0) {
+            if (mm < 16) {
+                res = nn << mm;
+            }
+        } else {
+            res = nn >> (mm > -16 ? -mm : 15);
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_ushl_b)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    uint8_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz; ++i) {
+        int8_t mm = m[i];
+        uint8_t nn = n[i];
+        uint8_t res = 0;
+        if (mm >= 0) {
+            if (mm < 8) {
+                res = nn << mm;
+            }
+        } else {
+            if (mm > -8) {
+                res = nn >> -mm;
+            }
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
+void HELPER(gvec_ushl_h)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    uint16_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 2; ++i) {
+        int8_t mm = m[i];   /* only 8 bits of shift are significant */
+        uint16_t nn = n[i];
+        uint16_t res = 0;
+        if (mm >= 0) {
+            if (mm < 16) {
+                res = nn << mm;
+            }
+        } else {
+            if (mm > -16) {
+                res = nn >> -mm;
+            }
+        }
+        d[i] = res;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
-- 
2.17.1




* [PATCH 2/4] target/arm: Convert PMUL.8 to gvec
  2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
  2019-10-17  4:42 ` [PATCH 1/4] target/arm: Vectorize USHL and SSHL Richard Henderson
@ 2019-10-17  4:42 ` Richard Henderson
  2019-10-18 13:40   ` Alex Bennée
  2019-10-17  4:42 ` [PATCH 3/4] target/arm: Convert PMULL.64 " Richard Henderson
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2019-10-17  4:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

The gvec form will be needed for implementing SVE2.

Extend the implementation to operate on uint64_t instead of uint32_t.
Use a counted inner loop instead of terminating when op1 goes to zero,
looking toward the required implementation for ARMv8.4-DIT.
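
As an illustration of the byte-mask trick in the new helper: at
each step, (nn & 0x01...01) * 0xff smears the low bit of every
byte lane across that whole byte.  With two lanes for brevity,
nn = 0x0201 gives nn & 0x0101 = 0x0001, and 0x0001 * 0xff =
0x00ff -- an all-ones mask for lane 0 and nothing for lane 1, so
rr ^= mm & mask accumulates exactly the partial products of the
lanes whose bit was set.  Lanes cannot carry into their
neighbours, since 0x01 * 0xff = 0xff still fits in one byte.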

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/helper.h        |  3 ++-
 target/arm/neon_helper.c   | 22 ----------------------
 target/arm/translate-a64.c | 10 +++-------
 target/arm/translate.c     | 11 ++++-------
 target/arm/vec_helper.c    | 30 ++++++++++++++++++++++++++++++
 5 files changed, 39 insertions(+), 37 deletions(-)

diff --git a/target/arm/helper.h b/target/arm/helper.h
index fc0d594a14..800446e537 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -335,7 +335,6 @@ DEF_HELPER_2(neon_sub_u8, i32, i32, i32)
 DEF_HELPER_2(neon_sub_u16, i32, i32, i32)
 DEF_HELPER_2(neon_mul_u8, i32, i32, i32)
 DEF_HELPER_2(neon_mul_u16, i32, i32, i32)
-DEF_HELPER_2(neon_mul_p8, i32, i32, i32)
 DEF_HELPER_2(neon_mull_p8, i64, i32, i32)
 
 DEF_HELPER_2(neon_tst_u8, i32, i32, i32)
@@ -689,6 +688,8 @@ DEF_HELPER_FLAGS_4(gvec_sshl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 
+DEF_HELPER_FLAGS_4(gvec_pmul_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #include "helper-sve.h"
diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
index c581ffb7d3..9e7a9a1ac5 100644
--- a/target/arm/neon_helper.c
+++ b/target/arm/neon_helper.c
@@ -1131,28 +1131,6 @@ NEON_VOP(mul_u16, neon_u16, 2)
 
 /* Polynomial multiplication is like integer multiplication except the
    partial products are XORed, not added.  */
-uint32_t HELPER(neon_mul_p8)(uint32_t op1, uint32_t op2)
-{
-    uint32_t mask;
-    uint32_t result;
-    result = 0;
-    while (op1) {
-        mask = 0;
-        if (op1 & 1)
-            mask |= 0xff;
-        if (op1 & (1 << 8))
-            mask |= (0xff << 8);
-        if (op1 & (1 << 16))
-            mask |= (0xff << 16);
-        if (op1 & (1 << 24))
-            mask |= (0xff << 24);
-        result ^= op2 & mask;
-        op1 = (op1 >> 1) & 0x7f7f7f7f;
-        op2 = (op2 << 1) & 0xfefefefe;
-    }
-    return result;
-}
-
 uint64_t HELPER(neon_mull_p8)(uint32_t op1, uint32_t op2)
 {
     uint64_t result = 0;
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 255a168df6..04e25cfe06 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -11110,9 +11110,10 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
     case 0x13: /* MUL, PMUL */
         if (!u) { /* MUL */
             gen_gvec_fn3(s, is_q, rd, rn, rm, tcg_gen_gvec_mul, size);
-            return;
+        } else {  /* PMUL */
+            gen_gvec_op3_ool(s, is_q, rd, rn, rm, 0, gen_helper_gvec_pmul_b);
         }
-        break;
+        return;
     case 0x12: /* MLA, MLS */
         if (u) {
             gen_gvec_op3(s, is_q, rd, rn, rm, &mls_op[size]);
@@ -11242,11 +11243,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
                 genfn = fns[size][u];
                 break;
             }
-            case 0x13: /* MUL, PMUL */
-                assert(u); /* PMUL */
-                assert(size == 0);
-                genfn = gen_helper_neon_mul_p8;
-                break;
             case 0x16: /* SQDMULH, SQRDMULH */
             {
                 static NeonGenTwoOpEnvFn * const fns[2][2] = {
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 598bb1cc00..b66a2f6b71 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -5014,16 +5014,17 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
 
         case NEON_3R_VMUL: /* VMUL */
             if (u) {
-                /* Polynomial case allows only P8 and is handled below.  */
+                /* Polynomial case allows only P8.  */
                 if (size != 0) {
                     return 1;
                 }
+                tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, vec_size, vec_size,
+                                   0, gen_helper_gvec_pmul_b);
             } else {
                 tcg_gen_gvec_mul(size, rd_ofs, rn_ofs, rm_ofs,
                                  vec_size, vec_size);
-                return 0;
             }
-            break;
+            return 0;
 
         case NEON_3R_VML: /* VMLA, VMLS */
             tcg_gen_gvec_3(rd_ofs, rn_ofs, rm_ofs, vec_size, vec_size,
@@ -5213,10 +5214,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
             tmp2 = neon_load_reg(rd, pass);
             gen_neon_add(size, tmp, tmp2);
             break;
-        case NEON_3R_VMUL:
-            /* VMUL.P8; other cases already eliminated.  */
-            gen_helper_neon_mul_p8(tmp, tmp, tmp2);
-            break;
         case NEON_3R_VPMAX:
             GEN_NEON_INTEGER_OP(pmax);
             break;
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index fcb3663903..d401282c6f 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -1134,3 +1134,33 @@ void HELPER(gvec_ushl_h)(void *vd, void *vn, void *vm, uint32_t desc)
     }
     clear_tail(d, opr_sz, simd_maxsz(desc));
 }
+
+/*
+ * 8x8->8 polynomial multiply.
+ *
+ * Polynomial multiplication is like integer multiplication except the
+ * partial products are XORed, not added.
+ *
+ * TODO: expose this as a generic vector operation, as it is a common
+ * crypto building block.
+ */
+void HELPER(gvec_pmul_b)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, j, opr_sz = simd_oprsz(desc);
+    uint64_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 8; ++i) {
+        uint64_t nn = n[i];
+        uint64_t mm = m[i];
+        uint64_t rr = 0;
+
+        for (j = 0; j < 8; ++j) {
+            uint64_t mask = (nn & 0x0101010101010101ull) * 0xff;
+            rr ^= mm & mask;
+            mm = (mm << 1) & 0xfefefefefefefefeull;
+            nn = (nn >> 1) & 0x7f7f7f7f7f7f7f7full;
+        }
+        d[i] = rr;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
-- 
2.17.1




* [PATCH 3/4] target/arm: Convert PMULL.64 to gvec
  2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
  2019-10-17  4:42 ` [PATCH 1/4] target/arm: Vectorize USHL and SSHL Richard Henderson
  2019-10-17  4:42 ` [PATCH 2/4] target/arm: Convert PMUL.8 to gvec Richard Henderson
@ 2019-10-17  4:42 ` Richard Henderson
  2019-10-18 12:24   ` Alex Bennée
  2019-10-17  4:42 ` [PATCH 4/4] target/arm: Convert PMULL.8 " Richard Henderson
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2019-10-17  4:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

The gvec form will be needed for implementing SVE2.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/helper.h        |  4 +---
 target/arm/neon_helper.c   | 30 ------------------------------
 target/arm/translate-a64.c | 28 +++-------------------------
 target/arm/translate.c     | 16 ++--------------
 target/arm/vec_helper.c    | 33 +++++++++++++++++++++++++++++++++
 5 files changed, 39 insertions(+), 72 deletions(-)

diff --git a/target/arm/helper.h b/target/arm/helper.h
index 800446e537..d954399b7e 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -555,9 +555,6 @@ DEF_HELPER_FLAGS_3(crc32, TCG_CALL_NO_RWG_SE, i32, i32, i32, i32)
 DEF_HELPER_FLAGS_3(crc32c, TCG_CALL_NO_RWG_SE, i32, i32, i32, i32)
 DEF_HELPER_2(dc_zva, void, env, i64)
 
-DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
-DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
-
 DEF_HELPER_FLAGS_5(gvec_qrdmlah_s16, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_5(gvec_qrdmlsh_s16, TCG_CALL_NO_RWG,
@@ -689,6 +686,7 @@ DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 
 DEF_HELPER_FLAGS_4(gvec_pmul_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(gvec_pmull_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
index 9e7a9a1ac5..6a107da0e1 100644
--- a/target/arm/neon_helper.c
+++ b/target/arm/neon_helper.c
@@ -2152,33 +2152,3 @@ void HELPER(neon_zip16)(void *vd, void *vm)
     rm[0] = m0;
     rd[0] = d0;
 }
-
-/* Helper function for 64 bit polynomial multiply case:
- * perform PolynomialMult(op1, op2) and return either the top or
- * bottom half of the 128 bit result.
- */
-uint64_t HELPER(neon_pmull_64_lo)(uint64_t op1, uint64_t op2)
-{
-    int bitnum;
-    uint64_t res = 0;
-
-    for (bitnum = 0; bitnum < 64; bitnum++) {
-        if (op1 & (1ULL << bitnum)) {
-            res ^= op2 << bitnum;
-        }
-    }
-    return res;
-}
-uint64_t HELPER(neon_pmull_64_hi)(uint64_t op1, uint64_t op2)
-{
-    int bitnum;
-    uint64_t res = 0;
-
-    /* bit 0 of op1 can't influence the high 64 bits at all */
-    for (bitnum = 1; bitnum < 64; bitnum++) {
-        if (op1 & (1ULL << bitnum)) {
-            res ^= op2 >> (64 - bitnum);
-        }
-    }
-    return res;
-}
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 04e25cfe06..12588d18df 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -10598,30 +10598,6 @@ static void handle_3rd_narrowing(DisasContext *s, int is_q, int is_u, int size,
     clear_vec_high(s, is_q, rd);
 }
 
-static void handle_pmull_64(DisasContext *s, int is_q, int rd, int rn, int rm)
-{
-    /* PMULL of 64 x 64 -> 128 is an odd special case because it
-     * is the only three-reg-diff instruction which produces a
-     * 128-bit wide result from a single operation. However since
-     * it's possible to calculate the two halves more or less
-     * separately we just use two helper calls.
-     */
-    TCGv_i64 tcg_op1 = tcg_temp_new_i64();
-    TCGv_i64 tcg_op2 = tcg_temp_new_i64();
-    TCGv_i64 tcg_res = tcg_temp_new_i64();
-
-    read_vec_element(s, tcg_op1, rn, is_q, MO_64);
-    read_vec_element(s, tcg_op2, rm, is_q, MO_64);
-    gen_helper_neon_pmull_64_lo(tcg_res, tcg_op1, tcg_op2);
-    write_vec_element(s, tcg_res, rd, 0, MO_64);
-    gen_helper_neon_pmull_64_hi(tcg_res, tcg_op1, tcg_op2);
-    write_vec_element(s, tcg_res, rd, 1, MO_64);
-
-    tcg_temp_free_i64(tcg_op1);
-    tcg_temp_free_i64(tcg_op2);
-    tcg_temp_free_i64(tcg_res);
-}
-
 /* AdvSIMD three different
  *   31  30  29 28       24 23  22  21 20  16 15    12 11 10 9    5 4    0
  * +---+---+---+-----------+------+---+------+--------+-----+------+------+
@@ -10686,7 +10662,9 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
             if (!fp_access_check(s)) {
                 return;
             }
-            handle_pmull_64(s, is_q, rd, rn, rm);
+            /* The Q field specifies lo/hi half input for this insn.  */
+            gen_gvec_op3_ool(s, true, rd, rn, rm, is_q,
+                             gen_helper_gvec_pmull_q);
             return;
         }
         goto is_widening;
diff --git a/target/arm/translate.c b/target/arm/translate.c
index b66a2f6b71..4e34249672 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -5877,23 +5877,11 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                  * outside the loop below as it only performs a single pass.
                  */
                 if (op == 14 && size == 2) {
-                    TCGv_i64 tcg_rn, tcg_rm, tcg_rd;
-
                     if (!dc_isar_feature(aa32_pmull, s)) {
                         return 1;
                     }
-                    tcg_rn = tcg_temp_new_i64();
-                    tcg_rm = tcg_temp_new_i64();
-                    tcg_rd = tcg_temp_new_i64();
-                    neon_load_reg64(tcg_rn, rn);
-                    neon_load_reg64(tcg_rm, rm);
-                    gen_helper_neon_pmull_64_lo(tcg_rd, tcg_rn, tcg_rm);
-                    neon_store_reg64(tcg_rd, rd);
-                    gen_helper_neon_pmull_64_hi(tcg_rd, tcg_rn, tcg_rm);
-                    neon_store_reg64(tcg_rd, rd + 1);
-                    tcg_temp_free_i64(tcg_rn);
-                    tcg_temp_free_i64(tcg_rm);
-                    tcg_temp_free_i64(tcg_rd);
+                    tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
+                                       0, gen_helper_gvec_pmull_q);
                     return 0;
                 }
 
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index d401282c6f..5c1074374e 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -1164,3 +1164,36 @@ void HELPER(gvec_pmul_b)(void *vd, void *vn, void *vm, uint32_t desc)
     }
     clear_tail(d, opr_sz, simd_maxsz(desc));
 }
+
+/*
+ * 64x64->128 polynomial multiply.
+ * Because the lanes are not accessed in strict columns,
+ * this probably cannot be turned into a generic helper.
+ */
+void HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    intptr_t i, j, opr_sz = simd_oprsz(desc);
+    intptr_t hi = simd_data(desc);
+    uint64_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 8; i += 2) {
+        uint64_t nn = n[i + hi];
+        uint64_t mm = m[i + hi];
+        uint64_t rhi = 0;
+        uint64_t rlo = 0;
+
+        /* Bit 0 can only influence the low 64-bit result.  */
+        if (nn & 1) {
+            rlo = mm;
+        }
+
+        for (j = 1; j < 64; ++j) {
+            uint64_t mask = -((nn >> j) & 1);
+            rlo ^= (mm << j) & mask;
+            rhi ^= (mm >> (64 - j)) & mask;
+        }
+        d[i] = rlo;
+        d[i + 1] = rhi;
+    }
+    clear_tail(d, opr_sz, simd_maxsz(desc));
+}
-- 
2.17.1




* [PATCH 4/4] target/arm: Convert PMULL.8 to gvec
  2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
                   ` (2 preceding siblings ...)
  2019-10-17  4:42 ` [PATCH 3/4] target/arm: Convert PMULL.64 " Richard Henderson
@ 2019-10-17  4:42 ` Richard Henderson
  2019-10-18 17:54   ` Alex Bennée
  2019-10-17  5:21 ` [PATCH 0/4] target/arm vector improvements no-reply
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2019-10-17  4:42 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm

We still need two different helpers, since NEON and SVE2 get the
inputs from different locations within the source vector.  However,
we can convert both to the same internal form for computation.

The sve2 helper is not used yet, but adding it with this patch
helps illustrate why the neon changes are helpful.
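
Concretely, the shared internal form keeps one source byte in
the low half of each 16-bit field, four lanes per uint64_t:

    source bytes:    b3    b2    b1    b0
    internal form:  00b3  00b2  00b1  00b0

The neon helper expands the eight consecutive bytes of the
selected 64-bit half into this form; the sve2 helper merely
masks, since the even (or odd) bytes of each doubleword already
sit in these positions.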

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/helper-sve.h    |  2 ++
 target/arm/helper.h        |  3 +-
 target/arm/neon_helper.c   | 32 --------------------
 target/arm/translate-a64.c | 27 +++++++++++------
 target/arm/translate.c     | 26 ++++++++---------
 target/arm/vec_helper.c    | 60 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 95 insertions(+), 55 deletions(-)

diff --git a/target/arm/helper-sve.h b/target/arm/helper-sve.h
index 9e79182ab4..2f47279155 100644
--- a/target/arm/helper-sve.h
+++ b/target/arm/helper-sve.h
@@ -1574,3 +1574,5 @@ DEF_HELPER_FLAGS_6(sve_stdd_le_zd, TCG_CALL_NO_WG,
                    void, env, ptr, ptr, ptr, tl, i32)
 DEF_HELPER_FLAGS_6(sve_stdd_be_zd, TCG_CALL_NO_WG,
                    void, env, ptr, ptr, ptr, tl, i32)
+
+DEF_HELPER_FLAGS_4(sve2_pmull_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index d954399b7e..8a8517cf34 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -335,7 +335,6 @@ DEF_HELPER_2(neon_sub_u8, i32, i32, i32)
 DEF_HELPER_2(neon_sub_u16, i32, i32, i32)
 DEF_HELPER_2(neon_mul_u8, i32, i32, i32)
 DEF_HELPER_2(neon_mul_u16, i32, i32, i32)
-DEF_HELPER_2(neon_mull_p8, i64, i32, i32)
 
 DEF_HELPER_2(neon_tst_u8, i32, i32, i32)
 DEF_HELPER_2(neon_tst_u16, i32, i32, i32)
@@ -688,6 +687,8 @@ DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_4(gvec_pmul_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_4(gvec_pmull_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
 
+DEF_HELPER_FLAGS_4(neon_pmull_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #include "helper-sve.h"
diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
index 6a107da0e1..c7a8438b42 100644
--- a/target/arm/neon_helper.c
+++ b/target/arm/neon_helper.c
@@ -1129,38 +1129,6 @@ NEON_VOP(mul_u8, neon_u8, 4)
 NEON_VOP(mul_u16, neon_u16, 2)
 #undef NEON_FN
 
-/* Polynomial multiplication is like integer multiplication except the
-   partial products are XORed, not added.  */
-uint64_t HELPER(neon_mull_p8)(uint32_t op1, uint32_t op2)
-{
-    uint64_t result = 0;
-    uint64_t mask;
-    uint64_t op2ex = op2;
-    op2ex = (op2ex & 0xff) |
-        ((op2ex & 0xff00) << 8) |
-        ((op2ex & 0xff0000) << 16) |
-        ((op2ex & 0xff000000) << 24);
-    while (op1) {
-        mask = 0;
-        if (op1 & 1) {
-            mask |= 0xffff;
-        }
-        if (op1 & (1 << 8)) {
-            mask |= (0xffffU << 16);
-        }
-        if (op1 & (1 << 16)) {
-            mask |= (0xffffULL << 32);
-        }
-        if (op1 & (1 << 24)) {
-            mask |= (0xffffULL << 48);
-        }
-        result ^= op2ex & mask;
-        op1 = (op1 >> 1) & 0x7f7f7f7f;
-        op2ex <<= 1;
-    }
-    return result;
-}
-
 #define NEON_FN(dest, src1, src2) dest = (src1 & src2) ? -1 : 0
 NEON_VOP(tst_u8, neon_u8, 4)
 NEON_VOP(tst_u16, neon_u16, 2)
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 12588d18df..2934e4fc16 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -10483,10 +10483,6 @@ static void handle_3rd_widening(DisasContext *s, int is_q, int is_u, int size,
                 gen_helper_neon_addl_saturate_s32(tcg_passres, cpu_env,
                                                   tcg_passres, tcg_passres);
                 break;
-            case 14: /* PMULL */
-                assert(size == 0);
-                gen_helper_neon_mull_p8(tcg_passres, tcg_op1, tcg_op2);
-                break;
             default:
                 g_assert_not_reached();
             }
@@ -10650,11 +10646,21 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
         handle_3rd_narrowing(s, is_q, is_u, size, opcode, rd, rn, rm);
         break;
     case 14: /* PMULL, PMULL2 */
-        if (is_u || size == 1 || size == 2) {
+        if (is_u) {
             unallocated_encoding(s);
             return;
         }
-        if (size == 3) {
+        switch (size) {
+        case 0: /* PMULL.P8 */
+            if (!fp_access_check(s)) {
+                return;
+            }
+            /* The Q field specifies lo/hi half input for this insn.  */
+            gen_gvec_op3_ool(s, true, rd, rn, rm, is_q,
+                             gen_helper_neon_pmull_h);
+            break;
+
+        case 3: /* PMULL.P64 */
             if (!dc_isar_feature(aa64_pmull, s)) {
                 unallocated_encoding(s);
                 return;
@@ -10665,9 +10671,13 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
             /* The Q field specifies lo/hi half input for this insn.  */
             gen_gvec_op3_ool(s, true, rd, rn, rm, is_q,
                              gen_helper_gvec_pmull_q);
-            return;
+            break;
+
+        default:
+            unallocated_encoding(s);
+            break;
         }
-        goto is_widening;
+        return;
     case 9: /* SQDMLAL, SQDMLAL2 */
     case 11: /* SQDMLSL, SQDMLSL2 */
     case 13: /* SQDMULL, SQDMULL2 */
@@ -10688,7 +10698,6 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
             unallocated_encoding(s);
             return;
         }
-    is_widening:
         if (!fp_access_check(s)) {
             return;
         }
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 4e34249672..c3abf130cc 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -5873,15 +5873,20 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                     return 1;
                 }
 
-                /* Handle VMULL.P64 (Polynomial 64x64 to 128 bit multiply)
-                 * outside the loop below as it only performs a single pass.
-                 */
-                if (op == 14 && size == 2) {
-                    if (!dc_isar_feature(aa32_pmull, s)) {
-                        return 1;
+                /* Handle polynomial VMULL in a single pass.  */
+                if (op == 14) {
+                    if (size == 0) {
+                        /* VMULL.P8 */
+                        tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
+                                           0, gen_helper_neon_pmull_h);
+                    } else {
+                        /* VMULL.P64 */
+                        if (!dc_isar_feature(aa32_pmull, s)) {
+                            return 1;
+                        }
+                        tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
+                                           0, gen_helper_gvec_pmull_q);
                     }
-                    tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
-                                       0, gen_helper_gvec_pmull_q);
                     return 0;
                 }
 
@@ -5959,11 +5964,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
                         /* VMLAL, VQDMLAL, VMLSL, VQDMLSL, VMULL, VQDMULL */
                         gen_neon_mull(cpu_V0, tmp, tmp2, size, u);
                         break;
-                    case 14: /* Polynomial VMULL */
-                        gen_helper_neon_mull_p8(cpu_V0, tmp, tmp2);
-                        tcg_temp_free_i32(tmp2);
-                        tcg_temp_free_i32(tmp);
-                        break;
                     default: /* 15 is RESERVED: caught earlier  */
                         abort();
                     }
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index 5c1074374e..04b4d7402d 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -1197,3 +1197,63 @@ void HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t desc)
     }
     clear_tail(d, opr_sz, simd_maxsz(desc));
 }
+
+/*
+ * 8x8->16 polynomial multiply.
+ *
+ * The byte inputs are expanded to (or extracted from) half-words.
+ * Note that neon and sve2 get the inputs from different positions.
+ * This allows 4 bytes to be processed in parallel with uint64_t.
+ */
+
+static uint64_t expand_byte_to_half(uint64_t x)
+{
+    return  (x & 0x000000ff)
+         | ((x & 0x0000ff00) << 8)
+         | ((x & 0x00ff0000) << 16)
+         | ((x & 0xff000000) << 24);
+}
+
+static uint64_t pmull_h(uint64_t op1, uint64_t op2)
+{
+    uint64_t result = 0;
+    int i;
+
+    for (i = 0; i < 8; ++i) {
+        uint64_t mask = (op1 & 0x0001000100010001ull) * 0xffff;
+        result ^= op2 & mask;
+        op1 >>= 1;
+        op2 <<= 1;
+    }
+    return result;
+}
+
+void HELPER(neon_pmull_h)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    int hi = simd_data(desc);
+    uint64_t *d = vd, *n = vn, *m = vm;
+    uint64_t nn = n[hi], mm = m[hi];
+
+    d[0] = pmull_h(expand_byte_to_half(nn), expand_byte_to_half(mm));
+    nn >>= 32;
+    mm >>= 32;
+    d[1] = pmull_h(expand_byte_to_half(nn), expand_byte_to_half(mm));
+
+    clear_tail(d, 16, simd_maxsz(desc));
+}
+
+#ifdef TARGET_AARCH64
+void HELPER(sve2_pmull_h)(void *vd, void *vn, void *vm, uint32_t desc)
+{
+    int shift = simd_data(desc) * 8;
+    intptr_t i, opr_sz = simd_oprsz(desc);
+    uint64_t *d = vd, *n = vn, *m = vm;
+
+    for (i = 0; i < opr_sz / 8; ++i) {
+        uint64_t nn = (n[i] >> shift) & 0x00ff00ff00ff00ffull;
+        uint64_t mm = (m[i] >> shift) & 0x00ff00ff00ff00ffull;
+
+        d[i] = pmull_h(nn, mm);
+    }
+}
+#endif
-- 
2.17.1




* Re: [PATCH 0/4] target/arm vector improvements
  2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
                   ` (3 preceding siblings ...)
  2019-10-17  4:42 ` [PATCH 4/4] target/arm: Convert PMULL.8 " Richard Henderson
@ 2019-10-17  5:21 ` no-reply
  2019-10-18 17:58 ` Alex Bennée
  2019-11-18 16:26 ` Peter Maydell
  6 siblings, 0 replies; 15+ messages in thread
From: no-reply @ 2019-10-17  5:21 UTC (permalink / raw)
  To: richard.henderson; +Cc: peter.maydell, qemu-arm, qemu-devel

Patchew URL: https://patchew.org/QEMU/20191017044232.27601-1-richard.henderson@linaro.org/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [PATCH 0/4] target/arm vector improvements
Type: series
Message-id: 20191017044232.27601-1-richard.henderson@linaro.org

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
3ea6653 target/arm: Convert PMULL.8 to gvec
b6cf8ea target/arm: Convert PMULL.64 to gvec
5eddaf2 target/arm: Convert PMUL.8 to gvec
bd8f967 target/arm: Vectorize USHL and SSHL

=== OUTPUT BEGIN ===
1/4 Checking commit bd8f967551b6 (target/arm: Vectorize USHL and SSHL)
ERROR: trailing statements should be on next line
#160: FILE: target/arm/translate.c:3583:
+            case 2: gen_ushl_i32(var, var, shift); break;

ERROR: trailing statements should be on next line
#167: FILE: target/arm/translate.c:3589:
+            case 2: gen_sshl_i32(var, var, shift); break;

total: 2 errors, 0 warnings, 571 lines checked

Patch 1/4 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/4 Checking commit 5eddaf2661e3 (target/arm: Convert PMUL.8 to gvec)
3/4 Checking commit b6cf8ea095db (target/arm: Convert PMULL.64 to gvec)
4/4 Checking commit 3ea665336e6c (target/arm: Convert PMULL.8 to gvec)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20191017044232.27601-1-richard.henderson@linaro.org/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com


* Re: [PATCH 1/4] target/arm: Vectorize USHL and SSHL
  2019-10-17  4:42 ` [PATCH 1/4] target/arm: Vectorize USHL and SSHL Richard Henderson
@ 2019-10-17 16:01   ` Alex Bennée
  2019-10-18 14:47     ` Richard Henderson
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Bennée @ 2019-10-17 16:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> These instructions shift left or right depending on the sign
> of the input, and 7 bits are significant to the shift.  This
> requires several masks and selects in addition to the actual
> shifts to form the complete answer.
>
> That said, the operation is still a small improvement even for
> two 64-bit elements -- 13 vector operations instead of 2 * 7
> integer operations.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> v2: Fix operand ordering for aa32 VSHL.
> ---
>  target/arm/helper.h        |  11 +-
>  target/arm/translate.h     |   6 +
>  target/arm/neon_helper.c   |  33 ----
>  target/arm/translate-a64.c |  18 +--
>  target/arm/translate.c     | 301 +++++++++++++++++++++++++++++++++++--
>  target/arm/vec_helper.c    |  88 +++++++++++
>  6 files changed, 391 insertions(+), 66 deletions(-)
>
> diff --git a/target/arm/helper.h b/target/arm/helper.h
> index 1fb2cb5a77..fc0d594a14 100644
> --- a/target/arm/helper.h
> +++ b/target/arm/helper.h
> @@ -296,14 +296,8 @@ DEF_HELPER_2(neon_abd_s16, i32, i32, i32)
>  DEF_HELPER_2(neon_abd_u32, i32, i32, i32)
>  DEF_HELPER_2(neon_abd_s32, i32, i32, i32)
>
> -DEF_HELPER_2(neon_shl_u8, i32, i32, i32)
> -DEF_HELPER_2(neon_shl_s8, i32, i32, i32)
>  DEF_HELPER_2(neon_shl_u16, i32, i32, i32)
>  DEF_HELPER_2(neon_shl_s16, i32, i32, i32)
> -DEF_HELPER_2(neon_shl_u32, i32, i32, i32)
> -DEF_HELPER_2(neon_shl_s32, i32, i32, i32)
> -DEF_HELPER_2(neon_shl_u64, i64, i64, i64)
> -DEF_HELPER_2(neon_shl_s64, i64, i64, i64)
>  DEF_HELPER_2(neon_rshl_u8, i32, i32, i32)
>  DEF_HELPER_2(neon_rshl_s8, i32, i32, i32)
>  DEF_HELPER_2(neon_rshl_u16, i32, i32, i32)
> @@ -690,6 +684,11 @@ DEF_HELPER_FLAGS_2(frint64_s, TCG_CALL_NO_RWG, f32, f32, ptr)
>  DEF_HELPER_FLAGS_2(frint32_d, TCG_CALL_NO_RWG, f64, f64, ptr)
>  DEF_HELPER_FLAGS_2(frint64_d, TCG_CALL_NO_RWG, f64, f64, ptr)
>
> +DEF_HELPER_FLAGS_4(gvec_sshl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_sshl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
>  #ifdef TARGET_AARCH64
>  #include "helper-a64.h"
>  #include "helper-sve.h"
> diff --git a/target/arm/translate.h b/target/arm/translate.h
> index dd24f91f26..0c4e6e4bbd 100644
> --- a/target/arm/translate.h
> +++ b/target/arm/translate.h
> @@ -274,6 +274,8 @@ uint64_t vfp_expand_imm(int size, uint8_t imm8);
>  extern const GVecGen3 mla_op[4];
>  extern const GVecGen3 mls_op[4];
>  extern const GVecGen3 cmtst_op[4];
> +extern const GVecGen3 sshl_op[4];
> +extern const GVecGen3 ushl_op[4];
>  extern const GVecGen2i ssra_op[4];
>  extern const GVecGen2i usra_op[4];
>  extern const GVecGen2i sri_op[4];
> @@ -283,6 +285,10 @@ extern const GVecGen4 sqadd_op[4];
>  extern const GVecGen4 uqsub_op[4];
>  extern const GVecGen4 sqsub_op[4];
>  void gen_cmtst_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
> +void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b);
> +void gen_sshl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b);
> +void gen_ushl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
> +void gen_sshl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b);
>
>  /*
>   * Forward to the isar_feature_* tests given a DisasContext pointer.
> diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
> index 4259056723..c581ffb7d3 100644
> --- a/target/arm/neon_helper.c
> +++ b/target/arm/neon_helper.c
> @@ -615,24 +615,9 @@ NEON_VOP(abd_u32, neon_u32, 1)
>      } else { \
>          dest = src1 << tmp; \
>      }} while (0)
> -NEON_VOP(shl_u8, neon_u8, 4)
>  NEON_VOP(shl_u16, neon_u16, 2)
> -NEON_VOP(shl_u32, neon_u32, 1)
>  #undef NEON_FN
>
> -uint64_t HELPER(neon_shl_u64)(uint64_t val, uint64_t shiftop)
> -{
> -    int8_t shift = (int8_t)shiftop;
> -    if (shift >= 64 || shift <= -64) {
> -        val = 0;
> -    } else if (shift < 0) {
> -        val >>= -shift;
> -    } else {
> -        val <<= shift;
> -    }
> -    return val;
> -}
> -
>  #define NEON_FN(dest, src1, src2) do { \
>      int8_t tmp; \
>      tmp = (int8_t)src2; \
> @@ -645,27 +630,9 @@ uint64_t HELPER(neon_shl_u64)(uint64_t val, uint64_t shiftop)
>      } else { \
>          dest = src1 << tmp; \
>      }} while (0)
> -NEON_VOP(shl_s8, neon_s8, 4)
>  NEON_VOP(shl_s16, neon_s16, 2)
> -NEON_VOP(shl_s32, neon_s32, 1)
>  #undef NEON_FN
>
> -uint64_t HELPER(neon_shl_s64)(uint64_t valop, uint64_t shiftop)
> -{
> -    int8_t shift = (int8_t)shiftop;
> -    int64_t val = valop;
> -    if (shift >= 64) {
> -        val = 0;
> -    } else if (shift <= -64) {
> -        val >>= 63;
> -    } else if (shift < 0) {
> -        val >>= -shift;
> -    } else {
> -        val <<= shift;
> -    }
> -    return val;
> -}
> -
>  #define NEON_FN(dest, src1, src2) do { \
>      int8_t tmp; \
>      tmp = (int8_t)src2; \
> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
> index 2d6cd09634..255a168df6 100644
> --- a/target/arm/translate-a64.c
> +++ b/target/arm/translate-a64.c
> @@ -8685,9 +8685,9 @@ static void handle_3same_64(DisasContext *s, int opcode, bool u,
>          break;
>      case 0x8: /* SSHL, USHL */
>          if (u) {
> -            gen_helper_neon_shl_u64(tcg_rd, tcg_rn, tcg_rm);
> +            gen_ushl_i64(tcg_rd, tcg_rn, tcg_rm);
>          } else {
> -            gen_helper_neon_shl_s64(tcg_rd, tcg_rn, tcg_rm);
> +            gen_sshl_i64(tcg_rd, tcg_rn, tcg_rm);
>          }
>          break;
>      case 0x9: /* SQSHL, UQSHL */
> @@ -11082,6 +11082,10 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
>                         is_q ? 16 : 8, vec_full_reg_size(s),
>                         (u ? uqsub_op : sqsub_op) + size);
>          return;
> +    case 0x08: /* SSHL, USHL */
> +        gen_gvec_op3(s, is_q, rd, rn, rm,
> +                     u ? &ushl_op[size] : &sshl_op[size]);
> +        return;
>      case 0x0c: /* SMAX, UMAX */
>          if (u) {
>              gen_gvec_fn3(s, is_q, rd, rn, rm, tcg_gen_gvec_umax, size);
> @@ -11197,16 +11201,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
>                  genfn = fns[size][u];
>                  break;
>              }
> -            case 0x8: /* SSHL, USHL */
> -            {
> -                static NeonGenTwoOpFn * const fns[3][2] = {
> -                    { gen_helper_neon_shl_s8, gen_helper_neon_shl_u8 },
> -                    { gen_helper_neon_shl_s16, gen_helper_neon_shl_u16 },
> -                    { gen_helper_neon_shl_s32, gen_helper_neon_shl_u32 },
> -                };
> -                genfn = fns[size][u];
> -                break;
> -            }
>              case 0x9: /* SQSHL, UQSHL */
>              {
>                  static NeonGenTwoOpEnvFn * const fns[3][2] = {
> diff --git a/target/arm/translate.c b/target/arm/translate.c
> index 698c594e8c..598bb1cc00 100644
> --- a/target/arm/translate.c
> +++ b/target/arm/translate.c
> @@ -3580,13 +3580,13 @@ static inline void gen_neon_shift_narrow(int size, TCGv_i32 var, TCGv_i32 shift,
>          if (u) {
>              switch (size) {
>              case 1: gen_helper_neon_shl_u16(var, var, shift); break;
> -            case 2: gen_helper_neon_shl_u32(var, var, shift); break;
> +            case 2: gen_ushl_i32(var, var, shift); break;
>              default: abort();
>              }
>          } else {
>              switch (size) {
>              case 1: gen_helper_neon_shl_s16(var, var, shift); break;
> -            case 2: gen_helper_neon_shl_s32(var, var, shift); break;
> +            case 2: gen_sshl_i32(var, var, shift); break;
>              default: abort();
>              }
>          }
> @@ -4389,6 +4389,282 @@ const GVecGen3 cmtst_op[4] = {
>        .vece = MO_64 },
>  };
>
> +void gen_ushl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)

nit: this would have been nicer to read if the operands were dst, src, shift
or some such.
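
(i.e. something like

    void gen_ushl_i32(TCGv_i32 dst, TCGv_i32 src, TCGv_i32 shift);

purely illustrative -- the patch itself uses d/a/b.)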

> +{
> +    TCGv_i32 lval = tcg_temp_new_i32();
> +    TCGv_i32 rval = tcg_temp_new_i32();
> +    TCGv_i32 lsh = tcg_temp_new_i32();
> +    TCGv_i32 rsh = tcg_temp_new_i32();
> +    TCGv_i32 zero = tcg_const_i32(0);
> +    TCGv_i32 max = tcg_const_i32(32);
> +
> +    /*
> +     * Rely on the TCG guarantee that out of range shifts produce
> +     * unspecified results, not undefined behaviour (i.e. no trap).
> +     * Discard out-of-range results after the fact.
> +     */
> +    tcg_gen_ext8s_i32(lsh, b);
> +    tcg_gen_neg_i32(rsh, lsh);
> +    tcg_gen_shl_i32(lval, a, lsh);
> +    tcg_gen_shr_i32(rval, a, rsh);
> +    tcg_gen_movcond_i32(TCG_COND_LTU, d, lsh, max, lval, zero);
> +    tcg_gen_movcond_i32(TCG_COND_LTU, d, rsh, max, rval, d);

Do these get dead-coded away if the shift is a const?

> +
> +    tcg_temp_free_i32(lval);
> +    tcg_temp_free_i32(rval);
> +    tcg_temp_free_i32(lsh);
> +    tcg_temp_free_i32(rsh);
> +    tcg_temp_free_i32(zero);
> +    tcg_temp_free_i32(max);
> +}
> +
> +void gen_ushl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 lval = tcg_temp_new_i64();
> +    TCGv_i64 rval = tcg_temp_new_i64();
> +    TCGv_i64 lsh = tcg_temp_new_i64();
> +    TCGv_i64 rsh = tcg_temp_new_i64();
> +    TCGv_i64 zero = tcg_const_i64(0);
> +    TCGv_i64 max = tcg_const_i64(64);
> +
> +    /*
> +     * Rely on the TCG guarantee that out of range shifts produce
> +     * unspecified results, not undefined behaviour (i.e. no trap).
> +     * Discard out-of-range results after the fact.
> +     */
> +    tcg_gen_ext8s_i64(lsh, b);
> +    tcg_gen_neg_i64(rsh, lsh);
> +    tcg_gen_shl_i64(lval, a, lsh);
> +    tcg_gen_shr_i64(rval, a, rsh);
> +    tcg_gen_movcond_i64(TCG_COND_LTU, d, lsh, max, lval, zero);
> +    tcg_gen_movcond_i64(TCG_COND_LTU, d, rsh, max, rval, d);
> +
> +    tcg_temp_free_i64(lval);
> +    tcg_temp_free_i64(rval);
> +    tcg_temp_free_i64(lsh);
> +    tcg_temp_free_i64(rsh);
> +    tcg_temp_free_i64(zero);
> +    tcg_temp_free_i64(max);
> +}
> +
> +static void gen_ushl_vec(unsigned vece, TCGv_vec d, TCGv_vec a, TCGv_vec b)
> +{
> +    TCGv_vec lval = tcg_temp_new_vec_matching(d);
> +    TCGv_vec rval = tcg_temp_new_vec_matching(d);
> +    TCGv_vec lsh = tcg_temp_new_vec_matching(d);
> +    TCGv_vec rsh = tcg_temp_new_vec_matching(d);
> +    TCGv_vec msk, max;
> +
> +    /*
> +     * Rely on the TCG guarantee that out of range shifts produce
> +     * unspecified results, not undefined behaviour (i.e. no trap).
> +     * Discard out-of-range results after the fact.
> +     */
> +    tcg_gen_neg_vec(vece, rsh, b);
> +    if (vece == MO_8) {
> +        tcg_gen_mov_vec(lsh, b);
> +    } else {
> +        msk = tcg_temp_new_vec_matching(d);
> +        tcg_gen_dupi_vec(vece, msk, 0xff);
> +        tcg_gen_and_vec(vece, lsh, b, msk);
> +        tcg_gen_and_vec(vece, rsh, rsh, msk);
> +        tcg_temp_free_vec(msk);
> +    }
> +
> +    /*
> +     * Perform possibly out of range shifts, trusting that the operation
> +     * does not trap.  Discard unused results after the fact.
> +     */
> +    tcg_gen_shlv_vec(vece, lval, a, lsh);
> +    tcg_gen_shrv_vec(vece, rval, a, rsh);
> +
> +    max = tcg_temp_new_vec_matching(d);
> +    tcg_gen_dupi_vec(vece, max, 8 << vece);
> +
> +    /*
> +     * The choice of LT (signed) and GEU (unsigned) is biased toward
> +     * the instructions of the x86_64 host.  For MO_8, the whole byte
> +     * is significant so we must use an unsigned compare; otherwise we
> +     * have already masked to a byte and so a signed compare works.
> +     * Other tcg hosts have a full set of comparisons and do not care.
> +     */
> +    if (vece == MO_8) {
> +        tcg_gen_cmp_vec(TCG_COND_GEU, vece, lsh, lsh, max);
> +        tcg_gen_cmp_vec(TCG_COND_GEU, vece, rsh, rsh, max);
> +        tcg_gen_andc_vec(vece, lval, lval, lsh);
> +        tcg_gen_andc_vec(vece, rval, rval, rsh);
> +    } else {
> +        tcg_gen_cmp_vec(TCG_COND_LT, vece, lsh, lsh, max);
> +        tcg_gen_cmp_vec(TCG_COND_LT, vece, rsh, rsh, max);
> +        tcg_gen_and_vec(vece, lval, lval, lsh);
> +        tcg_gen_and_vec(vece, rval, rval, rsh);
> +    }
> +    tcg_gen_or_vec(vece, d, lval, rval);
> +
> +    tcg_temp_free_vec(max);
> +    tcg_temp_free_vec(lval);
> +    tcg_temp_free_vec(rval);
> +    tcg_temp_free_vec(lsh);
> +    tcg_temp_free_vec(rsh);
> +}
> +
> +static const TCGOpcode ushl_list[] = {
> +    INDEX_op_neg_vec, INDEX_op_shlv_vec,
> +    INDEX_op_shrv_vec, INDEX_op_cmp_vec, 0
> +};
> +
> +const GVecGen3 ushl_op[4] = {
> +    { .fniv = gen_ushl_vec,
> +      .fno = gen_helper_gvec_ushl_b,
> +      .opt_opc = ushl_list,
> +      .vece = MO_8 },
> +    { .fniv = gen_ushl_vec,
> +      .fno = gen_helper_gvec_ushl_h,
> +      .opt_opc = ushl_list,
> +      .vece = MO_16 },
> +    { .fni4 = gen_ushl_i32,
> +      .fniv = gen_ushl_vec,
> +      .opt_opc = ushl_list,
> +      .vece = MO_32 },
> +    { .fni8 = gen_ushl_i64,
> +      .fniv = gen_ushl_vec,
> +      .opt_opc = ushl_list,
> +      .vece = MO_64 },
> +};
> +
> +void gen_sshl_i32(TCGv_i32 d, TCGv_i32 a, TCGv_i32 b)
> +{
> +    TCGv_i32 lval = tcg_temp_new_i32();
> +    TCGv_i32 rval = tcg_temp_new_i32();
> +    TCGv_i32 lsh = tcg_temp_new_i32();
> +    TCGv_i32 rsh = tcg_temp_new_i32();
> +    TCGv_i32 zero = tcg_const_i32(0);
> +    TCGv_i32 max = tcg_const_i32(31);
> +
> +    /*
> +     * Rely on the TCG guarantee that out of range shifts produce
> +     * unspecified results, not undefined behaviour (i.e. no trap).
> +     * Discard out-of-range results after the fact.
> +     */
> +    tcg_gen_ext8s_i32(lsh, b);
> +    tcg_gen_neg_i32(rsh, lsh);
> +    tcg_gen_shl_i32(lval, a, lsh);
> +    tcg_gen_umin_i32(rsh, rsh, max);
> +    tcg_gen_sar_i32(rval, a, rsh);
> +    tcg_gen_movcond_i32(TCG_COND_LEU, lval, lsh, max, lval, zero);
> +    tcg_gen_movcond_i32(TCG_COND_LT, d, lsh, zero, rval, lval);
> +
> +    tcg_temp_free_i32(lval);
> +    tcg_temp_free_i32(rval);
> +    tcg_temp_free_i32(lsh);
> +    tcg_temp_free_i32(rsh);
> +    tcg_temp_free_i32(zero);
> +    tcg_temp_free_i32(max);
> +}
> +
> +void gen_sshl_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
> +{
> +    TCGv_i64 lval = tcg_temp_new_i64();
> +    TCGv_i64 rval = tcg_temp_new_i64();
> +    TCGv_i64 lsh = tcg_temp_new_i64();
> +    TCGv_i64 rsh = tcg_temp_new_i64();
> +    TCGv_i64 zero = tcg_const_i64(0);
> +    TCGv_i64 max = tcg_const_i64(63);
> +
> +    /*
> +     * Rely on the TCG guarantee that out of range shifts produce
> +     * unspecified results, not undefined behaviour (i.e. no trap).
> +     * Discard out-of-range results after the fact.
> +     */
> +    tcg_gen_ext8s_i64(lsh, b);
> +    tcg_gen_neg_i64(rsh, lsh);
> +    tcg_gen_shl_i64(lval, a, lsh);
> +    tcg_gen_umin_i64(rsh, rsh, max);
> +    tcg_gen_sar_i64(rval, a, rsh);
> +    tcg_gen_movcond_i64(TCG_COND_LEU, lval, lsh, max, lval, zero);
> +    tcg_gen_movcond_i64(TCG_COND_LT, d, lsh, zero, rval, lval);
> +
> +    tcg_temp_free_i64(lval);
> +    tcg_temp_free_i64(rval);
> +    tcg_temp_free_i64(lsh);
> +    tcg_temp_free_i64(rsh);
> +    tcg_temp_free_i64(zero);
> +    tcg_temp_free_i64(max);
> +}
> +
> +static void gen_sshl_vec(unsigned vece, TCGv_vec d, TCGv_vec a, TCGv_vec b)
> +{
> +    TCGv_vec lval = tcg_temp_new_vec_matching(d);
> +    TCGv_vec rval = tcg_temp_new_vec_matching(d);
> +    TCGv_vec lsh = tcg_temp_new_vec_matching(d);
> +    TCGv_vec rsh = tcg_temp_new_vec_matching(d);
> +    TCGv_vec tmp = tcg_temp_new_vec_matching(d);
> +
> +    /*
> +     * Rely on the TCG guarantee that out of range shifts produce
> +     * unspecified results, not undefined behaviour (i.e. no trap).
> +     * Discard out-of-range results after the fact.
> +     */
> +    tcg_gen_neg_vec(vece, rsh, b);
> +    if (vece == MO_8) {
> +        tcg_gen_mov_vec(lsh, b);
> +    } else {
> +        tcg_gen_dupi_vec(vece, tmp, 0xff);
> +        tcg_gen_and_vec(vece, lsh, b, tmp);
> +        tcg_gen_and_vec(vece, rsh, rsh, tmp);
> +    }
> +
> +    /* Bound rsh so out of bound right shift gets -1.  */
> +    tcg_gen_dupi_vec(vece, tmp, (8 << vece) - 1);
> +    tcg_gen_umin_vec(vece, rsh, rsh, tmp);
> +    tcg_gen_cmp_vec(TCG_COND_GT, vece, tmp, lsh, tmp);
> +
> +    tcg_gen_shlv_vec(vece, lval, a, lsh);
> +    tcg_gen_sarv_vec(vece, rval, a, rsh);
> +
> +    /* Select in-bound left shift.  */
> +    tcg_gen_andc_vec(vece, lval, lval, tmp);
> +
> +    /* Select between left and right shift.  */
> +    if (vece == MO_8) {
> +        tcg_gen_dupi_vec(vece, tmp, 0);
> +        tcg_gen_cmpsel_vec(TCG_COND_LT, vece, d, lsh, tmp, rval, lval);
> +    } else {
> +        tcg_gen_dupi_vec(vece, tmp, 0x80);
> +        tcg_gen_cmpsel_vec(TCG_COND_LT, vece, d, lsh, tmp, lval, rval);
> +    }
> +
> +    tcg_temp_free_vec(lval);
> +    tcg_temp_free_vec(rval);
> +    tcg_temp_free_vec(lsh);
> +    tcg_temp_free_vec(rsh);
> +    tcg_temp_free_vec(tmp);
> +}
> +
> +static const TCGOpcode sshl_list[] = {
> +    INDEX_op_neg_vec, INDEX_op_umin_vec, INDEX_op_shlv_vec,
> +    INDEX_op_sarv_vec, INDEX_op_cmp_vec, INDEX_op_cmpsel_vec, 0
> +};
> +
> +const GVecGen3 sshl_op[4] = {
> +    { .fniv = gen_sshl_vec,
> +      .fno = gen_helper_gvec_sshl_b,
> +      .opt_opc = sshl_list,
> +      .vece = MO_8 },
> +    { .fniv = gen_sshl_vec,
> +      .fno = gen_helper_gvec_sshl_h,
> +      .opt_opc = sshl_list,
> +      .vece = MO_16 },
> +    { .fni4 = gen_sshl_i32,
> +      .fniv = gen_sshl_vec,
> +      .opt_opc = sshl_list,
> +      .vece = MO_32 },
> +    { .fni8 = gen_sshl_i64,
> +      .fniv = gen_sshl_vec,
> +      .opt_opc = sshl_list,
> +      .vece = MO_64 },
> +};
> +
>  static void gen_uqadd_vec(unsigned vece, TCGv_vec t, TCGv_vec sat,
>                            TCGv_vec a, TCGv_vec b)
>  {
> @@ -4792,6 +5068,12 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>                                    vec_size, vec_size);
>              }
>              return 0;
> +
> +        case NEON_3R_VSHL:
> +            /* Note the operation is vshl vd,vm,vn */
> +            tcg_gen_gvec_3(rd_ofs, rm_ofs, rn_ofs, vec_size, vec_size,
> +                           u ? &ushl_op[size] : &sshl_op[size]);
> +            return 0;
>          }
>
>          if (size == 3) {
> @@ -4800,13 +5082,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>                  neon_load_reg64(cpu_V0, rn + pass);
>                  neon_load_reg64(cpu_V1, rm + pass);
>                  switch (op) {
> -                case NEON_3R_VSHL:
> -                    if (u) {
> -                        gen_helper_neon_shl_u64(cpu_V0, cpu_V1, cpu_V0);
> -                    } else {
> -                        gen_helper_neon_shl_s64(cpu_V0, cpu_V1, cpu_V0);
> -                    }
> -                    break;
>                  case NEON_3R_VQSHL:
>                      if (u) {
>                          gen_helper_neon_qshl_u64(cpu_V0, cpu_env,
> @@ -4841,7 +5116,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>          }
>          pairwise = 0;
>          switch (op) {
> -        case NEON_3R_VSHL:
>          case NEON_3R_VQSHL:
>          case NEON_3R_VRSHL:
>          case NEON_3R_VQRSHL:
> @@ -4921,9 +5195,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>          case NEON_3R_VHSUB:
>              GEN_NEON_INTEGER_OP(hsub);
>              break;
> -        case NEON_3R_VSHL:
> -            GEN_NEON_INTEGER_OP(shl);
> -            break;
>          case NEON_3R_VQSHL:
>              GEN_NEON_INTEGER_OP_ENV(qshl);
>              break;
> @@ -5332,9 +5603,9 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>                              }
>                          } else {
>                              if (input_unsigned) {
> -                                gen_helper_neon_shl_u64(cpu_V0, in, tmp64);
> +                                gen_ushl_i64(cpu_V0, in, tmp64);
>                              } else {
> -                                gen_helper_neon_shl_s64(cpu_V0, in, tmp64);
> +                                gen_sshl_i64(cpu_V0, in, tmp64);
>                              }
>                          }
>                          tmp = tcg_temp_new_i32();
> diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
> index dedef62403..fcb3663903 100644
> --- a/target/arm/vec_helper.c
> +++ b/target/arm/vec_helper.c
> @@ -1046,3 +1046,91 @@ void HELPER(gvec_fmlal_idx_a64)(void *vd, void *vn, void *vm,
>      do_fmlal_idx(vd, vn, vm, &env->vfp.fp_status, desc,
>                   get_flush_inputs_to_zero(&env->vfp.fp_status_f16));
>  }
> +
> +void HELPER(gvec_sshl_b)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    intptr_t i, opr_sz = simd_oprsz(desc);
> +    int8_t *d = vd, *n = vn, *m = vm;
> +
> +    for (i = 0; i < opr_sz; ++i) {
> +        int8_t mm = m[i];
> +        int8_t nn = n[i];
> +        int8_t res = 0;
> +        if (mm >= 0) {
> +            if (mm < 8) {
> +                res = nn << mm;
> +            }
> +        } else {
> +            res = nn >> (mm > -8 ? -mm : 7);
> +        }
> +        d[i] = res;
> +    }
> +    clear_tail(d, opr_sz, simd_maxsz(desc));
> +}
> +
> +void HELPER(gvec_sshl_h)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    intptr_t i, opr_sz = simd_oprsz(desc);
> +    int16_t *d = vd, *n = vn, *m = vm;
> +
> +    for (i = 0; i < opr_sz / 2; ++i) {
> +        int8_t mm = m[i];   /* only 8 bits of shift are significant */
> +        int16_t nn = n[i];
> +        int16_t res = 0;
> +        if (mm >= 0) {
> +            if (mm < 16) {
> +                res = nn << mm;
> +            }
> +        } else {
> +            res = nn >> (mm > -16 ? -mm : 15);
> +        }
> +        d[i] = res;
> +    }
> +    clear_tail(d, opr_sz, simd_maxsz(desc));
> +}
> +
> +void HELPER(gvec_ushl_b)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    intptr_t i, opr_sz = simd_oprsz(desc);
> +    uint8_t *d = vd, *n = vn, *m = vm;
> +
> +    for (i = 0; i < opr_sz; ++i) {
> +        int8_t mm = m[i];
> +        uint8_t nn = n[i];
> +        uint8_t res = 0;
> +        if (mm >= 0) {
> +            if (mm < 8) {
> +                res = nn << mm;
> +            }
> +        } else {
> +            if (mm > -8) {
> +                res = nn >> -mm;
> +            }
> +        }
> +        d[i] = res;
> +    }
> +    clear_tail(d, opr_sz, simd_maxsz(desc));
> +}
> +
> +void HELPER(gvec_ushl_h)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    intptr_t i, opr_sz = simd_oprsz(desc);
> +    uint16_t *d = vd, *n = vn, *m = vm;
> +
> +    for (i = 0; i < opr_sz / 2; ++i) {
> +        int8_t mm = m[i];   /* only 8 bits of shift are significant */
> +        uint16_t nn = n[i];
> +        uint16_t res = 0;
> +        if (mm >= 0) {
> +            if (mm < 16) {
> +                res = nn << mm;
> +            }
> +        } else {
> +            if (mm > -16) {
> +                res = nn >> -mm;
> +            }
> +        }
> +        d[i] = res;
> +    }
> +    clear_tail(d, opr_sz, simd_maxsz(desc));
> +}
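
As a cross-check, here is a scalar C model of what the gen_ushl_i32
sequence above computes (a sketch only, not from the patch; the range
guards stand in for TCG's unspecified-result guarantee, since
out-of-range shifts are undefined behaviour in plain C):

    #include <stdint.h>

    static uint32_t model_ushl_i32(uint32_t a, uint32_t b)
    {
        int32_t lsh = (int8_t)b;                            /* ext8s */
        int32_t rsh = -lsh;                                 /* neg */
        uint32_t lval = (uint32_t)lsh < 32 ? a << lsh : 0;  /* shl */
        uint32_t rval = (uint32_t)rsh < 32 ? a >> rsh : 0;  /* shr */
        uint32_t d = (uint32_t)lsh < 32 ? lval : 0;         /* movcond #1 */
        return (uint32_t)rsh < 32 ? rval : d;               /* movcond #2 */
    }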

Anyway, otherwise it LGTM:

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

--
Alex Bennée



* Re: [PATCH 3/4] target/arm: Convert PMULL.64 to gvec
  2019-10-17  4:42 ` [PATCH 3/4] target/arm: Convert PMULL.64 " Richard Henderson
@ 2019-10-18 12:24   ` Alex Bennée
  2019-10-18 13:40     ` Alex Bennée
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Bennée @ 2019-10-18 12:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> The gvec form will be needed for implementing SVE2.

Hmm, I'm seeing a failure against:

  aarch32-all-v80/insn_VMULL__INC.risu.bin

From:

  https://fileserver.linaro.org/owncloud/index.php/s/hvEXM2eJ3uZVhlH
  https://fileserver.linaro.org/owncloud/index.php/s/hvEXM2eJ3uZVhlH/download?path=%2F&files=aarch32-all-v80.tar.xz

And some others. But this seems to be broken in master as well, so I
don't know whether it is a regression or whether I have the wrong -cpu
in qemu-arm for something recorded on a cortex-a53 in aarch32.

>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  target/arm/helper.h        |  4 +---
>  target/arm/neon_helper.c   | 30 ------------------------------
>  target/arm/translate-a64.c | 28 +++-------------------------
>  target/arm/translate.c     | 16 ++--------------
>  target/arm/vec_helper.c    | 33 +++++++++++++++++++++++++++++++++
>  5 files changed, 39 insertions(+), 72 deletions(-)
>
> diff --git a/target/arm/helper.h b/target/arm/helper.h
> index 800446e537..d954399b7e 100644
> --- a/target/arm/helper.h
> +++ b/target/arm/helper.h
> @@ -555,9 +555,6 @@ DEF_HELPER_FLAGS_3(crc32, TCG_CALL_NO_RWG_SE, i32, i32, i32, i32)
>  DEF_HELPER_FLAGS_3(crc32c, TCG_CALL_NO_RWG_SE, i32, i32, i32, i32)
>  DEF_HELPER_2(dc_zva, void, env, i64)
>
> -DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
> -DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
> -
>  DEF_HELPER_FLAGS_5(gvec_qrdmlah_s16, TCG_CALL_NO_RWG,
>                     void, ptr, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_5(gvec_qrdmlsh_s16, TCG_CALL_NO_RWG,
> @@ -689,6 +686,7 @@ DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>
>  DEF_HELPER_FLAGS_4(gvec_pmul_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_4(gvec_pmull_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>
>  #ifdef TARGET_AARCH64
>  #include "helper-a64.h"
> diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
> index 9e7a9a1ac5..6a107da0e1 100644
> --- a/target/arm/neon_helper.c
> +++ b/target/arm/neon_helper.c
> @@ -2152,33 +2152,3 @@ void HELPER(neon_zip16)(void *vd, void *vm)
>      rm[0] = m0;
>      rd[0] = d0;
>  }
> -
> -/* Helper function for 64 bit polynomial multiply case:
> - * perform PolynomialMult(op1, op2) and return either the top or
> - * bottom half of the 128 bit result.
> - */
> -uint64_t HELPER(neon_pmull_64_lo)(uint64_t op1, uint64_t op2)
> -{
> -    int bitnum;
> -    uint64_t res = 0;
> -
> -    for (bitnum = 0; bitnum < 64; bitnum++) {
> -        if (op1 & (1ULL << bitnum)) {
> -            res ^= op2 << bitnum;
> -        }
> -    }
> -    return res;
> -}
> -uint64_t HELPER(neon_pmull_64_hi)(uint64_t op1, uint64_t op2)
> -{
> -    int bitnum;
> -    uint64_t res = 0;
> -
> -    /* bit 0 of op1 can't influence the high 64 bits at all */
> -    for (bitnum = 1; bitnum < 64; bitnum++) {
> -        if (op1 & (1ULL << bitnum)) {
> -            res ^= op2 >> (64 - bitnum);
> -        }
> -    }
> -    return res;
> -}
> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
> index 04e25cfe06..12588d18df 100644
> --- a/target/arm/translate-a64.c
> +++ b/target/arm/translate-a64.c
> @@ -10598,30 +10598,6 @@ static void handle_3rd_narrowing(DisasContext *s, int is_q, int is_u, int size,
>      clear_vec_high(s, is_q, rd);
>  }
>
> -static void handle_pmull_64(DisasContext *s, int is_q, int rd, int rn, int rm)
> -{
> -    /* PMULL of 64 x 64 -> 128 is an odd special case because it
> -     * is the only three-reg-diff instruction which produces a
> -     * 128-bit wide result from a single operation. However since
> -     * it's possible to calculate the two halves more or less
> -     * separately we just use two helper calls.
> -     */
> -    TCGv_i64 tcg_op1 = tcg_temp_new_i64();
> -    TCGv_i64 tcg_op2 = tcg_temp_new_i64();
> -    TCGv_i64 tcg_res = tcg_temp_new_i64();
> -
> -    read_vec_element(s, tcg_op1, rn, is_q, MO_64);
> -    read_vec_element(s, tcg_op2, rm, is_q, MO_64);
> -    gen_helper_neon_pmull_64_lo(tcg_res, tcg_op1, tcg_op2);
> -    write_vec_element(s, tcg_res, rd, 0, MO_64);
> -    gen_helper_neon_pmull_64_hi(tcg_res, tcg_op1, tcg_op2);
> -    write_vec_element(s, tcg_res, rd, 1, MO_64);
> -
> -    tcg_temp_free_i64(tcg_op1);
> -    tcg_temp_free_i64(tcg_op2);
> -    tcg_temp_free_i64(tcg_res);
> -}
> -
>  /* AdvSIMD three different
>   *   31  30  29 28       24 23  22  21 20  16 15    12 11 10 9    5 4    0
>   * +---+---+---+-----------+------+---+------+--------+-----+------+------+
> @@ -10686,7 +10662,9 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
>              if (!fp_access_check(s)) {
>                  return;
>              }
> -            handle_pmull_64(s, is_q, rd, rn, rm);
> +            /* The Q field specifies lo/hi half input for this insn.  */
> +            gen_gvec_op3_ool(s, true, rd, rn, rm, is_q,
> +                             gen_helper_gvec_pmull_q);
>              return;
>          }
>          goto is_widening;
> diff --git a/target/arm/translate.c b/target/arm/translate.c
> index b66a2f6b71..4e34249672 100644
> --- a/target/arm/translate.c
> +++ b/target/arm/translate.c
> @@ -5877,23 +5877,11 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>                   * outside the loop below as it only performs a single pass.
>                   */
>                  if (op == 14 && size == 2) {
> -                    TCGv_i64 tcg_rn, tcg_rm, tcg_rd;
> -
>                      if (!dc_isar_feature(aa32_pmull, s)) {
>                          return 1;
>                      }
> -                    tcg_rn = tcg_temp_new_i64();
> -                    tcg_rm = tcg_temp_new_i64();
> -                    tcg_rd = tcg_temp_new_i64();
> -                    neon_load_reg64(tcg_rn, rn);
> -                    neon_load_reg64(tcg_rm, rm);
> -                    gen_helper_neon_pmull_64_lo(tcg_rd, tcg_rn, tcg_rm);
> -                    neon_store_reg64(tcg_rd, rd);
> -                    gen_helper_neon_pmull_64_hi(tcg_rd, tcg_rn, tcg_rm);
> -                    neon_store_reg64(tcg_rd, rd + 1);
> -                    tcg_temp_free_i64(tcg_rn);
> -                    tcg_temp_free_i64(tcg_rm);
> -                    tcg_temp_free_i64(tcg_rd);
> +                    tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
> +                                       0, gen_helper_gvec_pmull_q);
>                      return 0;
>                  }
>
> diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
> index d401282c6f..5c1074374e 100644
> --- a/target/arm/vec_helper.c
> +++ b/target/arm/vec_helper.c
> @@ -1164,3 +1164,36 @@ void HELPER(gvec_pmul_b)(void *vd, void *vn, void *vm, uint32_t desc)
>      }
>      clear_tail(d, opr_sz, simd_maxsz(desc));
>  }
> +
> +/*
> + * 64x64->128 polynomial multiply.
> + * Because the lanes are not accessed in strict columns,
> + * this probably cannot be turned into a generic helper.
> + */
> +void HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    intptr_t i, j, opr_sz = simd_oprsz(desc);
> +    intptr_t hi = simd_data(desc);
> +    uint64_t *d = vd, *n = vn, *m = vm;
> +
> +    for (i = 0; i < opr_sz / 8; i += 2) {
> +        uint64_t nn = n[i + hi];
> +        uint64_t mm = m[i + hi];
> +        uint64_t rhi = 0;
> +        uint64_t rlo = 0;
> +
> +        /* Bit 0 can only influence the low 64-bit result.  */
> +        if (nn & 1) {
> +            rlo = mm;
> +        }
> +
> +        for (j = 1; j < 64; ++j) {
> +            uint64_t mask = -((nn >> j) & 1);
> +            rlo ^= (mm << j) & mask;
> +            rhi ^= (mm >> (64 - j)) & mask;
> +        }
> +        d[i] = rlo;
> +        d[i + 1] = rhi;
> +    }
> +    clear_tail(d, opr_sz, simd_maxsz(desc));
> +}


--
Alex Bennée



* Re: [PATCH 3/4] target/arm: Convert PMULL.64 to gvec
  2019-10-18 12:24   ` Alex Bennée
@ 2019-10-18 13:40     ` Alex Bennée
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Bennée @ 2019-10-18 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm


Alex Bennée <alex.bennee@linaro.org> writes:

> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> The gvec form will be needed for implementing SVE2.
>
> Hmm, I'm seeing a failure against:
>
>   aarch32-all-v80/insn_VMULL__INC.risu.bin

I take it back: after monkey-patching cortex-a53 into qemu-arm, it
passes.
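
(For the record, the replay was along these lines -- the risu options
are quoted from memory, so treat them as approximate:

    qemu-arm -cpu cortex-a53 ./risu --trace=insn_VMULL__INC.trace \
        insn_VMULL__INC.risu.bin

with the trace file taken from the tarball above.)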

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>

>
> From:
>
>   https://fileserver.linaro.org/owncloud/index.php/s/hvEXM2eJ3uZVhlH
>   https://fileserver.linaro.org/owncloud/index.php/s/hvEXM2eJ3uZVhlH/download?path=%2F&files=aarch32-all-v80.tar.xz
>
> And some others. But this seems to be broken in master as well, so I
> don't know whether it is a regression or whether I have the wrong -cpu
> in qemu-arm for something recorded on a cortex-a53 in aarch32.
>
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>  target/arm/helper.h        |  4 +---
>>  target/arm/neon_helper.c   | 30 ------------------------------
>>  target/arm/translate-a64.c | 28 +++-------------------------
>>  target/arm/translate.c     | 16 ++--------------
>>  target/arm/vec_helper.c    | 33 +++++++++++++++++++++++++++++++++
>>  5 files changed, 39 insertions(+), 72 deletions(-)
>>
>> diff --git a/target/arm/helper.h b/target/arm/helper.h
>> index 800446e537..d954399b7e 100644
>> --- a/target/arm/helper.h
>> +++ b/target/arm/helper.h
>> @@ -555,9 +555,6 @@ DEF_HELPER_FLAGS_3(crc32, TCG_CALL_NO_RWG_SE, i32, i32, i32, i32)
>>  DEF_HELPER_FLAGS_3(crc32c, TCG_CALL_NO_RWG_SE, i32, i32, i32, i32)
>>  DEF_HELPER_2(dc_zva, void, env, i64)
>>
>> -DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>> -DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>> -
>>  DEF_HELPER_FLAGS_5(gvec_qrdmlah_s16, TCG_CALL_NO_RWG,
>>                     void, ptr, ptr, ptr, ptr, i32)
>>  DEF_HELPER_FLAGS_5(gvec_qrdmlsh_s16, TCG_CALL_NO_RWG,
>> @@ -689,6 +686,7 @@ DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>>  DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>>
>>  DEF_HELPER_FLAGS_4(gvec_pmul_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>> +DEF_HELPER_FLAGS_4(gvec_pmull_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>>
>>  #ifdef TARGET_AARCH64
>>  #include "helper-a64.h"
>> diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
>> index 9e7a9a1ac5..6a107da0e1 100644
>> --- a/target/arm/neon_helper.c
>> +++ b/target/arm/neon_helper.c
>> @@ -2152,33 +2152,3 @@ void HELPER(neon_zip16)(void *vd, void *vm)
>>      rm[0] = m0;
>>      rd[0] = d0;
>>  }
>> -
>> -/* Helper function for 64 bit polynomial multiply case:
>> - * perform PolynomialMult(op1, op2) and return either the top or
>> - * bottom half of the 128 bit result.
>> - */
>> -uint64_t HELPER(neon_pmull_64_lo)(uint64_t op1, uint64_t op2)
>> -{
>> -    int bitnum;
>> -    uint64_t res = 0;
>> -
>> -    for (bitnum = 0; bitnum < 64; bitnum++) {
>> -        if (op1 & (1ULL << bitnum)) {
>> -            res ^= op2 << bitnum;
>> -        }
>> -    }
>> -    return res;
>> -}
>> -uint64_t HELPER(neon_pmull_64_hi)(uint64_t op1, uint64_t op2)
>> -{
>> -    int bitnum;
>> -    uint64_t res = 0;
>> -
>> -    /* bit 0 of op1 can't influence the high 64 bits at all */
>> -    for (bitnum = 1; bitnum < 64; bitnum++) {
>> -        if (op1 & (1ULL << bitnum)) {
>> -            res ^= op2 >> (64 - bitnum);
>> -        }
>> -    }
>> -    return res;
>> -}
>> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
>> index 04e25cfe06..12588d18df 100644
>> --- a/target/arm/translate-a64.c
>> +++ b/target/arm/translate-a64.c
>> @@ -10598,30 +10598,6 @@ static void handle_3rd_narrowing(DisasContext *s, int is_q, int is_u, int size,
>>      clear_vec_high(s, is_q, rd);
>>  }
>>
>> -static void handle_pmull_64(DisasContext *s, int is_q, int rd, int rn, int rm)
>> -{
>> -    /* PMULL of 64 x 64 -> 128 is an odd special case because it
>> -     * is the only three-reg-diff instruction which produces a
>> -     * 128-bit wide result from a single operation. However since
>> -     * it's possible to calculate the two halves more or less
>> -     * separately we just use two helper calls.
>> -     */
>> -    TCGv_i64 tcg_op1 = tcg_temp_new_i64();
>> -    TCGv_i64 tcg_op2 = tcg_temp_new_i64();
>> -    TCGv_i64 tcg_res = tcg_temp_new_i64();
>> -
>> -    read_vec_element(s, tcg_op1, rn, is_q, MO_64);
>> -    read_vec_element(s, tcg_op2, rm, is_q, MO_64);
>> -    gen_helper_neon_pmull_64_lo(tcg_res, tcg_op1, tcg_op2);
>> -    write_vec_element(s, tcg_res, rd, 0, MO_64);
>> -    gen_helper_neon_pmull_64_hi(tcg_res, tcg_op1, tcg_op2);
>> -    write_vec_element(s, tcg_res, rd, 1, MO_64);
>> -
>> -    tcg_temp_free_i64(tcg_op1);
>> -    tcg_temp_free_i64(tcg_op2);
>> -    tcg_temp_free_i64(tcg_res);
>> -}
>> -
>>  /* AdvSIMD three different
>>   *   31  30  29 28       24 23  22  21 20  16 15    12 11 10 9    5 4    0
>>   * +---+---+---+-----------+------+---+------+--------+-----+------+------+
>> @@ -10686,7 +10662,9 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
>>              if (!fp_access_check(s)) {
>>                  return;
>>              }
>> -            handle_pmull_64(s, is_q, rd, rn, rm);
>> +            /* The Q field specifies lo/hi half input for this insn.  */
>> +            gen_gvec_op3_ool(s, true, rd, rn, rm, is_q,
>> +                             gen_helper_gvec_pmull_q);
>>              return;
>>          }
>>          goto is_widening;
>> diff --git a/target/arm/translate.c b/target/arm/translate.c
>> index b66a2f6b71..4e34249672 100644
>> --- a/target/arm/translate.c
>> +++ b/target/arm/translate.c
>> @@ -5877,23 +5877,11 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>>                   * outside the loop below as it only performs a single pass.
>>                   */
>>                  if (op == 14 && size == 2) {
>> -                    TCGv_i64 tcg_rn, tcg_rm, tcg_rd;
>> -
>>                      if (!dc_isar_feature(aa32_pmull, s)) {
>>                          return 1;
>>                      }
>> -                    tcg_rn = tcg_temp_new_i64();
>> -                    tcg_rm = tcg_temp_new_i64();
>> -                    tcg_rd = tcg_temp_new_i64();
>> -                    neon_load_reg64(tcg_rn, rn);
>> -                    neon_load_reg64(tcg_rm, rm);
>> -                    gen_helper_neon_pmull_64_lo(tcg_rd, tcg_rn, tcg_rm);
>> -                    neon_store_reg64(tcg_rd, rd);
>> -                    gen_helper_neon_pmull_64_hi(tcg_rd, tcg_rn, tcg_rm);
>> -                    neon_store_reg64(tcg_rd, rd + 1);
>> -                    tcg_temp_free_i64(tcg_rn);
>> -                    tcg_temp_free_i64(tcg_rm);
>> -                    tcg_temp_free_i64(tcg_rd);
>> +                    tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
>> +                                       0, gen_helper_gvec_pmull_q);
>>                      return 0;
>>                  }
>>
>> diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
>> index d401282c6f..5c1074374e 100644
>> --- a/target/arm/vec_helper.c
>> +++ b/target/arm/vec_helper.c
>> @@ -1164,3 +1164,36 @@ void HELPER(gvec_pmul_b)(void *vd, void *vn, void *vm, uint32_t desc)
>>      }
>>      clear_tail(d, opr_sz, simd_maxsz(desc));
>>  }
>> +
>> +/*
>> + * 64x64->128 polynomial multiply.
>> + * Because the lanes are not accessed in strict columns,
>> + * this probably cannot be turned into a generic helper.
>> + */
>> +void HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t desc)
>> +{
>> +    intptr_t i, j, opr_sz = simd_oprsz(desc);
>> +    intptr_t hi = simd_data(desc);
>> +    uint64_t *d = vd, *n = vn, *m = vm;
>> +
>> +    for (i = 0; i < opr_sz / 8; i += 2) {
>> +        uint64_t nn = n[i + hi];
>> +        uint64_t mm = m[i + hi];
>> +        uint64_t rhi = 0;
>> +        uint64_t rlo = 0;
>> +
>> +        /* Bit 0 can only influence the low 64-bit result.  */
>> +        if (nn & 1) {
>> +            rlo = mm;
>> +        }
>> +
>> +        for (j = 1; j < 64; ++j) {
>> +            uint64_t mask = -((nn >> j) & 1);
>> +            rlo ^= (mm << j) & mask;
>> +            rhi ^= (mm >> (64 - j)) & mask;
>> +        }
>> +        d[i] = rlo;
>> +        d[i + 1] = rhi;
>> +    }
>> +    clear_tail(d, opr_sz, simd_maxsz(desc));
>> +}


--
Alex Bennée



* Re: [PATCH 2/4] target/arm: Convert PMUL.8 to gvec
  2019-10-17  4:42 ` [PATCH 2/4] target/arm: Convert PMUL.8 to gvec Richard Henderson
@ 2019-10-18 13:40   ` Alex Bennée
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Bennée @ 2019-10-18 13:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: peter.maydell, qemu-arm


Richard Henderson <richard.henderson@linaro.org> writes:

> The gvec form will be needed for implementing SVE2.
>
> Extend the implementation to operate on uint64_t instead of uint32_t.
> Use a counted inner loop instead of terminating when op1 goes to zero,
> looking toward the required implementation for ARMv8.4-DIT.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  target/arm/helper.h        |  3 ++-
>  target/arm/neon_helper.c   | 22 ----------------------
>  target/arm/translate-a64.c | 10 +++-------
>  target/arm/translate.c     | 11 ++++-------
>  target/arm/vec_helper.c    | 30 ++++++++++++++++++++++++++++++
>  5 files changed, 39 insertions(+), 37 deletions(-)
>
> diff --git a/target/arm/helper.h b/target/arm/helper.h
> index fc0d594a14..800446e537 100644
> --- a/target/arm/helper.h
> +++ b/target/arm/helper.h
> @@ -335,7 +335,6 @@ DEF_HELPER_2(neon_sub_u8, i32, i32, i32)
>  DEF_HELPER_2(neon_sub_u16, i32, i32, i32)
>  DEF_HELPER_2(neon_mul_u8, i32, i32, i32)
>  DEF_HELPER_2(neon_mul_u16, i32, i32, i32)
> -DEF_HELPER_2(neon_mul_p8, i32, i32, i32)
>  DEF_HELPER_2(neon_mull_p8, i64, i32, i32)
>
>  DEF_HELPER_2(neon_tst_u8, i32, i32, i32)
> @@ -689,6 +688,8 @@ DEF_HELPER_FLAGS_4(gvec_sshl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_4(gvec_ushl_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>
> +DEF_HELPER_FLAGS_4(gvec_pmul_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
>  #ifdef TARGET_AARCH64
>  #include "helper-a64.h"
>  #include "helper-sve.h"
> diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
> index c581ffb7d3..9e7a9a1ac5 100644
> --- a/target/arm/neon_helper.c
> +++ b/target/arm/neon_helper.c
> @@ -1131,28 +1131,6 @@ NEON_VOP(mul_u16, neon_u16, 2)
>
>  /* Polynomial multiplication is like integer multiplication except the
>     partial products are XORed, not added.  */
> -uint32_t HELPER(neon_mul_p8)(uint32_t op1, uint32_t op2)
> -{
> -    uint32_t mask;
> -    uint32_t result;
> -    result = 0;
> -    while (op1) {
> -        mask = 0;
> -        if (op1 & 1)
> -            mask |= 0xff;
> -        if (op1 & (1 << 8))
> -            mask |= (0xff << 8);
> -        if (op1 & (1 << 16))
> -            mask |= (0xff << 16);
> -        if (op1 & (1 << 24))
> -            mask |= (0xff << 24);
> -        result ^= op2 & mask;
> -        op1 = (op1 >> 1) & 0x7f7f7f7f;
> -        op2 = (op2 << 1) & 0xfefefefe;
> -    }
> -    return result;
> -}
> -
>  uint64_t HELPER(neon_mull_p8)(uint32_t op1, uint32_t op2)
>  {
>      uint64_t result = 0;
> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
> index 255a168df6..04e25cfe06 100644
> --- a/target/arm/translate-a64.c
> +++ b/target/arm/translate-a64.c
> @@ -11110,9 +11110,10 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
>      case 0x13: /* MUL, PMUL */
>          if (!u) { /* MUL */
>              gen_gvec_fn3(s, is_q, rd, rn, rm, tcg_gen_gvec_mul, size);
> -            return;
> +        } else {  /* PMUL */
> +            gen_gvec_op3_ool(s, is_q, rd, rn, rm, 0, gen_helper_gvec_pmul_b);
>          }
> -        break;
> +        return;
>      case 0x12: /* MLA, MLS */
>          if (u) {
>              gen_gvec_op3(s, is_q, rd, rn, rm, &mls_op[size]);
> @@ -11242,11 +11243,6 @@ static void disas_simd_3same_int(DisasContext *s, uint32_t insn)
>                  genfn = fns[size][u];
>                  break;
>              }
> -            case 0x13: /* MUL, PMUL */
> -                assert(u); /* PMUL */
> -                assert(size == 0);
> -                genfn = gen_helper_neon_mul_p8;
> -                break;
>              case 0x16: /* SQDMULH, SQRDMULH */
>              {
>                  static NeonGenTwoOpEnvFn * const fns[2][2] = {
> diff --git a/target/arm/translate.c b/target/arm/translate.c
> index 598bb1cc00..b66a2f6b71 100644
> --- a/target/arm/translate.c
> +++ b/target/arm/translate.c
> @@ -5014,16 +5014,17 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>
>          case NEON_3R_VMUL: /* VMUL */
>              if (u) {
> -                /* Polynomial case allows only P8 and is handled below.  */
> +                /* Polynomial case allows only P8.  */
>                  if (size != 0) {
>                      return 1;
>                  }
> +                tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, vec_size, vec_size,
> +                                   0, gen_helper_gvec_pmul_b);
>              } else {
>                  tcg_gen_gvec_mul(size, rd_ofs, rn_ofs, rm_ofs,
>                                   vec_size, vec_size);
> -                return 0;
>              }
> -            break;
> +            return 0;
>
>          case NEON_3R_VML: /* VMLA, VMLS */
>              tcg_gen_gvec_3(rd_ofs, rn_ofs, rm_ofs, vec_size, vec_size,
> @@ -5213,10 +5214,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>              tmp2 = neon_load_reg(rd, pass);
>              gen_neon_add(size, tmp, tmp2);
>              break;
> -        case NEON_3R_VMUL:
> -            /* VMUL.P8; other cases already eliminated.  */
> -            gen_helper_neon_mul_p8(tmp, tmp, tmp2);
> -            break;
>          case NEON_3R_VPMAX:
>              GEN_NEON_INTEGER_OP(pmax);
>              break;
> diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
> index fcb3663903..d401282c6f 100644
> --- a/target/arm/vec_helper.c
> +++ b/target/arm/vec_helper.c
> @@ -1134,3 +1134,33 @@ void HELPER(gvec_ushl_h)(void *vd, void *vn, void *vm, uint32_t desc)
>      }
>      clear_tail(d, opr_sz, simd_maxsz(desc));
>  }
> +
> +/*
> + * 8x8->8 polynomial multiply.
> + *
> + * Polynomial multiplication is like integer multiplication except the
> + * partial products are XORed, not added.
> + *
> + * TODO: expose this as a generic vector operation, as it is a common
> + * crypto building block.
> + */
> +void HELPER(gvec_pmul_b)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    intptr_t i, j, opr_sz = simd_oprsz(desc);
> +    uint64_t *d = vd, *n = vn, *m = vm;
> +
> +    for (i = 0; i < opr_sz / 8; ++i) {
> +        uint64_t nn = n[i];
> +        uint64_t mm = m[i];
> +        uint64_t rr = 0;
> +
> +        for (j = 0; j < 8; ++j) {
> +            uint64_t mask = (nn & 0x0101010101010101ull) * 0xff;
> +            rr ^= mm & mask;
> +            mm = (mm << 1) & 0xfefefefefefefefeull;
> +            nn = (nn >> 1) & 0x7f7f7f7f7f7f7f7full;
> +        }
> +        d[i] = rr;
> +    }
> +    clear_tail(d, opr_sz, simd_maxsz(desc));
> +}
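
To make the XOR-partial-products rule concrete, a minimal scalar model
with a worked value (not from the patch; numbers chosen arbitrarily):

    #include <stdint.h>

    static uint8_t pmul8(uint8_t n, uint8_t m)
    {
        uint8_t r = 0;
        for (int i = 0; i < 8; i++) {
            if (n & (1 << i)) {
                r ^= (uint8_t)(m << i);   /* XOR the partial product */
            }
        }
        return r;
    }

    /* pmul8(0x13, 0x06) == 0x6a: the partial products 0x06, 0x0c and
       0x60 are XORed (giving 0x6a) rather than added (0x13 * 0x06 ==
       0x72). */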


--
Alex Bennée



* Re: [PATCH 1/4] target/arm: Vectorize USHL and SSHL
  2019-10-17 16:01   ` Alex Bennée
@ 2019-10-18 14:47     ` Richard Henderson
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2019-10-18 14:47 UTC (permalink / raw)
  To: Alex Bennée, qemu-devel; +Cc: peter.maydell, qemu-arm

On 10/17/19 9:01 AM, Alex Bennée wrote:
>> +    /*
>> +     * Rely on the TCG guarantee that out of range shifts produce
>> +     * unspecified results, not undefined behaviour (i.e. no trap).
>> +     * Discard out-of-range results after the fact.
>> +     */
>> +    tcg_gen_ext8s_i32(lsh, b);
>> +    tcg_gen_neg_i32(rsh, lsh);
>> +    tcg_gen_shl_i32(lval, a, lsh);
>> +    tcg_gen_shr_i32(rval, a, rsh);
>> +    tcg_gen_movcond_i32(TCG_COND_LTU, d, lsh, max, lval, zero);
>> +    tcg_gen_movcond_i32(TCG_COND_LTU, d, rsh, max, rval, d);
> 
> Do these get dead-coded away if the shift is a const?

They would, although because of the form of the instruction, as a variable
shift from a vector element, I don't expect it to ever be const.  We will have
just read the value from env memory.
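
(With a constant b -- say 3 -- both movcond conditions fold to known
values and the whole sequence should collapse to roughly

    tcg_gen_shli_i32(d, a, 3);

but, as above, I don't expect that case to arise in practice.)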


r~



* Re: [PATCH 4/4] target/arm: Convert PMULL.8 to gvec
  2019-10-17  4:42 ` [PATCH 4/4] target/arm: Convert PMULL.8 " Richard Henderson
@ 2019-10-18 17:54   ` Alex Bennée
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Bennée @ 2019-10-18 17:54 UTC (permalink / raw)
  To: qemu-arm; +Cc: peter.maydell, qemu-devel


Richard Henderson <richard.henderson@linaro.org> writes:

> We still need two different helpers, since NEON and SVE2 get the
> inputs from different locations within the source vector.  However,
> we can convert both to the same internal form for computation.
>
> The sve2 helper is not used yet, but adding it with this patch
> helps illustrate why the neon changes are helpful.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  target/arm/helper-sve.h    |  2 ++
>  target/arm/helper.h        |  3 +-
>  target/arm/neon_helper.c   | 32 --------------------
>  target/arm/translate-a64.c | 27 +++++++++++------
>  target/arm/translate.c     | 26 ++++++++---------
>  target/arm/vec_helper.c    | 60 ++++++++++++++++++++++++++++++++++++++
>  6 files changed, 95 insertions(+), 55 deletions(-)
>
> diff --git a/target/arm/helper-sve.h b/target/arm/helper-sve.h
> index 9e79182ab4..2f47279155 100644
> --- a/target/arm/helper-sve.h
> +++ b/target/arm/helper-sve.h
> @@ -1574,3 +1574,5 @@ DEF_HELPER_FLAGS_6(sve_stdd_le_zd, TCG_CALL_NO_WG,
>                     void, env, ptr, ptr, ptr, tl, i32)
>  DEF_HELPER_FLAGS_6(sve_stdd_be_zd, TCG_CALL_NO_WG,
>                     void, env, ptr, ptr, ptr, tl, i32)
> +
> +DEF_HELPER_FLAGS_4(sve2_pmull_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> diff --git a/target/arm/helper.h b/target/arm/helper.h
> index d954399b7e..8a8517cf34 100644
> --- a/target/arm/helper.h
> +++ b/target/arm/helper.h
> @@ -335,7 +335,6 @@ DEF_HELPER_2(neon_sub_u8, i32, i32, i32)
>  DEF_HELPER_2(neon_sub_u16, i32, i32, i32)
>  DEF_HELPER_2(neon_mul_u8, i32, i32, i32)
>  DEF_HELPER_2(neon_mul_u16, i32, i32, i32)
> -DEF_HELPER_2(neon_mull_p8, i64, i32, i32)
>
>  DEF_HELPER_2(neon_tst_u8, i32, i32, i32)
>  DEF_HELPER_2(neon_tst_u16, i32, i32, i32)
> @@ -688,6 +687,8 @@ DEF_HELPER_FLAGS_4(gvec_ushl_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_4(gvec_pmul_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>  DEF_HELPER_FLAGS_4(gvec_pmull_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
>
> +DEF_HELPER_FLAGS_4(neon_pmull_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
> +
>  #ifdef TARGET_AARCH64
>  #include "helper-a64.h"
>  #include "helper-sve.h"
> diff --git a/target/arm/neon_helper.c b/target/arm/neon_helper.c
> index 6a107da0e1..c7a8438b42 100644
> --- a/target/arm/neon_helper.c
> +++ b/target/arm/neon_helper.c
> @@ -1129,38 +1129,6 @@ NEON_VOP(mul_u8, neon_u8, 4)
>  NEON_VOP(mul_u16, neon_u16, 2)
>  #undef NEON_FN
>
> -/* Polynomial multiplication is like integer multiplication except the
> -   partial products are XORed, not added.  */
> -uint64_t HELPER(neon_mull_p8)(uint32_t op1, uint32_t op2)
> -{
> -    uint64_t result = 0;
> -    uint64_t mask;
> -    uint64_t op2ex = op2;
> -    op2ex = (op2ex & 0xff) |
> -        ((op2ex & 0xff00) << 8) |
> -        ((op2ex & 0xff0000) << 16) |
> -        ((op2ex & 0xff000000) << 24);
> -    while (op1) {
> -        mask = 0;
> -        if (op1 & 1) {
> -            mask |= 0xffff;
> -        }
> -        if (op1 & (1 << 8)) {
> -            mask |= (0xffffU << 16);
> -        }
> -        if (op1 & (1 << 16)) {
> -            mask |= (0xffffULL << 32);
> -        }
> -        if (op1 & (1 << 24)) {
> -            mask |= (0xffffULL << 48);
> -        }
> -        result ^= op2ex & mask;
> -        op1 = (op1 >> 1) & 0x7f7f7f7f;
> -        op2ex <<= 1;
> -    }
> -    return result;
> -}
> -
>  #define NEON_FN(dest, src1, src2) dest = (src1 & src2) ? -1 : 0
>  NEON_VOP(tst_u8, neon_u8, 4)
>  NEON_VOP(tst_u16, neon_u16, 2)
> diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
> index 12588d18df..2934e4fc16 100644
> --- a/target/arm/translate-a64.c
> +++ b/target/arm/translate-a64.c
> @@ -10483,10 +10483,6 @@ static void handle_3rd_widening(DisasContext *s, int is_q, int is_u, int size,
>                  gen_helper_neon_addl_saturate_s32(tcg_passres, cpu_env,
>                                                    tcg_passres, tcg_passres);
>                  break;
> -            case 14: /* PMULL */
> -                assert(size == 0);
> -                gen_helper_neon_mull_p8(tcg_passres, tcg_op1, tcg_op2);
> -                break;
>              default:
>                  g_assert_not_reached();
>              }
> @@ -10650,11 +10646,21 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
>          handle_3rd_narrowing(s, is_q, is_u, size, opcode, rd, rn, rm);
>          break;
>      case 14: /* PMULL, PMULL2 */
> -        if (is_u || size == 1 || size == 2) {
> +        if (is_u) {
>              unallocated_encoding(s);
>              return;
>          }
> -        if (size == 3) {
> +        switch (size) {
> +        case 0: /* PMULL.P8 */
> +            if (!fp_access_check(s)) {
> +                return;
> +            }
> +            /* The Q field specifies lo/hi half input for this insn.  */
> +            gen_gvec_op3_ool(s, true, rd, rn, rm, is_q,
> +                             gen_helper_neon_pmull_h);
> +            break;
> +
> +        case 3: /* PMULL.P64 */
>              if (!dc_isar_feature(aa64_pmull, s)) {
>                  unallocated_encoding(s);
>                  return;
> @@ -10665,9 +10671,13 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
>              /* The Q field specifies lo/hi half input for this insn.  */
>              gen_gvec_op3_ool(s, true, rd, rn, rm, is_q,
>                               gen_helper_gvec_pmull_q);
> -            return;
> +            break;
> +
> +        default:
> +            unallocated_encoding(s);
> +            break;
>          }
> -        goto is_widening;
> +        return;
>      case 9: /* SQDMLAL, SQDMLAL2 */
>      case 11: /* SQDMLSL, SQDMLSL2 */
>      case 13: /* SQDMULL, SQDMULL2 */
> @@ -10688,7 +10698,6 @@ static void disas_simd_three_reg_diff(DisasContext *s, uint32_t insn)
>              unallocated_encoding(s);
>              return;
>          }
> -    is_widening:
>          if (!fp_access_check(s)) {
>              return;
>          }
> diff --git a/target/arm/translate.c b/target/arm/translate.c
> index 4e34249672..c3abf130cc 100644
> --- a/target/arm/translate.c
> +++ b/target/arm/translate.c
> @@ -5873,15 +5873,20 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>                      return 1;
>                  }
>
> -                /* Handle VMULL.P64 (Polynomial 64x64 to 128 bit multiply)
> -                 * outside the loop below as it only performs a single pass.
> -                 */
> -                if (op == 14 && size == 2) {
> -                    if (!dc_isar_feature(aa32_pmull, s)) {
> -                        return 1;
> +                /* Handle polynomial VMULL in a single pass.  */
> +                if (op == 14) {
> +                    if (size == 0) {
> +                        /* VMULL.P8 */
> +                        tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
> +                                           0, gen_helper_neon_pmull_h);
> +                    } else {
> +                        /* VMULL.P64 */
> +                        if (!dc_isar_feature(aa32_pmull, s)) {
> +                            return 1;
> +                        }
> +                        tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
> +                                           0, gen_helper_gvec_pmull_q);
>                      }
> -                    tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, 16, 16,
> -                                       0, gen_helper_gvec_pmull_q);
>                      return 0;
>                  }
>
> @@ -5959,11 +5964,6 @@ static int disas_neon_data_insn(DisasContext *s, uint32_t insn)
>                          /* VMLAL, VQDMLAL, VMLSL, VQDMLSL, VMULL, VQDMULL */
>                          gen_neon_mull(cpu_V0, tmp, tmp2, size, u);
>                          break;
> -                    case 14: /* Polynomial VMULL */
> -                        gen_helper_neon_mull_p8(cpu_V0, tmp, tmp2);
> -                        tcg_temp_free_i32(tmp2);
> -                        tcg_temp_free_i32(tmp);
> -                        break;
>                      default: /* 15 is RESERVED: caught earlier  */
>                          abort();
>                      }
> diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
> index 5c1074374e..04b4d7402d 100644
> --- a/target/arm/vec_helper.c
> +++ b/target/arm/vec_helper.c
> @@ -1197,3 +1197,63 @@ void HELPER(gvec_pmull_q)(void *vd, void *vn, void *vm, uint32_t desc)
>      }
>      clear_tail(d, opr_sz, simd_maxsz(desc));
>  }
> +
> +/*
> + * 8x8->16 polynomial multiply.
> + *
> + * The byte inputs are expanded to (or extracted from) half-words.
> + * Note that neon and sve2 get the inputs from different positions.
> + * This allows 4 bytes to be processed in parallel with uint64_t.
> + */
> +
> +static uint64_t expand_byte_to_half(uint64_t x)
> +{
> +    return  (x & 0x000000ff)
> +         | ((x & 0x0000ff00) << 8)
> +         | ((x & 0x00ff0000) << 16)
> +         | ((x & 0xff000000) << 24);
> +}
> +
> +static uint64_t pmull_h(uint64_t op1, uint64_t op2)
> +{
> +    uint64_t result = 0;
> +    int i;
> +
> +    for (i = 0; i < 8; ++i) {
> +        uint64_t mask = (op1 & 0x0001000100010001ull) * 0xffff;
> +        result ^= op2 & mask;
> +        op1 >>= 1;
> +        op2 <<= 1;
> +    }
> +    return result;
> +}
> +
> +void HELPER(neon_pmull_h)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    int hi = simd_data(desc);
> +    uint64_t *d = vd, *n = vn, *m = vm;
> +    uint64_t nn = n[hi], mm = m[hi];
> +
> +    d[0] = pmull_h(expand_byte_to_half(nn), expand_byte_to_half(mm));
> +    nn >>= 32;
> +    mm >>= 32;
> +    d[1] = pmull_h(expand_byte_to_half(nn), expand_byte_to_half(mm));
> +
> +    clear_tail(d, 16, simd_maxsz(desc));
> +}
> +
> +#ifdef TARGET_AARCH64
> +void HELPER(sve2_pmull_h)(void *vd, void *vn, void *vm, uint32_t desc)
> +{
> +    int shift = simd_data(desc) * 8;
> +    intptr_t i, opr_sz = simd_oprsz(desc);
> +    uint64_t *d = vd, *n = vn, *m = vm;
> +
> +    for (i = 0; i < opr_sz / 8; ++i) {
> +        uint64_t nn = (n[i] >> shift) & 0x00ff00ff00ff00ffull;
> +        uint64_t mm = (m[i] >> shift) & 0x00ff00ff00ff00ffull;
> +
> +        d[i] = pmull_h(nn, mm);
> +    }
> +}
> +#endif
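
As a quick sanity value for the expansion step (worked by hand,
illustrative only):

    assert(expand_byte_to_half(0x8899aabb) == 0x0088009900aa00bbull);

i.e. each input byte gets a 16-bit column of its own, so the four
8x8->16 products formed in pmull_h() cannot overlap within the
uint64_t.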


--
Alex Bennée



* Re: [PATCH 0/4] target/arm vector improvements
  2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
                   ` (4 preceding siblings ...)
  2019-10-17  5:21 ` [PATCH 0/4] target/arm vector improvements no-reply
@ 2019-10-18 17:58 ` Alex Bennée
  2019-11-18 16:26 ` Peter Maydell
  6 siblings, 0 replies; 15+ messages in thread
From: Alex Bennée @ 2019-10-18 17:58 UTC (permalink / raw)
  To: qemu-arm; +Cc: peter.maydell, qemu-devel


Richard Henderson <richard.henderson@linaro.org> writes:

> The first patch has been seen before.
>
>   https://patchwork.ozlabs.org/patch/1115039/
>
> It had a bug and I didn't fix it right away and then forgot.
> Fixed now; I had mixed up the operand ordering for aarch32.
>
> The next 3 are something that I noticed while doing other stuff.
>
> In particular, pmull is used heavily during https transfers.
> While cloning a repository, the old code peaks at 27% of the
> total runtime, as measured by perf top.  The new code does
> not quite reach 3% repeating the same clone.
>
> In addition, the new helper functions are in the form that
> will be required for the implementation of SVE2.
>
> The comment in patch 2 about ARMv8.4-DIT is perhaps a stretch,
> but re-reading the pmull instruction description in the current
> ARM ARM brought it to mind.
>
> Since TCG is officially not in the security domain, it's
> probably not a bug to just claim to support DIT without
> actually doing anything to ensure the algorithms used are in
> fact timing independent of the data.
>
> On the other hand, I expect the bit distribution of stuff
> going through these sort of hashing algorithms to approach
> 50% 1's and 0's, so I also don't think we gain anything on
> average to terminate the loop early.
>
> Thoughts on DIT specifically?

It would be an interesting academic exercise to see if someone could
exploit the JIT from the guest, but as you say TCG is not in the
security domain, so....
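
(As an aside, the early loop termination the cover letter dismisses
would be a one-line change to pmull_h if anyone ever wanted to measure
it.  A hypothetical variant -- not in the series -- replacing the fixed
loop:

    /*
     * Hypothetical early-exit variant, NOT the patch's code: stop as
     * soon as no op1 bits remain, since further iterations XOR in
     * nothing.  The iteration count (0..8) then depends on the data,
     * which is exactly what DIT is meant to rule out -- and with
     * anything like random input it almost never fires, since every
     * lane above the first must be all-zero before the count drops
     * below 8.
     */
    for (i = 0; i < 8 && op1 != 0; ++i) {
        uint64_t mask = (op1 & 0x0001000100010001ull) * 0xffff;
        result ^= op2 & mask;
        op1 >>= 1;
        op2 <<= 1;
    }

which rather supports the point that there is nothing to gain on
average.)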

>
>
> r~
>
>
> Richard Henderson (4):
>   target/arm: Vectorize USHL and SSHL
>   target/arm: Convert PMUL.8 to gvec
>   target/arm: Convert PMULL.64 to gvec
>   target/arm: Convert PMULL.8 to gvec
>
>  target/arm/helper-sve.h    |   2 +
>  target/arm/helper.h        |  21 ++-
>  target/arm/translate.h     |   6 +
>  target/arm/neon_helper.c   | 117 -------------
>  target/arm/translate-a64.c |  83 ++++-----
>  target/arm/translate.c     | 350 ++++++++++++++++++++++++++++++++-----
>  target/arm/vec_helper.c    | 211 ++++++++++++++++++++++
>  7 files changed, 562 insertions(+), 228 deletions(-)


--
Alex Bennée


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] target/arm vector improvements
  2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
                   ` (5 preceding siblings ...)
  2019-10-18 17:58 ` Alex Bennée
@ 2019-11-18 16:26 ` Peter Maydell
  2019-11-18 20:05   ` Richard Henderson
  6 siblings, 1 reply; 15+ messages in thread
From: Peter Maydell @ 2019-11-18 16:26 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Thu, 17 Oct 2019 at 05:42, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> The first patch has been seen before.
>
>   https://patchwork.ozlabs.org/patch/1115039/
>
> It had a bug and I didn't fix it right away and then forgot.
> Fixed now; I had mixed up the operand ordering for aarch32.

Since Alex had a nit on patch 1 I'm going to assume you'll
respin this series once we reopen for 5.0 development.

> The comment in patch 2 about ARMv8.4-DIT is perhaps a stretch,
> but re-reading the pmull instruction description in the current
> ARM ARM brought it to mind.
>
> Since TCG is officially not in the security domain, it's
> probably not a bug to just claim to support DIT without
> actually doing anything to ensure the algorithms used are in
> fact timing independent of the data.
>
> On the other hand, I expect the bit distribution of stuff
> going through these sort of hashing algorithms to approach
> 50% 1's and 0's, so I also don't think we gain anything on
> average to terminate the loop early.
>
> Thoughts on DIT specifically?

I think we have two plausible choices for DIT:
 (1) don't implement it, since we can't make the timing
     guarantees it suggests
 (2) implement it but not actually change our behaviour:
     this is technically a lie to the guest, but it means
     that guest OS handling of the PSTATE.DIT bit etc can
     be tested

At the moment we do (1) by default. I think we should
probably postpone doing (2) until somebody actually
asks us for it.
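
For concreteness, (2) would probably amount to little more than
advertising the feature in the ID registers, something like the sketch
below.  Hypothetical only: it assumes cpu.h grows a
FIELD(ID_AA64PFR0, DIT, 48, 4) definition, which is not in the tree
today; the bit position is from the ARM ARM.

    /*
     * Sketch for aarch64_max_initfn() in target/arm/cpu64.c: claim
     * ARMv8.4-DIT without changing any behaviour.  Assumes the
     * FIELD(ID_AA64PFR0, DIT, 48, 4) definition mentioned above,
     * matching the architected ID_AA64PFR0_EL1.DIT field at [51:48].
     */
    uint64_t t = cpu->isar.id_aa64pfr0;
    t = FIELD_DP64(t, ID_AA64PFR0, DIT, 1);
    cpu->isar.id_aa64pfr0 = t;

plus whatever is needed to make PSTATE.DIT an accessible read/write
bit rather than RES0.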

thanks
-- PMM


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] target/arm vector improvements
  2019-11-18 16:26 ` Peter Maydell
@ 2019-11-18 20:05   ` Richard Henderson
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2019-11-18 20:05 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 11/18/19 5:26 PM, Peter Maydell wrote:
> On Thu, 17 Oct 2019 at 05:42, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> The first patch has been seen before.
>>
>>   https://patchwork.ozlabs.org/patch/1115039/
>>
>> It had a bug and I didn't fix it right away and then forgot.
>> Fixed now; I had mixed up the operand ordering for aarch32.
> 
> Since Alex had a nit on patch 1 I'm going to assume you'll
> respin this series once we reopen for 5.0 development.

Oops, I forgot about it.  Yes, we can postpone to 5.0.

> I think we have two plausible choices for DIT:
>  (1) don't implement it, since we can't make the timing
>      guarantees it suggests
>  (2) implement it but not actually change our behaviour:
>      this is technically a lie to the guest, but it means
>      that guest OS handling of the PSTATE.DIT bit etc can
>      be tested
> 
> At the moment we do (1) by default. I think we should
> probably postpone doing (2) until somebody actually
> asks us for it.

Fair enough.


r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-11-18 20:06 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-17  4:42 [PATCH 0/4] target/arm vector improvements Richard Henderson
2019-10-17  4:42 ` [PATCH 1/4] target/arm: Vectorize USHL and SSHL Richard Henderson
2019-10-17 16:01   ` Alex Bennée
2019-10-18 14:47     ` Richard Henderson
2019-10-17  4:42 ` [PATCH 2/4] target/arm: Convert PMUL.8 to gvec Richard Henderson
2019-10-18 13:40   ` Alex Bennée
2019-10-17  4:42 ` [PATCH 3/4] target/arm: Convert PMULL.64 " Richard Henderson
2019-10-18 12:24   ` Alex Bennée
2019-10-18 13:40     ` Alex Bennée
2019-10-17  4:42 ` [PATCH 4/4] target/arm: Convert PMULL.8 " Richard Henderson
2019-10-18 17:54   ` Alex Bennée
2019-10-17  5:21 ` [PATCH 0/4] target/arm vector improvements no-reply
2019-10-18 17:58 ` Alex Bennée
2019-11-18 16:26 ` Peter Maydell
2019-11-18 20:05   ` Richard Henderson
