All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] target/arm: Use TCG vector ops for MVE
@ 2021-09-02 15:09 Peter Maydell
  2021-09-02 15:09 ` [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Peter Maydell @ 2021-09-02 15:09 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

This patchset uses the TCG vector ops for some MVE
instructions. We can only do this when we know that none
of the MVE lanes are predicated, ie when neither tail
predication nor VPT predication nor ECI partial insn
execution are happening.

Patch 1 adds a TB flag so we can track whether we can
safely assume that the insn operates on the entire vector,
and patches 2, 3, 4 use that to vectorize some simple
cases (bitwise logic ops, integer add, sub, mul, min, max,
abs and neg).

This small initial patchset is intended as a check that
the general structure for handling this makes sense;
there are almost certainly other places we could improve
the codegen for the non-predicated case.

Richard: if you have time to scan through the MVE insns
and suggest which ones would benefit from a vectorized
version that would be very helpful...

thanks
-- PMM

Peter Maydell (4):
  target/arm: Add TB flag for "MVE insns not predicated"
  target/arm: Optimize MVE logic ops
  target/arm: Optimize MVE arithmetic ops
  target/arm: Optimize MVE VNEG, VABS

 target/arm/cpu.h              |   4 +-
 target/arm/translate.h        |   2 +
 target/arm/helper.c           |  23 ++++++++
 target/arm/translate-m-nocp.c |   8 +++
 target/arm/translate-mve.c    | 102 ++++++++++++++++++++++------------
 target/arm/translate-vfp.c    |   3 +
 target/arm/translate.c        |   4 ++
 7 files changed, 110 insertions(+), 36 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-02 15:09 [PATCH 0/4] target/arm: Use TCG vector ops for MVE Peter Maydell
@ 2021-09-02 15:09 ` Peter Maydell
  2021-09-03 13:57   ` Richard Henderson
  2021-09-02 15:09 ` [PATCH 2/4] target/arm: Optimize MVE logic ops Peter Maydell
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2021-09-02 15:09 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Our current codegen for MVE always calls out to helper functions,
because some byte lanes might be predicated. The common case is
that in fact there is no predication active and all lanes should
be updated together, so we can produce better code by detecting
that and using the TCG generic vector infrastructure.

Add a TB flag that is set when we can guarantee that there
is no active MVE predication, and a bool in the DisasContext.
Subsequent patches will use this flag to generate improved
code for some instructions.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
Another of those awkward-to-review patches where the interesting
question is "did I miss anywhere where we set v7m.vpr to non-0
or v7m.ltpsize to non-4 while in the middle of a TB ?" ...

I've left the TB flag as non-cached because there seemed to
be an irritatingly large set of places where we'd have to do
an hflags recomputation.
---
 target/arm/cpu.h              |  4 +++-
 target/arm/translate.h        |  2 ++
 target/arm/helper.c           | 23 +++++++++++++++++++++++
 target/arm/translate-m-nocp.c |  8 ++++++++
 target/arm/translate-mve.c    |  9 ++++++++-
 target/arm/translate-vfp.c    |  3 +++
 target/arm/translate.c        |  4 ++++
 7 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 6a987f65e41..460d611e65e 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -3440,7 +3440,7 @@ typedef ARMCPU ArchCPU;
  * | TBFLAG_AM32 |          +-----+----------+
  * |             |                |TBFLAG_M32|
  * +-------------+----------------+----------+
- *  31         23                5 4        0
+ *  31         23                6 5        0
  *
  * Unless otherwise noted, these bits are cached in env->hflags.
  */
@@ -3497,6 +3497,8 @@ FIELD(TBFLAG_M32, LSPACT, 2, 1)                 /* Not cached. */
 FIELD(TBFLAG_M32, NEW_FP_CTXT_NEEDED, 3, 1)     /* Not cached. */
 /* Set if FPCCR.S does not match current security state */
 FIELD(TBFLAG_M32, FPCCR_S_WRONG, 4, 1)          /* Not cached. */
+/* Set if MVE insns are definitely not predicated */
+FIELD(TBFLAG_M32, MVE_NO_PRED, 5, 1)            /* Not cached. */
 
 /*
  * Bit usage when in AArch64 state
diff --git a/target/arm/translate.h b/target/arm/translate.h
index 8636c20c3b4..908b8ded8ee 100644
--- a/target/arm/translate.h
+++ b/target/arm/translate.h
@@ -98,6 +98,8 @@ typedef struct DisasContext {
     bool hstr_active;
     /* True if memory operations require alignment */
     bool align_mem;
+    /* True if MVE insns are definitely not predicated for any reason */
+    bool mve_no_pred;
     /*
      * >= 0, a copy of PSTATE.BTYPE, which will be 0 without v8.5-BTI.
      *  < 0, set by the current instruction.
diff --git a/target/arm/helper.c b/target/arm/helper.c
index a7ae78146d4..cf41b746b64 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -13673,6 +13673,25 @@ static inline void assert_hflags_rebuild_correctly(CPUARMState *env)
 #endif
 }
 
+static bool mve_no_predication(CPUARMState *env)
+{
+    /*
+     * Return true if there is definitely no predication of MVE
+     * instructions for any reason. (Returning false even if there
+     * isn't any predication is OK; generated code will just be
+     * a little worse.)
+     * We do not account here for partial insn execution due to
+     * ECI bits as those are already in the TB flags elsewhere.
+     */
+    if (env->v7m.vpr) {
+        return false;
+    }
+    if (env->v7m.ltpsize < 4) {
+        return false;
+    }
+    return true;
+}
+
 void cpu_get_tb_cpu_state(CPUARMState *env, target_ulong *pc,
                           target_ulong *cs_base, uint32_t *pflags)
 {
@@ -13712,6 +13731,10 @@ void cpu_get_tb_cpu_state(CPUARMState *env, target_ulong *pc,
             if (env->v7m.fpccr[is_secure] & R_V7M_FPCCR_LSPACT_MASK) {
                 DP_TBFLAG_M32(flags, LSPACT, 1);
             }
+
+            if (mve_no_predication(env)) {
+                DP_TBFLAG_M32(flags, MVE_NO_PRED, 1);
+            }
         } else {
             /*
              * Note that XSCALE_CPAR shares bits with VECSTRIDE.
diff --git a/target/arm/translate-m-nocp.c b/target/arm/translate-m-nocp.c
index 5eab04832cd..79613fc150a 100644
--- a/target/arm/translate-m-nocp.c
+++ b/target/arm/translate-m-nocp.c
@@ -92,6 +92,8 @@ static bool trans_VLLDM_VLSTM(DisasContext *s, arg_VLLDM_VLSTM *a)
         gen_helper_v7m_vlstm(cpu_env, fptr);
     }
     tcg_temp_free_i32(fptr);
+    /* VLLDM or VLSTM helpers might have updated vpr or ltpsize */
+    s->mve_no_pred = false;
 
     clear_eci_state(s);
 
@@ -326,6 +328,7 @@ static bool gen_M_fp_sysreg_write(DisasContext *s, int regno,
     case ARM_VFP_FPSCR:
         tmp = loadfn(s, opaque, true);
         gen_helper_vfp_set_fpscr(cpu_env, tmp);
+        s->mve_no_pred = false;
         tcg_temp_free_i32(tmp);
         gen_lookup_tb(s);
         break;
@@ -397,6 +400,7 @@ static bool gen_M_fp_sysreg_write(DisasContext *s, int regno,
         store_cpu_field(control, v7m.control[M_REG_S]);
         tcg_gen_andi_i32(tmp, tmp, ~FPCR_NZCV_MASK);
         gen_helper_vfp_set_fpscr(cpu_env, tmp);
+        s->mve_no_pred = false;
         tcg_temp_free_i32(tmp);
         tcg_temp_free_i32(sfpa);
         break;
@@ -409,6 +413,7 @@ static bool gen_M_fp_sysreg_write(DisasContext *s, int regno,
         }
         tmp = loadfn(s, opaque, true);
         store_cpu_field(tmp, v7m.vpr);
+        s->mve_no_pred = false;
         break;
     case ARM_VFP_P0:
     {
@@ -418,6 +423,7 @@ static bool gen_M_fp_sysreg_write(DisasContext *s, int regno,
         tcg_gen_deposit_i32(vpr, vpr, tmp,
                             R_V7M_VPR_P0_SHIFT, R_V7M_VPR_P0_LENGTH);
         store_cpu_field(vpr, v7m.vpr);
+        s->mve_no_pred = false;
         tcg_temp_free_i32(tmp);
         break;
     }
@@ -500,6 +506,7 @@ static bool gen_M_fp_sysreg_read(DisasContext *s, int regno,
         store_cpu_field(control, v7m.control[M_REG_S]);
         fpscr = load_cpu_field(v7m.fpdscr[M_REG_NS]);
         gen_helper_vfp_set_fpscr(cpu_env, fpscr);
+        s->mve_no_pred = false;
         tcg_temp_free_i32(fpscr);
         lookup_tb = true;
         break;
@@ -549,6 +556,7 @@ static bool gen_M_fp_sysreg_read(DisasContext *s, int regno,
         zero = tcg_const_i32(0);
         tcg_gen_movcond_i32(TCG_COND_EQ, fpscr, sfpa, zero, fpdscr, fpscr);
         gen_helper_vfp_set_fpscr(cpu_env, fpscr);
+        s->mve_no_pred = false;
         tcg_temp_free_i32(zero);
         tcg_temp_free_i32(sfpa);
         tcg_temp_free_i32(fpdscr);
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 2ed91577ec8..d95807541b1 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -810,7 +810,11 @@ DO_LOGIC(VORR, gen_helper_mve_vorr)
 DO_LOGIC(VORN, gen_helper_mve_vorn)
 DO_LOGIC(VEOR, gen_helper_mve_veor)
 
-DO_LOGIC(VPSEL, gen_helper_mve_vpsel)
+static bool trans_VPSEL(DisasContext *s, arg_2op *a)
+{
+    s->mve_no_pred = false;
+    return do_2op(s, a, gen_helper_mve_vpsel);
+}
 
 #define DO_2OP(INSN, FN) \
     static bool trans_##INSN(DisasContext *s, arg_2op *a)       \
@@ -1366,6 +1370,7 @@ static bool trans_VPNOT(DisasContext *s, arg_VPNOT *a)
     }
 
     gen_helper_mve_vpnot(cpu_env);
+    s->mve_no_pred = false;
     mve_update_eci(s);
     return true;
 }
@@ -1852,6 +1857,7 @@ static bool do_vcmp(DisasContext *s, arg_vcmp *a, MVEGenCmpFn *fn)
         /* VPT */
         gen_vpst(s, a->mask);
     }
+    s->mve_no_pred = false;
     mve_update_eci(s);
     return true;
 }
@@ -1883,6 +1889,7 @@ static bool do_vcmp_scalar(DisasContext *s, arg_vcmp_scalar *a,
         /* VPT */
         gen_vpst(s, a->mask);
     }
+    s->mve_no_pred = false;
     mve_update_eci(s);
     return true;
 }
diff --git a/target/arm/translate-vfp.c b/target/arm/translate-vfp.c
index e2eb797c829..155ca69075b 100644
--- a/target/arm/translate-vfp.c
+++ b/target/arm/translate-vfp.c
@@ -128,6 +128,8 @@ static void gen_preserve_fp_state(DisasContext *s)
          * any further FP insns in this TB.
          */
         s->v7m_lspact = false;
+        /* Helper might have updated vpr or ltpsize */
+        s->mve_no_pred = false;
     }
 }
 
@@ -164,6 +166,7 @@ static void gen_update_fp_context(DisasContext *s)
 
         fpscr = load_cpu_field(v7m.fpdscr[s->v8m_secure]);
         gen_helper_vfp_set_fpscr(cpu_env, fpscr);
+        s->mve_no_pred = false;
         tcg_temp_free_i32(fpscr);
         if (dc_isar_feature(aa32_mve, s)) {
             TCGv_i32 z32 = tcg_const_i32(0);
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 24b7f49d767..54c31ad6dec 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -8464,6 +8464,7 @@ static bool trans_DLS(DisasContext *s, arg_DLS *a)
         /* DLSTP: set FPSCR.LTPSIZE */
         tmp = tcg_const_i32(a->size);
         store_cpu_field(tmp, v7m.ltpsize);
+        s->mve_no_pred = false;
     }
     return true;
 }
@@ -8529,6 +8530,7 @@ static bool trans_WLS(DisasContext *s, arg_WLS *a)
         assert(ok);
         tmp = tcg_const_i32(a->size);
         store_cpu_field(tmp, v7m.ltpsize);
+        s->mve_no_pred = false;
     }
     gen_jmp_tb(s, s->base.pc_next, 1);
 
@@ -8711,6 +8713,7 @@ static bool trans_VCTP(DisasContext *s, arg_VCTP *a)
     gen_helper_mve_vctp(cpu_env, masklen);
     tcg_temp_free_i32(masklen);
     tcg_temp_free_i32(rn_shifted);
+    s->mve_no_pred = false;
     mve_update_eci(s);
     return true;
 }
@@ -9370,6 +9373,7 @@ static void arm_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cs)
         dc->v7m_new_fp_ctxt_needed =
             EX_TBFLAG_M32(tb_flags, NEW_FP_CTXT_NEEDED);
         dc->v7m_lspact = EX_TBFLAG_M32(tb_flags, LSPACT);
+        dc->mve_no_pred = EX_TBFLAG_M32(tb_flags, MVE_NO_PRED) && dc->eci == 0;
     } else {
         dc->debug_target_el = EX_TBFLAG_ANY(tb_flags, DEBUG_TARGET_EL);
         dc->sctlr_b = EX_TBFLAG_A32(tb_flags, SCTLR__B);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/4] target/arm: Optimize MVE logic ops
  2021-09-02 15:09 [PATCH 0/4] target/arm: Use TCG vector ops for MVE Peter Maydell
  2021-09-02 15:09 ` [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
@ 2021-09-02 15:09 ` Peter Maydell
  2021-09-02 16:19   ` Philippe Mathieu-Daudé
  2021-09-03 13:58   ` Richard Henderson
  2021-09-02 15:09 ` [PATCH 3/4] target/arm: Optimize MVE arithmetic ops Peter Maydell
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Peter Maydell @ 2021-09-02 15:09 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

When not predicating, implement the MVE bitwise logical insns
directly using TCG vector operations.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 41 ++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index d95807541b1..087d60f0d81 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -774,7 +774,8 @@ static bool trans_VNEG_fp(DisasContext *s, arg_1op *a)
     return do_1op(s, a, fns[a->size]);
 }
 
-static bool do_2op(DisasContext *s, arg_2op *a, MVEGenTwoOpFn fn)
+static bool do_2op_vec(DisasContext *s, arg_2op *a, MVEGenTwoOpFn fn,
+                       GVecGen3Fn *vecfn)
 {
     TCGv_ptr qd, qn, qm;
 
@@ -787,28 +788,38 @@ static bool do_2op(DisasContext *s, arg_2op *a, MVEGenTwoOpFn fn)
         return true;
     }
 
-    qd = mve_qreg_ptr(a->qd);
-    qn = mve_qreg_ptr(a->qn);
-    qm = mve_qreg_ptr(a->qm);
-    fn(cpu_env, qd, qn, qm);
-    tcg_temp_free_ptr(qd);
-    tcg_temp_free_ptr(qn);
-    tcg_temp_free_ptr(qm);
+    if (vecfn && s->mve_no_pred) {
+        vecfn(a->size, mve_qreg_offset(a->qd), mve_qreg_offset(a->qn),
+              mve_qreg_offset(a->qm), 16, 16);
+    } else {
+        qd = mve_qreg_ptr(a->qd);
+        qn = mve_qreg_ptr(a->qn);
+        qm = mve_qreg_ptr(a->qm);
+        fn(cpu_env, qd, qn, qm);
+        tcg_temp_free_ptr(qd);
+        tcg_temp_free_ptr(qn);
+        tcg_temp_free_ptr(qm);
+    }
     mve_update_eci(s);
     return true;
 }
 
-#define DO_LOGIC(INSN, HELPER)                                  \
+static bool do_2op(DisasContext *s, arg_2op *a, MVEGenTwoOpFn *fn)
+{
+    return do_2op_vec(s, a, fn, NULL);
+}
+
+#define DO_LOGIC(INSN, HELPER, VECFN)                           \
     static bool trans_##INSN(DisasContext *s, arg_2op *a)       \
     {                                                           \
-        return do_2op(s, a, HELPER);                            \
+        return do_2op_vec(s, a, HELPER, VECFN);                 \
     }
 
-DO_LOGIC(VAND, gen_helper_mve_vand)
-DO_LOGIC(VBIC, gen_helper_mve_vbic)
-DO_LOGIC(VORR, gen_helper_mve_vorr)
-DO_LOGIC(VORN, gen_helper_mve_vorn)
-DO_LOGIC(VEOR, gen_helper_mve_veor)
+DO_LOGIC(VAND, gen_helper_mve_vand, tcg_gen_gvec_and)
+DO_LOGIC(VBIC, gen_helper_mve_vbic, tcg_gen_gvec_andc)
+DO_LOGIC(VORR, gen_helper_mve_vorr, tcg_gen_gvec_or)
+DO_LOGIC(VORN, gen_helper_mve_vorn, tcg_gen_gvec_orc)
+DO_LOGIC(VEOR, gen_helper_mve_veor, tcg_gen_gvec_xor)
 
 static bool trans_VPSEL(DisasContext *s, arg_2op *a)
 {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/4] target/arm: Optimize MVE arithmetic ops
  2021-09-02 15:09 [PATCH 0/4] target/arm: Use TCG vector ops for MVE Peter Maydell
  2021-09-02 15:09 ` [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
  2021-09-02 15:09 ` [PATCH 2/4] target/arm: Optimize MVE logic ops Peter Maydell
@ 2021-09-02 15:09 ` Peter Maydell
  2021-09-02 16:20   ` Philippe Mathieu-Daudé
  2021-09-03 13:59   ` Richard Henderson
  2021-09-02 15:09 ` [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS Peter Maydell
  2021-09-03 15:14 ` [PATCH 0/4] target/arm: Use TCG vector ops for MVE Richard Henderson
  4 siblings, 2 replies; 27+ messages in thread
From: Peter Maydell @ 2021-09-02 15:09 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Optimize MVE arithmetic ops when we have a TCG
vector operation we can use.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 087d60f0d81..5bc5a3a2063 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -827,7 +827,7 @@ static bool trans_VPSEL(DisasContext *s, arg_2op *a)
     return do_2op(s, a, gen_helper_mve_vpsel);
 }
 
-#define DO_2OP(INSN, FN) \
+#define DO_2OP_VEC(INSN, FN, VECFN)                             \
     static bool trans_##INSN(DisasContext *s, arg_2op *a)       \
     {                                                           \
         static MVEGenTwoOpFn * const fns[] = {                  \
@@ -836,20 +836,22 @@ static bool trans_VPSEL(DisasContext *s, arg_2op *a)
             gen_helper_mve_##FN##w,                             \
             NULL,                                               \
         };                                                      \
-        return do_2op(s, a, fns[a->size]);                      \
+        return do_2op_vec(s, a, fns[a->size], VECFN);           \
     }
 
-DO_2OP(VADD, vadd)
-DO_2OP(VSUB, vsub)
-DO_2OP(VMUL, vmul)
+#define DO_2OP(INSN, FN) DO_2OP_VEC(INSN, FN, NULL)
+
+DO_2OP_VEC(VADD, vadd, tcg_gen_gvec_add)
+DO_2OP_VEC(VSUB, vsub, tcg_gen_gvec_sub)
+DO_2OP_VEC(VMUL, vmul, tcg_gen_gvec_mul)
 DO_2OP(VMULH_S, vmulhs)
 DO_2OP(VMULH_U, vmulhu)
 DO_2OP(VRMULH_S, vrmulhs)
 DO_2OP(VRMULH_U, vrmulhu)
-DO_2OP(VMAX_S, vmaxs)
-DO_2OP(VMAX_U, vmaxu)
-DO_2OP(VMIN_S, vmins)
-DO_2OP(VMIN_U, vminu)
+DO_2OP_VEC(VMAX_S, vmaxs, tcg_gen_gvec_smax)
+DO_2OP_VEC(VMAX_U, vmaxu, tcg_gen_gvec_umax)
+DO_2OP_VEC(VMIN_S, vmins, tcg_gen_gvec_smin)
+DO_2OP_VEC(VMIN_U, vminu, tcg_gen_gvec_umin)
 DO_2OP(VABD_S, vabds)
 DO_2OP(VABD_U, vabdu)
 DO_2OP(VHADD_S, vhadds)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS
  2021-09-02 15:09 [PATCH 0/4] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (2 preceding siblings ...)
  2021-09-02 15:09 ` [PATCH 3/4] target/arm: Optimize MVE arithmetic ops Peter Maydell
@ 2021-09-02 15:09 ` Peter Maydell
  2021-09-02 16:21   ` Philippe Mathieu-Daudé
  2021-09-03 14:00   ` Richard Henderson
  2021-09-03 15:14 ` [PATCH 0/4] target/arm: Use TCG vector ops for MVE Richard Henderson
  4 siblings, 2 replies; 27+ messages in thread
From: Peter Maydell @ 2021-09-02 15:09 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Optimize the MVE VNEG and VABS insns by using TCG
vector ops when possible.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 5bc5a3a2063..32c5ff6ea06 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -500,7 +500,8 @@ static bool trans_VDUP(DisasContext *s, arg_VDUP *a)
     return true;
 }
 
-static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
+static bool do_1op_vec(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn,
+                       GVecGen2Fn vecfn)
 {
     TCGv_ptr qd, qm;
 
@@ -514,16 +515,25 @@ static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
         return true;
     }
 
-    qd = mve_qreg_ptr(a->qd);
-    qm = mve_qreg_ptr(a->qm);
-    fn(cpu_env, qd, qm);
-    tcg_temp_free_ptr(qd);
-    tcg_temp_free_ptr(qm);
+    if (vecfn && s->mve_no_pred) {
+        vecfn(a->size, mve_qreg_offset(a->qd), mve_qreg_offset(a->qm), 16, 16);
+    } else {
+        qd = mve_qreg_ptr(a->qd);
+        qm = mve_qreg_ptr(a->qm);
+        fn(cpu_env, qd, qm);
+        tcg_temp_free_ptr(qd);
+        tcg_temp_free_ptr(qm);
+    }
     mve_update_eci(s);
     return true;
 }
 
-#define DO_1OP(INSN, FN)                                        \
+static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
+{
+    return do_1op_vec(s, a, fn, NULL);
+}
+
+#define DO_1OP_VEC(INSN, FN, VECFN)                             \
     static bool trans_##INSN(DisasContext *s, arg_1op *a)       \
     {                                                           \
         static MVEGenOneOpFn * const fns[] = {                  \
@@ -532,13 +542,15 @@ static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
             gen_helper_mve_##FN##w,                             \
             NULL,                                               \
         };                                                      \
-        return do_1op(s, a, fns[a->size]);                      \
+        return do_1op_vec(s, a, fns[a->size], VECFN);           \
     }
 
+#define DO_1OP(INSN, FN) DO_1OP_VEC(INSN, FN, NULL)
+
 DO_1OP(VCLZ, vclz)
 DO_1OP(VCLS, vcls)
-DO_1OP(VABS, vabs)
-DO_1OP(VNEG, vneg)
+DO_1OP_VEC(VABS, vabs, tcg_gen_gvec_abs)
+DO_1OP_VEC(VNEG, vneg, tcg_gen_gvec_neg)
 DO_1OP(VQABS, vqabs)
 DO_1OP(VQNEG, vqneg)
 DO_1OP(VMAXA, vmaxa)
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] target/arm: Optimize MVE logic ops
  2021-09-02 15:09 ` [PATCH 2/4] target/arm: Optimize MVE logic ops Peter Maydell
@ 2021-09-02 16:19   ` Philippe Mathieu-Daudé
  2021-09-03 13:58   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-09-02 16:19 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> When not predicating, implement the MVE bitwise logical insns
> directly using TCG vector operations.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>  target/arm/translate-mve.c | 41 ++++++++++++++++++++++++--------------
>  1 file changed, 26 insertions(+), 15 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] target/arm: Optimize MVE arithmetic ops
  2021-09-02 15:09 ` [PATCH 3/4] target/arm: Optimize MVE arithmetic ops Peter Maydell
@ 2021-09-02 16:20   ` Philippe Mathieu-Daudé
  2021-09-03 13:59   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-09-02 16:20 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> Optimize MVE arithmetic ops when we have a TCG
> vector operation we can use.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>  target/arm/translate-mve.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS
  2021-09-02 15:09 ` [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS Peter Maydell
@ 2021-09-02 16:21   ` Philippe Mathieu-Daudé
  2021-09-03 14:00   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-09-02 16:21 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> Optimize the MVE VNEG and VABS insns by using TCG
> vector ops when possible.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>  target/arm/translate-mve.c | 32 ++++++++++++++++++++++----------
>  1 file changed, 22 insertions(+), 10 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-02 15:09 ` [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
@ 2021-09-03 13:57   ` Richard Henderson
  2021-09-03 14:17     ` Peter Maydell
                       ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-03 13:57 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> Our current codegen for MVE always calls out to helper functions,
> because some byte lanes might be predicated. The common case is
> that in fact there is no predication active and all lanes should
> be updated together, so we can produce better code by detecting
> that and using the TCG generic vector infrastructure.
> 
> Add a TB flag that is set when we can guarantee that there
> is no active MVE predication, and a bool in the DisasContext.
> Subsequent patches will use this flag to generate improved
> code for some instructions.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> Another of those awkward-to-review patches where the interesting
> question is "did I miss anywhere where we set v7m.vpr to non-0
> or v7m.ltpsize to non-4 while in the middle of a TB ?" ...
> 
> I've left the TB flag as non-cached because there seemed to
> be an irritatingly large set of places where we'd have to do
> an hflags recomputation.

Indeed, non-cached seems like the right option.

> +static bool mve_no_predication(CPUARMState *env)
> +{
> +    /*
> +     * Return true if there is definitely no predication of MVE
> +     * instructions for any reason. (Returning false even if there
> +     * isn't any predication is OK; generated code will just be
> +     * a little worse.)
> +     * We do not account here for partial insn execution due to
> +     * ECI bits as those are already in the TB flags elsewhere.
> +     */

I think you should go ahead and include that here, as it makes the test self-contained, 
and avoids you having to do this:

> +        dc->mve_no_pred = EX_TBFLAG_M32(tb_flags, MVE_NO_PRED) && dc->eci == 0;



> +    /* VLLDM or VLSTM helpers might have updated vpr or ltpsize */
> +    s->mve_no_pred = false;

For a moment I was thinking that this was ok, just finish the rest of the TB and resync 
naturally, letting the end of the TB use the helpers.  But then I thought about goto_tb. 
If you have this potentially variable condition, you can't chain the links between this TB 
and the next.

I think you need to go ahead and end the TB and resync immediately.
Just set dc->base.is_jmp = DISAS_UPDATE_NOCHAIN instead.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] target/arm: Optimize MVE logic ops
  2021-09-02 15:09 ` [PATCH 2/4] target/arm: Optimize MVE logic ops Peter Maydell
  2021-09-02 16:19   ` Philippe Mathieu-Daudé
@ 2021-09-03 13:58   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-03 13:58 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> When not predicating, implement the MVE bitwise logical insns
> directly using TCG vector operations.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate-mve.c | 41 ++++++++++++++++++++++++--------------
>   1 file changed, 26 insertions(+), 15 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] target/arm: Optimize MVE arithmetic ops
  2021-09-02 15:09 ` [PATCH 3/4] target/arm: Optimize MVE arithmetic ops Peter Maydell
  2021-09-02 16:20   ` Philippe Mathieu-Daudé
@ 2021-09-03 13:59   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-03 13:59 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> Optimize MVE arithmetic ops when we have a TCG
> vector operation we can use.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate-mve.c | 20 +++++++++++---------
>   1 file changed, 11 insertions(+), 9 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS
  2021-09-02 15:09 ` [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS Peter Maydell
  2021-09-02 16:21   ` Philippe Mathieu-Daudé
@ 2021-09-03 14:00   ` Richard Henderson
  1 sibling, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-03 14:00 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> Optimize the MVE VNEG and VABS insns by using TCG
> vector ops when possible.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate-mve.c | 32 ++++++++++++++++++++++----------
>   1 file changed, 22 insertions(+), 10 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-03 13:57   ` Richard Henderson
@ 2021-09-03 14:17     ` Peter Maydell
  2021-09-03 15:28       ` Richard Henderson
  2021-09-09 13:46     ` Peter Maydell
  2021-09-09 14:53     ` Peter Maydell
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2021-09-03 14:17 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Fri, 3 Sept 2021 at 14:58, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/2/21 5:09 PM, Peter Maydell wrote:
> > Our current codegen for MVE always calls out to helper functions,
> > because some byte lanes might be predicated. The common case is
> > that in fact there is no predication active and all lanes should
> > be updated together, so we can produce better code by detecting
> > that and using the TCG generic vector infrastructure.
> >
> > Add a TB flag that is set when we can guarantee that there
> > is no active MVE predication, and a bool in the DisasContext.
> > Subsequent patches will use this flag to generate improved
> > code for some instructions.
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> > Another of those awkward-to-review patches where the interesting
> > question is "did I miss anywhere where we set v7m.vpr to non-0
> > or v7m.ltpsize to non-4 while in the middle of a TB ?" ...
> >
> > I've left the TB flag as non-cached because there seemed to
> > be an irritatingly large set of places where we'd have to do
> > an hflags recomputation.
>
> Indeed, non-cached seems like the right option.
>
> > +static bool mve_no_predication(CPUARMState *env)
> > +{
> > +    /*
> > +     * Return true if there is definitely no predication of MVE
> > +     * instructions for any reason. (Returning false even if there
> > +     * isn't any predication is OK; generated code will just be
> > +     * a little worse.)
> > +     * We do not account here for partial insn execution due to
> > +     * ECI bits as those are already in the TB flags elsewhere.
> > +     */
>
> I think you should go ahead and include that here, as it makes the test self-contained,
> and avoids you having to do this:
>
> > +        dc->mve_no_pred = EX_TBFLAG_M32(tb_flags, MVE_NO_PRED) && dc->eci == 0;

Yeah. I think I was initially thinking that I wanted the MVE_NO_PRED
flag to be cached in hflags, in which case you have to do the combination
with ECI here because the ECI/condexec bits are not cached. But if we're
happy for MVE_NO_PRED to be not cached then we can happily combine with
ECI directly in mve_no_predication().

> > +    /* VLLDM or VLSTM helpers might have updated vpr or ltpsize */
> > +    s->mve_no_pred = false;
>
> For a moment I was thinking that this was ok, just finish the rest of the TB and resync
> naturally, letting the end of the TB use the helpers.  But then I thought about goto_tb.
> If you have this potentially variable condition, you can't chain the links between this TB
> and the next.

I don't really understand what you mean here. What's the difference
between ending the TB now and translating a few more insns in
this TB before we end it?

thanks
-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/4] target/arm: Use TCG vector ops for MVE
  2021-09-02 15:09 [PATCH 0/4] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (3 preceding siblings ...)
  2021-09-02 15:09 ` [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS Peter Maydell
@ 2021-09-03 15:14 ` Richard Henderson
  2021-09-03 15:20   ` Peter Maydell
  4 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2021-09-03 15:14 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 9/2/21 5:09 PM, Peter Maydell wrote:
> Richard: if you have time to scan through the MVE insns
> and suggest which ones would benefit from a vectorized
> version that would be very helpful...

VDUP
VMOVL (into shifts or and)
VMVN (which seems to have gotten separated from the rest of the do_1op)
VSHL
VSHR
VSLI
VSRI

I think that's about all.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/4] target/arm: Use TCG vector ops for MVE
  2021-09-03 15:14 ` [PATCH 0/4] target/arm: Use TCG vector ops for MVE Richard Henderson
@ 2021-09-03 15:20   ` Peter Maydell
  2021-09-03 15:36     ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2021-09-03 15:20 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Fri, 3 Sept 2021 at 16:14, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/2/21 5:09 PM, Peter Maydell wrote:
> > Richard: if you have time to scan through the MVE insns
> > and suggest which ones would benefit from a vectorized
> > version that would be very helpful...
>
> VDUP
> VMOVL (into shifts or and)
> VMVN (which seems to have gotten separated from the rest of the do_1op)
> VSHL
> VSHR
> VSLI
> VSRI
>
> I think that's about all.

I guess also VMOV (immediate) (vector) ?

Thanks for the list, I'll have a look at them.

-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-03 14:17     ` Peter Maydell
@ 2021-09-03 15:28       ` Richard Henderson
  2021-09-03 16:30         ` Peter Maydell
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2021-09-03 15:28 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 9/3/21 4:17 PM, Peter Maydell wrote:
> I don't really understand what you mean here. What's the difference
> between ending the TB now and translating a few more insns in
> this TB before we end it?

   VCMP (pred now on or off)
   B    label

The code we emit for B uses goto_tb.

When goto_tb is unlinked (e.g. first execute), we exit with a code that causes the current 
cpu state to be evaluated, the new TB created, and the link filled in.  This makes 
permanent the cpu state we evaluated, and thus the state change across goto_tb must be 
constant.  This is fine for B, since only the PC changes and to a known destination.

So we have two choices: end the TB immediately after the VCMP, or set another flag to 
cause gen_goto_tb to fall back to goto_ptr.

But I sorta think that it's just easier to end after VCMP.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/4] target/arm: Use TCG vector ops for MVE
  2021-09-03 15:20   ` Peter Maydell
@ 2021-09-03 15:36     ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-03 15:36 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 9/3/21 5:20 PM, Peter Maydell wrote:
> On Fri, 3 Sept 2021 at 16:14, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/2/21 5:09 PM, Peter Maydell wrote:
>>> Richard: if you have time to scan through the MVE insns
>>> and suggest which ones would benefit from a vectorized
>>> version that would be very helpful...
>>
>> VDUP
>> VMOVL (into shifts or and)
>> VMVN (which seems to have gotten separated from the rest of the do_1op)
>> VSHL
>> VSHR
>> VSLI
>> VSRI
>>
>> I think that's about all.
> 
> I guess also VMOV (immediate) (vector) ?

Oh right, yes.  I skipped VMOV because I remembered that the register version is an alias 
for VORR.  Am I correct that there is no mve insn corresponding to the NEON VLD1 (single 
element to all lanes)?


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-03 15:28       ` Richard Henderson
@ 2021-09-03 16:30         ` Peter Maydell
  2021-09-07 14:11           ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2021-09-03 16:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Fri, 3 Sept 2021 at 16:28, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/3/21 4:17 PM, Peter Maydell wrote:
> > I don't really understand what you mean here. What's the difference
> > between ending the TB now and translating a few more insns in
> > this TB before we end it?
>
>    VCMP (pred now on or off)
>    B    label
>
> The code we emit for B uses goto_tb.
>
> When goto_tb is unlinked (e.g. first execute), we exit with a code that causes the current
> cpu state to be evaluated, the new TB created, and the link filled in.  This makes
> permanent the cpu state we evaluated, and thus the state change across goto_tb must be
> constant.  This is fine for B, since only the PC changes and to a known destination.
>
> So we have two choices: end the TB immediately after the VCMP, or set another flag to
> cause gen_goto_tb to fall back to goto_ptr.

Ah, I see, and when we end after the VCMP then we use UPDATE_NOCHAIN
which causes us to use lookup_and_goto_ptr instead of goto_tb.

I was wondering if this is also a problem for the flags like s->v7m_lspact,
which we currently handle in about this way -- set from a TB flag, but
then updated for the rest of the TB. Let me try to write out the rules:

 * if you do something that changes the TB flag, and you know for
   definite the new state, and this change happens and is the same
   for every exit from the TB, then you can just update the flag
   and keep going in this TB. (This is why lspact etc are OK)
 * if you don't know for definite the new state, because it might
   not have happened, eg a helper function condition changes
   something, then you must end the TB immediately (in a way that
   doesn't use goto_tb). This is true both for "I couldn't figure
   out the new state, it's too hard" and "the new state depends on
   some runtime data such that different executions of the same TB
   might end up with different values for the flag"
 * it's not good enough to say "well, I can pessimistically always
   assume mve_no_pred", because of the goto_tb issue. You have to
   be sure of the exact value that a TB flags calculation after
   the insn will get.

So I think that pretty much everywhere in my current patch that is setting
s->mve_no_pred = false needs instead to just end the TB. That seems mostly
straightforward, but there are some tricky cases:

 * WLSTP. The code generated for this insn has two exits. The change
   to ltpsize happens on only one of those, which is currently
   handled by gen_jmp_tb(). I think we are OK to just leave the
   code as it is, because the value we pass to LTPSIZE is
   constant (encoded in the instruction), and so the new value
   of MVE_NO_PRED is always the same on that exit from the TB
   for all executions
 * gen_update_fp_context() -- this function gets called for pretty
   much every FP/MVE insn (as part of vfp_access_check), and it
   can in rare cases update the FPSCR.LTPSIZE and the VPR. I guess
   this means we really do need to end the TB
   if (MVE && s->v7m_new_fp_ctxt_needed) (ie the comment in
   gen_update_fp_context "We don't need to arrange to end the TB,
   because [we don't cache FPSCR in TB flags]" is no longer true).
   That seems likely to be painful because some of the insns that
   do a vfp_access_check do complicated things with the TB exits
   (eg WLSTP, LETP) that are going to override a naive setting of
   is_jmp in gen_update_fp_context()...
 * gen_preserve_fp_state() is similar to gen_update_fp_context()

thanks
-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-03 16:30         ` Peter Maydell
@ 2021-09-07 14:11           ` Richard Henderson
  2021-09-07 14:22             ` Richard Henderson
  2021-09-10 16:00             ` Peter Maydell
  0 siblings, 2 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-07 14:11 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 9/3/21 6:30 PM, Peter Maydell wrote:
> Let me try to write out the rules:
> 
>   * if you do something that changes the TB flag, and you know for
>     definite the new state, and this change happens and is the same
>     for every exit from the TB, then you can just update the flag
>     and keep going in this TB. (This is why lspact etc are OK)
>   * if you don't know for definite the new state, because it might
>     not have happened, eg a helper function condition changes
>     something, then you must end the TB immediately (in a way that
>     doesn't use goto_tb). This is true both for "I couldn't figure
>     out the new state, it's too hard" and "the new state depends on
>     some runtime data such that different executions of the same TB
>     might end up with different values for the flag"
>   * it's not good enough to say "well, I can pessimistically always
>     assume mve_no_pred", because of the goto_tb issue. You have to
>     be sure of the exact value that a TB flags calculation after
>     the insn will get.

Correct.

> 
> So I think that pretty much everywhere in my current patch that is setting
> s->mve_no_pred = false needs instead to just end the TB. That seems mostly
> straightforward, but there are some tricky cases:
> 
>   * WLSTP. The code generated for this insn has two exits. The change
>     to ltpsize happens on only one of those, which is currently
>     handled by gen_jmp_tb(). I think we are OK to just leave the
>     code as it is, because the value we pass to LTPSIZE is
>     constant (encoded in the instruction), and so the new value
>     of MVE_NO_PRED is always the same on that exit from the TB
>     for all executions

Correct.

>   * gen_update_fp_context() -- this function gets called for pretty
>     much every FP/MVE insn (as part of vfp_access_check), and it
>     can in rare cases update the FPSCR.LTPSIZE and the VPR. I guess
>     this means we really do need to end the TB
>     if (MVE && s->v7m_new_fp_ctxt_needed) (ie the comment in
>     gen_update_fp_context "We don't need to arrange to end the TB,
>     because [we don't cache FPSCR in TB flags]" is no longer true).
>     That seems likely to be painful because some of the insns that
>     do a vfp_access_check do complicated things with the TB exits
>     (eg WLSTP, LETP) that are going to override a naive setting of
>     is_jmp in gen_update_fp_context()...

Well, we could let gen_goto_tb see that is_jmp is already UPDATE_NOCHAIN, and suppress the 
goto_tb in that case.  That would seem to take care of any entry into gen_jmp(_tb).


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-07 14:11           ` Richard Henderson
@ 2021-09-07 14:22             ` Richard Henderson
  2021-09-10 16:00             ` Peter Maydell
  1 sibling, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-07 14:22 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 9/7/21 4:11 PM, Richard Henderson wrote:
> Well, we could let gen_goto_tb see that is_jmp is already UPDATE_NOCHAIN...

Perhaps anything besides DISAS_NEXT?


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-03 13:57   ` Richard Henderson
  2021-09-03 14:17     ` Peter Maydell
@ 2021-09-09 13:46     ` Peter Maydell
  2021-09-10  6:46       ` Richard Henderson
  2021-09-09 14:53     ` Peter Maydell
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2021-09-09 13:46 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Fri, 3 Sept 2021 at 14:58, Richard Henderson
<richard.henderson@linaro.org> wrote:
> I think you need to go ahead and end the TB and resync immediately.
> Just set dc->base.is_jmp = DISAS_UPDATE_NOCHAIN instead.

Is there a rule for when we should set is_jmp to DISAS_UPDATE_NOCHAIN,
when we should set it to DISAS_UPDATE_EXIT, and when we should call
gen_lookup_tb() ? We seem to use all three methods at various points
for "I changed some CPU state and want to end the TB"...

-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-03 13:57   ` Richard Henderson
  2021-09-03 14:17     ` Peter Maydell
  2021-09-09 13:46     ` Peter Maydell
@ 2021-09-09 14:53     ` Peter Maydell
  2 siblings, 0 replies; 27+ messages in thread
From: Peter Maydell @ 2021-09-09 14:53 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Fri, 3 Sept 2021 at 14:58, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/2/21 5:09 PM, Peter Maydell wrote:
> > Our current codegen for MVE always calls out to helper functions,
> > because some byte lanes might be predicated. The common case is
> > that in fact there is no predication active and all lanes should
> > be updated together, so we can produce better code by detecting
> > that and using the TCG generic vector infrastructure.
> >
> > Add a TB flag that is set when we can guarantee that there
> > is no active MVE predication, and a bool in the DisasContext.
> > Subsequent patches will use this flag to generate improved
> > code for some instructions.
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> > Another of those awkward-to-review patches where the interesting
> > question is "did I miss anywhere where we set v7m.vpr to non-0
> > or v7m.ltpsize to non-4 while in the middle of a TB ?" ...
> >
> > I've left the TB flag as non-cached because there seemed to
> > be an irritatingly large set of places where we'd have to do
> > an hflags recomputation.
>
> Indeed, non-cached seems like the right option.
>
> > +static bool mve_no_predication(CPUARMState *env)
> > +{
> > +    /*
> > +     * Return true if there is definitely no predication of MVE
> > +     * instructions for any reason. (Returning false even if there
> > +     * isn't any predication is OK; generated code will just be
> > +     * a little worse.)
> > +     * We do not account here for partial insn execution due to
> > +     * ECI bits as those are already in the TB flags elsewhere.
> > +     */
>
> I think you should go ahead and include that here, as it makes the test self-contained,
> and avoids you having to do this:
>
> > +        dc->mve_no_pred = EX_TBFLAG_M32(tb_flags, MVE_NO_PRED) && dc->eci == 0;

I found a reason why these should stay split, and it's related to the
discussion in this thread about TB flags whose state changes and whether
that means you need to end the TB. The TB flag state in the CONDEXEC bits
(ie including the ECI bits) changes for a lot of instructions, but we don't
need to end the TB because it always changes in the exact same way
regardless. A TB flag encoding just "MVE predication might happen
due to LTPSIZE and VPR" changes in only a few places, and we can end
the TB in those places. A single TB flag that mixes together all of
LTPSIZE, VPR and ECI can change state in a hard-to-disentangle way,
because a non-zero ECI can change to zero for just about any MVE insn.

If we keep MVE_NO_PRED and ECI distinct, then we don't need to consider
ECI updates when figuring out if we need to end the TB, which works out neater.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-09 13:46     ` Peter Maydell
@ 2021-09-10  6:46       ` Richard Henderson
  2021-09-10  9:30         ` Peter Maydell
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2021-09-10  6:46 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 9/9/21 3:46 PM, Peter Maydell wrote:
> On Fri, 3 Sept 2021 at 14:58, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>> I think you need to go ahead and end the TB and resync immediately.
>> Just set dc->base.is_jmp = DISAS_UPDATE_NOCHAIN instead.
> 
> Is there a rule for when we should set is_jmp to DISAS_UPDATE_NOCHAIN,
> when we should set it to DISAS_UPDATE_EXIT, and when we should call
> gen_lookup_tb() ? We seem to use all three methods at various points
> for "I changed some CPU state and want to end the TB"...

UPDATE_EXIT is for when you need to return to the main loop; UPDATE_NOCHAIN merely avoids 
goto_tb.  Direct use of gen_goto_ptr should be reserved for branches (which I think is 
already the case).


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-10  6:46       ` Richard Henderson
@ 2021-09-10  9:30         ` Peter Maydell
  2021-09-10 13:08           ` Richard Henderson
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2021-09-10  9:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Fri, 10 Sept 2021 at 07:46, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/9/21 3:46 PM, Peter Maydell wrote:
> > On Fri, 3 Sept 2021 at 14:58, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >> I think you need to go ahead and end the TB and resync immediately.
> >> Just set dc->base.is_jmp = DISAS_UPDATE_NOCHAIN instead.
> >
> > Is there a rule for when we should set is_jmp to DISAS_UPDATE_NOCHAIN,
> > when we should set it to DISAS_UPDATE_EXIT, and when we should call
> > gen_lookup_tb() ? We seem to use all three methods at various points
> > for "I changed some CPU state and want to end the TB"...
>
> UPDATE_EXIT is for when you need to return to the main loop; UPDATE_NOCHAIN merely avoids
> goto_tb.  Direct use of gen_goto_ptr should be reserved for branches (which I think is
> already the case).

And what about gen_lookup_tb() ? As far as I can tell the semantics of
this are the same as setting is_jmp to DISAS_UPDATE_EXIT: the "set PC
to s->base.pc_next" happens immediately rather than being done later in
arm_tr_tb_stop(), but I don't think that should make a difference.
Or is there a case where it matters?

If gen_lookup_tb() and DISAS_UPDATE_EXIT are the same, maybe we
should get rid of gen_lookup_tb() entirely ?

-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-10  9:30         ` Peter Maydell
@ 2021-09-10 13:08           ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-10 13:08 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 9/10/21 11:30 AM, Peter Maydell wrote:
> If gen_lookup_tb() and DISAS_UPDATE_EXIT are the same, maybe we
> should get rid of gen_lookup_tb() entirely ?

Yes, I think we should do just that.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-07 14:11           ` Richard Henderson
  2021-09-07 14:22             ` Richard Henderson
@ 2021-09-10 16:00             ` Peter Maydell
  2021-09-12 12:48               ` Richard Henderson
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Maydell @ 2021-09-10 16:00 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers

On Tue, 7 Sept 2021 at 15:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
> >   * gen_update_fp_context() -- this function gets called for pretty
> >     much every FP/MVE insn (as part of vfp_access_check), and it
> >     can in rare cases update the FPSCR.LTPSIZE and the VPR. I guess
> >     this means we really do need to end the TB
> >     if (MVE && s->v7m_new_fp_ctxt_needed) (ie the comment in
> >     gen_update_fp_context "We don't need to arrange to end the TB,
> >     because [we don't cache FPSCR in TB flags]" is no longer true).
> >     That seems likely to be painful because some of the insns that
> >     do a vfp_access_check do complicated things with the TB exits
> >     (eg WLSTP, LETP) that are going to override a naive setting of
> >     is_jmp in gen_update_fp_context()...
>
> Well, we could let gen_goto_tb see that is_jmp is already UPDATE_NOCHAIN, and suppress the
> goto_tb in that case.  That would seem to take care of any entry into gen_jmp(_tb).

We actually already have code that sets is_jmp (to DISAS_UPDATE_EXIT)
from gen_preserve_fp_state() -- we do that if we're using icount,
setting DISAS_UPDATE_EXIT to force this to be the last insn in the TB.

Do icount IO instructions need to avoid a possible goto_tb ?
I suppose we do want them to go back to the main loop.

This suggests that we need to make gen_goto_tb() look at
the is_jmp value anyway, regardless of this series.

-- PMM


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-10 16:00             ` Peter Maydell
@ 2021-09-12 12:48               ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2021-09-12 12:48 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers

On 9/10/21 9:00 AM, Peter Maydell wrote:
> We actually already have code that sets is_jmp (to DISAS_UPDATE_EXIT)
> from gen_preserve_fp_state() -- we do that if we're using icount,
> setting DISAS_UPDATE_EXIT to force this to be the last insn in the TB.
> 
> Do icount IO instructions need to avoid a possible goto_tb ?
> I suppose we do want them to go back to the main loop.

We do want them to go back to the main loop.  I'd be surprised if many (or any) io 
instructions are also direct branches, but...

> This suggests that we need to make gen_goto_tb() look at
> the is_jmp value anyway, regardless of this series.

Yes.  WLSTP may need to avoid the goto_tb along one path.


r~


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-09-12 12:52 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02 15:09 [PATCH 0/4] target/arm: Use TCG vector ops for MVE Peter Maydell
2021-09-02 15:09 ` [PATCH 1/4] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
2021-09-03 13:57   ` Richard Henderson
2021-09-03 14:17     ` Peter Maydell
2021-09-03 15:28       ` Richard Henderson
2021-09-03 16:30         ` Peter Maydell
2021-09-07 14:11           ` Richard Henderson
2021-09-07 14:22             ` Richard Henderson
2021-09-10 16:00             ` Peter Maydell
2021-09-12 12:48               ` Richard Henderson
2021-09-09 13:46     ` Peter Maydell
2021-09-10  6:46       ` Richard Henderson
2021-09-10  9:30         ` Peter Maydell
2021-09-10 13:08           ` Richard Henderson
2021-09-09 14:53     ` Peter Maydell
2021-09-02 15:09 ` [PATCH 2/4] target/arm: Optimize MVE logic ops Peter Maydell
2021-09-02 16:19   ` Philippe Mathieu-Daudé
2021-09-03 13:58   ` Richard Henderson
2021-09-02 15:09 ` [PATCH 3/4] target/arm: Optimize MVE arithmetic ops Peter Maydell
2021-09-02 16:20   ` Philippe Mathieu-Daudé
2021-09-03 13:59   ` Richard Henderson
2021-09-02 15:09 ` [PATCH 4/4] target/arm: Optimize MVE VNEG, VABS Peter Maydell
2021-09-02 16:21   ` Philippe Mathieu-Daudé
2021-09-03 14:00   ` Richard Henderson
2021-09-03 15:14 ` [PATCH 0/4] target/arm: Use TCG vector ops for MVE Richard Henderson
2021-09-03 15:20   ` Peter Maydell
2021-09-03 15:36     ` Richard Henderson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.