* [Qemu-devel] [PATCH v6 0/4] target/mips: Optimize MSA interleave instructions
@ 2019-04-04 13:14 Mateja Marjanovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-04 13:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, philmd, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize the MSA instructions ILVEV.<B|H|W|D>, ILVOD.<B|H|W|D>,
ILVL.<B|H|W|D> and ILVR.<B|H|W|D> using a hybrid approach:
MSA helpers where they are faster, and direct operations on
tcg registers elsewhere, to improve performance.

v6:
 - Add ILVL.<B|H|W|D> and ILVR.<B|H|W|D> MSA instructions
   with mixed approaches (with helpers and with tcg
   registers).
 - Test the performance for ILVL.<B|H|W|D> and
   ILVR.<B|H|W|D> MSA instructions, with helpers,
   with tcg and with the mixed approach.
 - Use a tcg register instead of an int variable for
   storing a constant value of the mask (for logic
   operations).
 - Eliminate some unnecessary tcg_gen calls.
 - Changes in commit messages and the cover letter.

v5:
 - Use tcg_gen_deposit function.
 - Add performance numbers for the no-deposit and
   with-deposit cases of ILVEV.W.
 - Minor changes in commit messages and the cover letter.

v4:
 - Clean up typing errors.
 - Change the commit message and the cover letter.
 - Fix bug for ILVEV.D, in case where the destination
   and one of the sources are the same register.

v3:
 - Reduce the number of logic operations to a
   minimum.
 - Add comments.

v2:
 - Minor changes in commit messages and the cover letter.
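
Editorial sketch, not part of the series: the ILVEV.D fix listed
under v4 concerns the case where wd names the same register as a
source. The plain-C model below assumes a hypothetical register
file regs[n][0..1] (low and high 64-bit halves) and shows one way
such a bug can arise, next to the ordering that avoids it. All
names here (regs, set_reg, ilvev_d_naive, ilvev_d_ordered,
run_aliased) are illustrative and do not exist in QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file: regs[n][0] is the low half, [1] the high. */
static uint64_t regs[32][2];

static void set_reg(unsigned n, uint64_t lo, uint64_t hi)
{
    regs[n][0] = lo;
    regs[n][1] = hi;
}

/* ILVEV.D: wd[0] = wt[0], wd[1] = ws[0]. Writing wd[0] first clobbers
 * ws[0] when wd and ws are the same register. */
static void ilvev_d_naive(unsigned wd, unsigned ws, unsigned wt)
{
    regs[wd][0] = regs[wt][0];
    regs[wd][1] = regs[ws][0];   /* stale value if wd == ws */
}

/* Reading ws[0] before overwriting wd[0] is safe for wd == ws and
 * for wd == wt. */
static void ilvev_d_ordered(unsigned wd, unsigned ws, unsigned wt)
{
    regs[wd][1] = regs[ws][0];
    regs[wd][0] = regs[wt][0];
}

/* Run one variant with wd == ws == 1 and wt == 2; return wd's high half. */
static uint64_t run_aliased(int ordered)
{
    set_reg(1, 0x1111, 0x2222);
    set_reg(2, 0x3333, 0x4444);
    if (ordered) {
        ilvev_d_ordered(1, 1, 2);
    } else {
        ilvev_d_naive(1, 1, 2);
    }
    assert(regs[1][0] == 0x3333);   /* low half is right either way */
    return regs[1][1];
}
```

With these inputs the ordered variant leaves 0x1111 (the old
ws[0]) in the high half, while the naive variant leaves 0x3333.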

Mateja Marjanovic (4):
  target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions
  target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  target/mips: Optimize ILVL.<B|H|W|D> MSA instructions
  target/mips: Optimize ILVR.<B|H|W|D> MSA instructions

 target/mips/helper.h     |   7 +-
 target/mips/msa_helper.c |  82 +++++---
 target/mips/translate.c  | 498 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 547 insertions(+), 40 deletions(-)

-- 
2.7.4


* [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions
  2019-04-04 13:14 [Qemu-devel] [PATCH v6 0/4] target/mips: Optimize MSA interleave instructions Mateja Marjanovic
@ 2019-04-04 13:14 ` Mateja Marjanovic
  2019-04-04 13:47   ` Philippe Mathieu-Daudé
  2019-04-13 16:09     ` Aleksandar Markovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-04 13:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, philmd, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize the set of MSA instructions ILVOD.<B|H|W|D> by
operating directly on tcg registers instead of calling
helpers.
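
Editorial sketch, not part of the patch: the byte variant treats
each 64-bit half of the vector as eight packed lanes and moves
all odd lanes with one mask and one shift. The plain-C model
below checks that form against a byte-by-byte reference. The
function names (ilvod_b_half, ilvod_b_ref, ilvod_b_selfcheck)
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ILVOD.B on one 64-bit half: even result bytes come from the odd bytes
 * of wt, odd result bytes from the odd bytes of ws. */
static uint64_t ilvod_b_half(uint64_t ws, uint64_t wt)
{
    const uint64_t odd = 0xff00ff00ff00ff00ULL;
    return ((wt & odd) >> 8) | (ws & odd);
}

/* Byte-by-byte reference: wd.b[2i] = wt.b[2i+1], wd.b[2i+1] = ws.b[2i+1]. */
static uint64_t ilvod_b_ref(uint64_t ws, uint64_t wt)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = (wt >> (8 * (2 * i + 1))) & 0xff;
        uint64_t s = (ws >> (8 * (2 * i + 1))) & 0xff;
        r |= t << (8 * (2 * i));
        r |= s << (8 * (2 * i + 1));
    }
    return r;
}

/* Compare both forms on a few fixed inputs. */
static void ilvod_b_selfcheck(void)
{
    static const uint64_t in[] = {
        0x0000000000000000ULL, 0xffffffffffffffffULL,
        0x0123456789abcdefULL, 0xa7a6a5a4a3a2a1a0ULL,
    };
    for (unsigned i = 0; i < 4; i++) {
        for (unsigned j = 0; j < 4; j++) {
            assert(ilvod_b_half(in[i], in[j]) == ilvod_b_ref(in[i], in[j]));
        }
    }
}
```

The halfword variant follows the same pattern with the mask
0xffff0000ffff0000 and a shift of 16.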

In the following table, the "before" column is the
performance before this patch. The "no-deposit" column is
the performance after converting from helpers to tcg, but
without using the tcg_gen_deposit function. The
"with-deposit" column is the solution implemented in this
patch.

Performance measurement is done by executing the
instructions a large number of times on a computer
with Intel Core i7-3770 CPU @ 3.40GHz×8.

============================================================
|| instr    ||   before    || no-deposit || with-deposit  ||
============================================================
|| ilvod.b  ||  117.50 ms  ||  24.13 ms  ||   23.71 ms    ||
|| ilvod.h  ||   93.16 ms  ||  24.21 ms  ||   23.45 ms    ||
|| ilvod.w  ||  119.90 ms  ||  24.15 ms  ||   22.91 ms    ||
|| ilvod.d  ||   43.01 ms  ||  21.17 ms  ||   20.53 ms    ||
============================================================

The no-deposit and with-deposit columns differ only by
measurement noise in every row except ILVOD.W, which is
the only variant that uses the deposit function.

No-deposit version of the ILVOD.W implementation:

static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
                               uint32_t ws, uint32_t wt)
{
    TCGv_i64 t1 = tcg_temp_new_i64();
    TCGv_i64 t2 = tcg_temp_new_i64();
    TCGv_i64 mask = tcg_const_i64(0xffffffff00000000ULL);

    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
    tcg_gen_shri_i64(t1, t1, 32);
    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);

    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
    tcg_gen_shri_i64(t1, t1, 32);
    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);

    tcg_temp_free_i64(mask);
    tcg_temp_free_i64(t1);
    tcg_temp_free_i64(t2);
}
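
Editorial sketch, not part of the patch: the two ILVOD.W
variants compute the same value for each 64-bit half. The
plain-C model below assumes the usual deposit semantics
(replace len bits of a base value, starting at bit pos, with the
low bits of a field). deposit64_model, ilvod_w_mask and
ilvod_w_deposit are hypothetical names, not TCG or QEMU API.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 64-bit deposit: overwrite `len` bits of `base`, starting at
 * bit `pos`, with the low `len` bits of `field`. */
static uint64_t deposit64_model(uint64_t base, unsigned pos, unsigned len,
                                uint64_t field)
{
    assert(len > 0 && len <= 64 && pos <= 64 - len);
    uint64_t mask = (len == 64 ? ~0ULL : ((1ULL << len) - 1)) << pos;
    return (base & ~mask) | ((field << pos) & mask);
}

/* ILVOD.W on one 64-bit half, mask-and-shift form (no-deposit version). */
static uint64_t ilvod_w_mask(uint64_t ws, uint64_t wt)
{
    const uint64_t mask = 0xffffffff00000000ULL;
    return ((wt & mask) >> 32) | (ws & mask);
}

/* Same result with one shift and one deposit: keep the high word of ws
 * and place the high word of wt in the low 32 bits. */
static uint64_t ilvod_w_deposit(uint64_t ws, uint64_t wt)
{
    return deposit64_model(ws, 0, 32, wt >> 32);
}
```

For ws = 0x1111111122222222 and wt = 0x3333333344444444 both
forms give 0x1111111133333333.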

Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
---
 target/mips/helper.h     |   1 -
 target/mips/msa_helper.c |   7 ----
 target/mips/translate.c  | 106 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 105 insertions(+), 9 deletions(-)

diff --git a/target/mips/helper.h b/target/mips/helper.h
index 2863f60..02e16c7 100644
--- a/target/mips/helper.h
+++ b/target/mips/helper.h
@@ -865,7 +865,6 @@ DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
-DEF_HELPER_5(msa_ilvod_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
index 6c57281..a7ea6aa 100644
--- a/target/mips/msa_helper.c
+++ b/target/mips/msa_helper.c
@@ -1206,13 +1206,6 @@ MSA_FN_DF(ilvr_df)
 MSA_FN_DF(ilvev_df)
 #undef MSA_DO
 
-#define MSA_DO(DF)                          \
-    do {                                    \
-        pwx->DF[2*i]   = pwt->DF[2*i+1];    \
-        pwx->DF[2*i+1] = pws->DF[2*i+1];    \
-    } while (0)
-MSA_FN_DF(ilvod_df)
-#undef MSA_DO
 #undef MSA_LOOP_COND
 
 #define MSA_LOOP_COND(DF) \
diff --git a/target/mips/translate.c b/target/mips/translate.c
index bba8b6c..df685e4 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -28884,6 +28884,95 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
     tcg_temp_free_i32(tws);
 }
 
+/*
+ * [MSA] ILVOD.B wd, ws, wt
+ *
+ *   Vector Interleave Odd (byte data elements)
+ *
+ */
+static inline void gen_ilvod_b(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xff00ff00ff00ff00ULL);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 8);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 8);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(mask);
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVOD.H wd, ws, wt
+ *
+ *   Vector Interleave Odd (halfword data elements)
+ *
+ */
+static inline void gen_ilvod_h(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffff0000ffff0000ULL);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(mask);
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVOD.W wd, ws, wt
+ *
+ *   Vector Interleave Odd (word data elements)
+ *
+ */
+static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+
+    tcg_gen_shri_i64(t1, msa_wr_d[wt * 2], 32);
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[ws * 2], t1, 0, 32);
+
+    tcg_gen_shri_i64(t1, msa_wr_d[wt * 2 + 1], 32);
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1], t1, 0, 32);
+
+    tcg_temp_free_i64(t1);
+}
+
+/*
+ * [MSA] ILVOD.D wd, ws, wt
+ *
+ *   Vector Interleave Odd (doubleword data elements)
+ *
+ */
+static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2 + 1]);
+    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
+}
+
 static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
 {
 #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
@@ -29055,7 +29144,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
         gen_helper_msa_mod_u_df(cpu_env, tdf, twd, tws, twt);
         break;
     case OPC_ILVOD_df:
-        gen_helper_msa_ilvod_df(cpu_env, tdf, twd, tws, twt);
+        switch (df) {
+        case DF_BYTE:
+            gen_ilvod_b(env, wd, ws, wt);
+            break;
+        case DF_HALF:
+            gen_ilvod_h(env, wd, ws, wt);
+            break;
+        case DF_WORD:
+            gen_ilvod_w(env, wd, ws, wt);
+            break;
+        case DF_DOUBLE:
+            gen_ilvod_d(env, wd, ws, wt);
+            break;
+        default:
+            assert(0);
+        }
         break;
 
     case OPC_DOTP_S_df:
-- 
2.7.4


* [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-04 13:14 [Qemu-devel] [PATCH v6 0/4] target/mips: Optimize MSA interleave instructions Mateja Marjanovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
@ 2019-04-04 13:14 ` Mateja Marjanovic
  2019-04-04 13:42   ` Philippe Mathieu-Daudé
  2019-04-13 16:05     ` Aleksandar Markovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 3/4] target/mips: Optimize ILVL.<B|H|W|D> " Mateja Marjanovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> " Mateja Marjanovic
  3 siblings, 2 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-04 13:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, philmd, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize the set of MSA instructions ILVEV.<B|H|W|D> by
operating directly on tcg registers instead of calling
helpers.

In the following table, the "before" column is the
performance before this patch. The "no-deposit" column is
the performance after converting from helpers to tcg, but
without using the tcg_gen_deposit function. The
"with-deposit" column is the solution implemented in this
patch.

Performance measurement is done by executing the
instructions a large number of times on a computer
with Intel Core i7-3770 CPU @ 3.40GHz×8.

============================================================
|| instr    ||   before    || no-deposit ||  with-deposit ||
============================================================
|| ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
|| ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
|| ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
|| ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
============================================================

The no-deposit and with-deposit columns differ only by
measurement noise in every row except ILVEV.W, which is
the only variant that uses the deposit function.

No-deposit version of the ILVEV.W implementation:

static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
                               uint32_t ws, uint32_t wt)
{
    TCGv_i64 t1 = tcg_temp_new_i64();
    TCGv_i64 t2 = tcg_temp_new_i64();
    uint64_t mask = 0x00000000ffffffffULL;

    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
    tcg_gen_shli_i64(t2, t2, 32);
    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);

    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
    tcg_gen_shli_i64(t2, t2, 32);
    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);

    tcg_temp_free_i64(t1);
    tcg_temp_free_i64(t2);
}
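
Editorial sketch, not part of the patch: for ILVEV.W a single
deposit per 64-bit half is enough, because the low word of wt
already sits in place and only the low word of ws has to be
moved up. The plain-C model below checks this against an
element-wise reference. The type vec128 and all function names
are hypothetical stand-ins, not QEMU definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit vector model: d[0] is the low 64-bit half. */
typedef struct { uint64_t d[2]; } vec128;

static vec128 make_vec(uint64_t lo, uint64_t hi)
{
    vec128 v = { { lo, hi } };
    return v;
}

static int vec128_eq(vec128 a, vec128 b)
{
    return a.d[0] == b.d[0] && a.d[1] == b.d[1];
}

/* Read 32-bit element i (0..3), independent of host endianness. */
static uint32_t get_w(const vec128 *v, int i)
{
    assert(i >= 0 && i < 4);
    return (uint32_t)(v->d[i / 2] >> (32 * (i % 2)));
}

/* Write 32-bit element i (0..3). */
static void set_w(vec128 *v, int i, uint32_t x)
{
    assert(i >= 0 && i < 4);
    unsigned sh = 32u * (unsigned)(i % 2);
    v->d[i / 2] = (v->d[i / 2] & ~(0xffffffffULL << sh)) | ((uint64_t)x << sh);
}

/* Reference: wd.w[2i] = wt.w[2i], wd.w[2i+1] = ws.w[2i]. */
static vec128 ilvev_w_ref(vec128 ws, vec128 wt)
{
    vec128 wd = make_vec(0, 0);
    for (int i = 0; i < 2; i++) {
        set_w(&wd, 2 * i, get_w(&wt, 2 * i));
        set_w(&wd, 2 * i + 1, get_w(&ws, 2 * i));
    }
    return wd;
}

/* Deposit form: put the low word of ws into bits 32..63 of wt. */
static vec128 ilvev_w_deposit(vec128 ws, vec128 wt)
{
    vec128 wd;
    for (int h = 0; h < 2; h++) {
        wd.d[h] = (wt.d[h] & 0x00000000ffffffffULL) | (ws.d[h] << 32);
    }
    return wd;
}
```

Because the model passes vectors by value, it does not exercise
the case where wd aliases a source register.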

Suggested-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
---
 target/mips/helper.h     |   1 -
 target/mips/msa_helper.c |   9 -----
 target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 100 insertions(+), 11 deletions(-)

diff --git a/target/mips/helper.h b/target/mips/helper.h
index 02e16c7..82f6a40 100644
--- a/target/mips/helper.h
+++ b/target/mips/helper.h
@@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
-DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
index a7ea6aa..d5c3842 100644
--- a/target/mips/msa_helper.c
+++ b/target/mips/msa_helper.c
@@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
     } while (0)
 MSA_FN_DF(ilvr_df)
 #undef MSA_DO
-
-#define MSA_DO(DF)                      \
-    do {                                \
-        pwx->DF[2*i]   = pwt->DF[2*i];  \
-        pwx->DF[2*i+1] = pws->DF[2*i];  \
-    } while (0)
-MSA_FN_DF(ilvev_df)
-#undef MSA_DO
-
 #undef MSA_LOOP_COND
 
 #define MSA_LOOP_COND(DF) \
diff --git a/target/mips/translate.c b/target/mips/translate.c
index df685e4..3057669 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
     tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
 }
 
+/*
+ * [MSA] ILVEV.B wd, ws, wt
+ *
+ *   Vector Interleave Even (byte data elements)
+ *
+ */
+static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_shli_i64(t2, t2, 8);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t2, t2, 8);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(mask);
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVEV.H wd, ws, wt
+ *
+ *   Vector Interleave Even (halfword data elements)
+ *
+ */
+static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_shli_i64(t2, t2, 16);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t2, t2, 16);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(mask);
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVEV.W wd, ws, wt
+ *
+ *   Vector Interleave Even (word data elements)
+ *
+ */
+static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
+                        msa_wr_d[ws * 2], 32, 32);
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
+                        msa_wr_d[ws * 2 + 1], 32, 32);
+}
+
+/*
+ * [MSA] ILVEV.D wd, ws, wt
+ *
+ *   Vector Interleave Even (doubleword data elements)
+ *
+ */
+static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
+    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
+}
+
 static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
 {
 #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
@@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
         gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
         break;
     case OPC_ILVEV_df:
-        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
+        switch (df) {
+        case DF_BYTE:
+            gen_ilvev_b(env, wd, ws, wt);
+            break;
+        case DF_HALF:
+            gen_ilvev_h(env, wd, ws, wt);
+            break;
+        case DF_WORD:
+            gen_ilvev_w(env, wd, ws, wt);
+            break;
+        case DF_DOUBLE:
+            gen_ilvev_d(env, wd, ws, wt);
+            break;
+        default:
+            assert(0);
+        }
         break;
     case OPC_BINSR_df:
         gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
-- 
2.7.4


* [Qemu-devel] [PATCH v6 3/4] target/mips: Optimize ILVL.<B|H|W|D> MSA instructions
  2019-04-04 13:14 [Qemu-devel] [PATCH v6 0/4] target/mips: Optimize MSA interleave instructions Mateja Marjanovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
@ 2019-04-04 13:14 ` Mateja Marjanovic
  2019-04-13 16:15     ` Aleksandar Markovic
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> " Mateja Marjanovic
  3 siblings, 1 reply; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-04 13:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, philmd, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize the ILVL.<B|H|W|D> instructions using a hybrid
approach. For byte data elements, use a helper with an
unrolled loop, which is faster than the tcg version for
this format (see the table below). For halfword, word and
doubleword data elements, operate directly on tcg
registers.

Performance measurement is done by executing the
instructions a large number of times on a computer
with Intel Core i7-3770 CPU @ 3.40GHz×8.

==================================================
||  instr  ||  helper  ||   tcg    ||  hybrid   ||
==================================================
|| ilvl.b: || 59.91 ms || 74.41 ms ||  59.24 ms || <-- helper
|| ilvl.h: || 41.33 ms || 33.08 ms ||  32.96 ms || <-- tcg
|| ilvl.w: || 30.99 ms || 22.87 ms ||  22.81 ms || <-- tcg
|| ilvl.d: || 26.40 ms || 19.64 ms ||  19.45 ms || <-- tcg
==================================================
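
Editorial sketch, not part of the patch: ILVL reads only the
upper 64-bit half of each source. For halfwords, the tcg version
walks a 16-bit mask across that half and shifts each field into
its destination slot. The plain-C transcription below compares
that walk with an element-wise loop. The type vec_halves and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: low and high 64-bit halves of wd. */
typedef struct { uint64_t lo, hi; } vec_halves;

/* Reference: wd.h[2i] = wt.h[i + 4], wd.h[2i+1] = ws.h[i + 4], i = 0..3,
 * where elements 4..7 live in the upper halves ws_hi and wt_hi. */
static vec_halves ilvl_h_ref(uint64_t ws_hi, uint64_t wt_hi)
{
    uint64_t out[2] = { 0, 0 };
    for (int i = 0; i < 4; i++) {
        uint64_t t = (wt_hi >> (16 * i)) & 0xffff;
        uint64_t s = (ws_hi >> (16 * i)) & 0xffff;
        int slot = 2 * i;                       /* halfword index in wd */
        out[slot / 4]       |= t << (16 * (slot % 4));
        out[(slot + 1) / 4] |= s << (16 * ((slot + 1) % 4));
    }
    vec_halves r = { out[0], out[1] };
    return r;
}

/* Walking-mask form, following the and/shift/or sequence of the patch. */
static vec_halves ilvl_h_walk(uint64_t ws_hi, uint64_t wt_hi)
{
    vec_halves r;
    uint64_t mask = 0x000000000000ffffULL;
    uint64_t t2;

    t2  = wt_hi & mask;
    t2 |= (ws_hi & mask) << 16;
    mask <<= 16;
    t2 |= (wt_hi & mask) << 16;
    r.lo = t2 | ((ws_hi & mask) << 32);

    mask <<= 16;
    t2  = (wt_hi & mask) >> 32;
    t2 |= (ws_hi & mask) >> 16;
    mask <<= 16;
    t2 |= (wt_hi & mask) >> 16;
    r.hi = t2 | (ws_hi & mask);
    return r;
}

/* Compare both forms on a few fixed inputs. */
static void ilvl_h_selfcheck(void)
{
    static const uint64_t in[] = {
        0x0ULL, ~0x0ULL, 0x0123456789abcdefULL, 0xa007a006a005a004ULL,
    };
    for (unsigned i = 0; i < 4; i++) {
        for (unsigned j = 0; j < 4; j++) {
            vec_halves a = ilvl_h_ref(in[i], in[j]);
            vec_halves b = ilvl_h_walk(in[i], in[j]);
            assert(a.lo == b.lo && a.hi == b.hi);
        }
    }
}
```

The byte variant needs twice as many mask steps per half, which
is consistent with the tcg column being slowest for ilvl.b.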

Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
---
 target/mips/helper.h     |   3 +-
 target/mips/msa_helper.c |  33 ++++++---
 target/mips/translate.c  | 184 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 210 insertions(+), 10 deletions(-)

diff --git a/target/mips/helper.h b/target/mips/helper.h
index 82f6a40..cd73723 100644
--- a/target/mips/helper.h
+++ b/target/mips/helper.h
@@ -862,7 +862,6 @@ DEF_HELPER_5(msa_sld_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_splat_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
-DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
@@ -946,6 +945,8 @@ DEF_HELPER_4(msa_insert_h, void, env, i32, i32, i32)
 DEF_HELPER_4(msa_insert_w, void, env, i32, i32, i32)
 DEF_HELPER_4(msa_insert_d, void, env, i32, i32, i32)
 
+DEF_HELPER_4(msa_ilvl_b, void, env, i32, i32, i32)
+
 DEF_HELPER_4(msa_fclass_df, void, env, i32, i32, i32)
 DEF_HELPER_4(msa_ftrunc_s_df, void, env, i32, i32, i32)
 DEF_HELPER_4(msa_ftrunc_u_df, void, env, i32, i32, i32)
diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
index d5c3842..84bbe6f 100644
--- a/target/mips/msa_helper.c
+++ b/target/mips/msa_helper.c
@@ -1184,14 +1184,6 @@ MSA_FN_DF(pckod_df)
 
 #define MSA_DO(DF)                      \
     do {                                \
-        pwx->DF[2*i]   = L##DF(pwt, i); \
-        pwx->DF[2*i+1] = L##DF(pws, i); \
-    } while (0)
-MSA_FN_DF(ilvl_df)
-#undef MSA_DO
-
-#define MSA_DO(DF)                      \
-    do {                                \
         pwx->DF[2*i]   = R##DF(pwt, i); \
         pwx->DF[2*i+1] = R##DF(pws, i); \
     } while (0)
@@ -1232,6 +1224,31 @@ void helper_msa_splati_df(CPUMIPSState *env, uint32_t df, uint32_t wd,
     msa_splat_df(df, pwd, pws, n);
 }
 
+void helper_msa_ilvl_b(CPUMIPSState *env, uint32_t wd,
+                       uint32_t ws, uint32_t wt)
+{
+    wr_t *pwd = &(env->active_fpu.fpr[wd].wr);
+    wr_t *pws = &(env->active_fpu.fpr[ws].wr);
+    wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
+
+    pwd->b[0]  = pwt->b[8];
+    pwd->b[1]  = pws->b[8];
+    pwd->b[2]  = pwt->b[9];
+    pwd->b[3]  = pws->b[9];
+    pwd->b[4]  = pwt->b[10];
+    pwd->b[5]  = pws->b[10];
+    pwd->b[6]  = pwt->b[11];
+    pwd->b[7]  = pws->b[11];
+    pwd->b[8]  = pwt->b[12];
+    pwd->b[9]  = pws->b[12];
+    pwd->b[10] = pwt->b[13];
+    pwd->b[11] = pws->b[13];
+    pwd->b[12] = pwt->b[14];
+    pwd->b[13] = pws->b[14];
+    pwd->b[14] = pwt->b[15];
+    pwd->b[15] = pws->b[15];
+}
+
 void helper_msa_copy_s_b(CPUMIPSState *env, uint32_t rd,
                          uint32_t ws, uint32_t n)
 {
diff --git a/target/mips/translate.c b/target/mips/translate.c
index 3057669..6c6811e 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -28885,6 +28885,173 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
 }
 
 /*
+ * [MSA] ILVL.B wd, ws, wt
+ *
+ *   Vector Interleave Left (byte data elements)
+ *
+ */
+static inline void gen_ilvl_b(CPUMIPSState *env, uint32_t wd,
+                              uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    uint64_t mask = 0x00000000000000ffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 8);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 8;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 8);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 8;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 24);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 8;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 24);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 32);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
+
+    mask <<= 8;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 32);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 24);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 8;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 24);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 8;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 8);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 8;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 8);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVL.H wd, ws, wt
+ *
+ *   Vector Interleave Left (halfword data elements)
+ *
+ */
+static inline void gen_ilvl_h(CPUMIPSState *env, uint32_t wd,
+                              uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    uint64_t mask = 0x000000000000ffffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 16;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 32);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
+
+    mask <<= 16;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 32);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 16;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVL.W wd, ws, wt
+ *
+ *   Vector Interleave Left (word data elements)
+ *
+ */
+static inline void gen_ilvl_w(CPUMIPSState *env, uint32_t wd,
+                              uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    uint64_t mask = 0x00000000ffffffffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t1, t1, 32);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
+
+    mask <<= 32;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 32);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVL.D wd, ws, wt
+ *
+ *   Vector Interleave Left (doubleword data elements)
+ *
+ */
+static inline void gen_ilvl_d(CPUMIPSState *env, uint32_t wd,
+                              uint32_t ws, uint32_t wt)
+{
+    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2 + 1]);
+    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
+}
+
+/*
  * [MSA] ILVOD.B wd, ws, wt
  *
  *   Vector Interleave Odd (byte data elements)
@@ -29177,7 +29344,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
         gen_helper_msa_div_s_df(cpu_env, tdf, twd, tws, twt);
         break;
     case OPC_ILVL_df:
-        gen_helper_msa_ilvl_df(cpu_env, tdf, twd, tws, twt);
+        switch (df) {
+        case DF_BYTE:
+            gen_helper_msa_ilvl_b(cpu_env, twd, tws, twt);
+            break;
+        case DF_HALF:
+            gen_ilvl_h(env, wd, ws, wt);
+            break;
+        case DF_WORD:
+            gen_ilvl_w(env, wd, ws, wt);
+            break;
+        case DF_DOUBLE:
+            gen_ilvl_d(env, wd, ws, wt);
+            break;
+        default:
+            assert(0);
+        }
         break;
     case OPC_BNEG_df:
         gen_helper_msa_bneg_df(cpu_env, tdf, twd, tws, twt);
-- 
2.7.4


* [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> MSA instructions
  2019-04-04 13:14 [Qemu-devel] [PATCH v6 0/4] target/mips: Optimize MSA interleave instructions Mateja Marjanovic
                   ` (2 preceding siblings ...)
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 3/4] target/mips: Optimize ILVL.<B|H|W|D> " Mateja Marjanovic
@ 2019-04-04 13:14 ` Mateja Marjanovic
  2019-04-13 16:05     ` Aleksandar Markovic
  3 siblings, 1 reply; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-04 13:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, philmd, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize the ILVR.<B|H|W|D> instructions using a hybrid
approach. For byte data elements, use a helper with an
unrolled loop, which is faster than the tcg version for
this format (see the table below). For halfword, word and
doubleword data elements, operate directly on tcg
registers.

Performance measurement is done by executing the
instructions a large number of times on a computer
with Intel Core i7-3770 CPU @ 3.40GHz×8.

===================================================
||  instr  ||  helper  ||    tcg    ||   hybrid  ||
===================================================
|| ilvr.b: || 62.87 ms ||  74.76 ms ||  61.52 ms || <-- helper
|| ilvr.h: || 44.11 ms ||  33.00 ms ||  33.55 ms || <-- tcg
|| ilvr.w: || 34.97 ms ||  23.06 ms ||  22.67 ms || <-- tcg
|| ilvr.d: || 27.33 ms ||  19.87 ms ||  20.02 ms || <-- tcg
===================================================
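
Editorial sketch, not part of the patch: an unrolled helper that
writes straight into wd must stay correct when wd is also a
source. For ILVR.B, result byte j is taken from source byte j/2,
so filling wd from byte 15 down to byte 0 reads every source
byte before it can be overwritten. The plain-C model below
checks the in-place case. The type vec_b and the function names
are hypothetical stand-ins for QEMU's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector viewed as 16 bytes. */
typedef struct { uint8_t b[16]; } vec_b;

/* ILVR.B: wd.b[2i] = wt.b[i], wd.b[2i+1] = ws.b[i], i = 0..7.
 * Descending order makes this safe when wd aliases ws or wt. */
static void ilvr_b_model(vec_b *wd, const vec_b *ws, const vec_b *wt)
{
    assert(wd && ws && wt);
    for (int j = 15; j >= 0; j--) {
        const vec_b *src = (j & 1) ? ws : wt;
        wd->b[j] = src->b[j / 2];
    }
}

/* Run in place with wd == ws and count bytes that differ from the
 * expected result computed from untouched copies of the inputs. */
static int ilvr_b_mismatches_in_place(void)
{
    vec_b ws, wt, expect;
    for (int i = 0; i < 16; i++) {
        ws.b[i] = (uint8_t)(0xa0 + i);
        wt.b[i] = (uint8_t)(0xb0 + i);
    }
    for (int j = 0; j < 16; j++) {
        expect.b[j] = (uint8_t)(((j & 1) ? 0xa0 : 0xb0) + j / 2);
    }

    ilvr_b_model(&ws, &ws, &wt);     /* wd aliases ws */

    int bad = 0;
    for (int j = 0; j < 16; j++) {
        bad += (ws.b[j] != expect.b[j]);
    }
    return bad;
}
```

The model indexes bytes directly and says nothing about how the
byte array is laid out on a big-endian host.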

Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
---
 target/mips/helper.h     |   2 +-
 target/mips/msa_helper.c |  33 +++++++++++----
 target/mips/translate.c  | 107 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 132 insertions(+), 10 deletions(-)

diff --git a/target/mips/helper.h b/target/mips/helper.h
index cd73723..d4755ef 100644
--- a/target/mips/helper.h
+++ b/target/mips/helper.h
@@ -862,7 +862,6 @@ DEF_HELPER_5(msa_sld_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_splat_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
-DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
@@ -946,6 +945,7 @@ DEF_HELPER_4(msa_insert_w, void, env, i32, i32, i32)
 DEF_HELPER_4(msa_insert_d, void, env, i32, i32, i32)
 
 DEF_HELPER_4(msa_ilvl_b, void, env, i32, i32, i32)
+DEF_HELPER_4(msa_ilvr_b, void, env, i32, i32, i32)
 
 DEF_HELPER_4(msa_fclass_df, void, env, i32, i32, i32)
 DEF_HELPER_4(msa_ftrunc_s_df, void, env, i32, i32, i32)
diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
index 84bbe6f..2470cef 100644
--- a/target/mips/msa_helper.c
+++ b/target/mips/msa_helper.c
@@ -1181,14 +1181,6 @@ MSA_FN_DF(pckev_df)
     } while (0)
 MSA_FN_DF(pckod_df)
 #undef MSA_DO
-
-#define MSA_DO(DF)                      \
-    do {                                \
-        pwx->DF[2*i]   = R##DF(pwt, i); \
-        pwx->DF[2*i+1] = R##DF(pws, i); \
-    } while (0)
-MSA_FN_DF(ilvr_df)
-#undef MSA_DO
 #undef MSA_LOOP_COND
 
 #define MSA_LOOP_COND(DF) \
@@ -1249,6 +1241,31 @@ void helper_msa_ilvl_b(CPUMIPSState *env, uint32_t wd,
     pwd->b[15] = pws->b[15];
 }
 
+void helper_msa_ilvr_b(CPUMIPSState *env, uint32_t wd,
+                       uint32_t ws, uint32_t wt)
+{
+    wr_t *pwd = &(env->active_fpu.fpr[wd].wr);
+    wr_t *pws = &(env->active_fpu.fpr[ws].wr);
+    wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
+
+    pwd->b[15] = pws->b[7];
+    pwd->b[14] = pwt->b[7];
+    pwd->b[13] = pws->b[6];
+    pwd->b[12] = pwt->b[6];
+    pwd->b[11] = pws->b[5];
+    pwd->b[10] = pwt->b[5];
+    pwd->b[9]  = pws->b[4];
+    pwd->b[8]  = pwt->b[4];
+    pwd->b[7]  = pws->b[3];
+    pwd->b[6]  = pwt->b[3];
+    pwd->b[5]  = pws->b[2];
+    pwd->b[4]  = pwt->b[2];
+    pwd->b[3]  = pws->b[1];
+    pwd->b[2]  = pwt->b[1];
+    pwd->b[1]  = pws->b[0];
+    pwd->b[0]  = pwt->b[0];
+}
+
 void helper_msa_copy_s_b(CPUMIPSState *env, uint32_t rd,
                          uint32_t ws, uint32_t n)
 {
diff --git a/target/mips/translate.c b/target/mips/translate.c
index 6c6811e..90332fb 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -28885,6 +28885,96 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
 }
 
 /*
+ * [MSA] ILVR.H wd, ws, wt
+ *
+ *   Vector Interleave Right (halfword data elements)
+ *
+ */
+static inline void gen_ilvr_h(CPUMIPSState *env, uint32_t wd,
+                              uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    uint64_t mask = 0x000000000000ffffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
+    tcg_gen_shli_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 16;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shli_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
+    tcg_gen_shli_i64(t1, t1, 32);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
+
+    mask <<= 16;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 32);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+
+    mask <<= 16;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_or_i64(t2, t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVR.W wd, ws, wt
+ *
+ *   Vector Interleave Right (word data elements)
+ *
+ */
+static inline void gen_ilvr_w(CPUMIPSState *env, uint32_t wd,
+                              uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    uint64_t mask = 0x00000000ffffffffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
+    tcg_gen_shli_i64(t1, t1, 32);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
+
+    mask <<= 32;
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 32);
+    tcg_gen_mov_i64(t2, t1);
+    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVR.D wd, ws, wt
+ *
+ *   Vector Interleave Right (doubleword data elements)
+ *
+ */
+static inline void gen_ilvr_d(CPUMIPSState *env, uint32_t wd,
+                              uint32_t ws, uint32_t wt)
+{
+    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
+    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
+}
+
+
+/*
  * [MSA] ILVL.B wd, ws, wt
  *
  *   Vector Interleave Left (byte data elements)
@@ -29380,7 +29470,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
         gen_helper_msa_div_u_df(cpu_env, tdf, twd, tws, twt);
         break;
     case OPC_ILVR_df:
-        gen_helper_msa_ilvr_df(cpu_env, tdf, twd, tws, twt);
+        switch (df) {
+        case DF_BYTE:
+            gen_helper_msa_ilvr_b(cpu_env, twd, tws, twt);
+            break;
+        case DF_HALF:
+            gen_ilvr_h(env, wd, ws, wt);
+            break;
+        case DF_WORD:
+            gen_ilvr_w(env, wd, ws, wt);
+            break;
+        case DF_DOUBLE:
+            gen_ilvr_d(env, wd, ws, wt);
+            break;
+        default:
+            assert(0);
+        }
         break;
     case OPC_BINSL_df:
         gen_helper_msa_binsl_df(cpu_env, tdf, twd, tws, twt);
-- 
2.7.4
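
A note on register aliasing for the halfword and word variants above:
they write msa_wr_d[wd * 2] before the upper doubleword is computed from
msa_wr_d[ws * 2] and msa_wr_d[wt * 2], so the case wd == ws or wd == wt
may deserve a test. The plain-C model below is only a sketch, under the
assumption that each TCG op reads the register value that is current at
that point in the op stream. The type vreg and the ilvr_w_* functions are
hypothetical names for illustration, not QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for one 128-bit MSA register: d[0] models
 * msa_wr_d[n * 2] (low doubleword), d[1] models msa_wr_d[n * 2 + 1].
 */
typedef struct { uint64_t d[2]; } vreg;

#define LO32 0x00000000ffffffffULL
#define HI32 0xffffffff00000000ULL

/* Reference: operates on copies, so the sources cannot be clobbered. */
static vreg ilvr_w_ref(vreg s, vreg t)
{
    vreg r;
    r.d[0] = (t.d[0] & LO32) | ((s.d[0] & LO32) << 32);
    r.d[1] = ((t.d[0] & HI32) >> 32) | (s.d[0] & HI32);
    return r;
}

/* In place, low doubleword first: same write order as gen_ilvr_w(). */
static void ilvr_w_low_first(vreg *wd, const vreg *ws, const vreg *wt)
{
    wd->d[0] = (wt->d[0] & LO32) | ((ws->d[0] & LO32) << 32);
    wd->d[1] = ((wt->d[0] & HI32) >> 32) | (ws->d[0] & HI32);
}

/* In place, high doubleword first: d[0] is still intact when it is read. */
static void ilvr_w_high_first(vreg *wd, const vreg *ws, const vreg *wt)
{
    wd->d[1] = ((wt->d[0] & HI32) >> 32) | (ws->d[0] & HI32);
    wd->d[0] = (wt->d[0] & LO32) | ((ws->d[0] & LO32) << 32);
}

/* Without aliasing, both write orders agree with the reference. */
static void ilvr_w_selfcheck(void)
{
    vreg s = { { 0x1111111122222222ULL, 0 } };
    vreg t = { { 0x3333333344444444ULL, 0 } };
    vreg r = ilvr_w_ref(s, t);
    vreg a = { { 0, 0 } };
    vreg b = { { 0, 0 } };

    ilvr_w_low_first(&a, &s, &t);
    ilvr_w_high_first(&b, &s, &t);
    assert(a.d[0] == r.d[0] && a.d[1] == r.d[1]);
    assert(b.d[0] == r.d[0] && b.d[1] == r.d[1]);
}
```

In this model, when wd aliases wt, the low-first order produces an upper
doubleword that differs from the reference, while the high-first order
still matches it.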


* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
@ 2019-04-04 13:42   ` Philippe Mathieu-Daudé
  2019-04-04 18:19     ` Aleksandar Markovic
  2019-04-17 12:45       ` Mateja Marjanovic
  2019-04-13 16:05     ` Aleksandar Markovic
  1 sibling, 2 replies; 29+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-04-04 13:42 UTC (permalink / raw)
  To: Mateja Marjanovic, qemu-devel
  Cc: aurelien, richard.henderson, amarkovic, arikalo

Hi Mateja,

On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> 
> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> directly tcg registers and performing logic on them
> instead of using helpers.
> 
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.
> 
> Performance measurement is done by executing the
> instructions a large number of times on a computer
> with Intel Core i7-3770 CPU @ 3.40GHz×8.
> 
> ============================================================
> || instr    ||   before    || no-deposit ||  with-deposit ||
> ============================================================
> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||

I'm quite surprised there is not a single change here since your v5. Are
you sure you used the correct results? I was expecting a slight
improvement.

> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> ============================================================
> 
> No-deposit column and with-deposit column have the
> same statistical values in every row, except ILVEV.W,
> which is the only function which uses the deposit
> function.
> 
> No-deposit version of the ILVEV.W implementation:
> 
> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     TCGv_i64 t1 = tcg_temp_new_i64();
>     TCGv_i64 t2 = tcg_temp_new_i64();
>     uint64_t mask = 0x00000000ffffffffULL;
> 
>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>     tcg_gen_shli_i64(t2, t2, 32);
>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> 
>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>     tcg_gen_shli_i64(t2, t2, 32);
>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> 
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(t2);
> }
> 
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   9 -----
>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 100 insertions(+), 11 deletions(-)
> 
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 02e16c7..82f6a40 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index a7ea6aa..d5c3842 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>      } while (0)
>  MSA_FN_DF(ilvr_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> -    } while (0)
> -MSA_FN_DF(ilvev_df)
> -#undef MSA_DO
> -
>  #undef MSA_LOOP_COND
>  
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index df685e4..3057669 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>  }
>  
> +/*
> + * [MSA] ILVEV.B wd, ws, wt
> + *
> + *   Vector Interleave Even (byte data elements)
> + *
> + */
> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVEV.H wd, ws, wt
> + *
> + *   Vector Interleave Even (halfword data elements)
> + *
> + */
> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}

Apparently you missed my comment about refactoring using mask/shift as
arguments:

static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
                                uint32_t ws, uint32_t wt,
                                int64_t mask, int64_t shift)
{
    TCGv_i64 t1 = tcg_temp_new_i64();
    TCGv_i64 t2 = tcg_temp_new_i64();
    TCGv_i64 tm = tcg_const_i64(mask);

    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
    tcg_gen_shli_i64(t2, t2, shift);
    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);

    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
    tcg_gen_shli_i64(t2, t2, shift);
    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);

    tcg_temp_free_i64(tm);
    tcg_temp_free_i64(t1);
    tcg_temp_free_i64(t2);
}

static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
                               uint32_t ws, uint32_t wt)
{
    gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
}

static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
                               uint32_t ws, uint32_t wt)
{
    gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
}
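
To sanity-check that the shared mask/shift form matches the element-wise
definition from the removed helper, the per-doubleword arithmetic can be
modelled in plain C. This is only a sketch: ilvev_half() and ilvev_ref()
are hypothetical names that work on bare uint64_t values rather than TCG
temporaries, and element 0 is assumed to sit in the least significant bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mask/shift form for one 64-bit half of the vector: the even elements
 * of wt stay in place, the even elements of ws move up into the odd slots.
 */
static uint64_t ilvev_half(uint64_t ws, uint64_t wt,
                           uint64_t mask, unsigned shift)
{
    return (wt & mask) | ((ws & mask) << shift);
}

/* Element-wise reference for elements of 'bits' width (8 or 16). */
static uint64_t ilvev_ref(uint64_t ws, uint64_t wt, unsigned bits)
{
    const uint64_t emask = (1ULL << bits) - 1;
    uint64_t d = 0;

    for (unsigned pos = 0; pos < 64; pos += 2 * bits) {
        d |= ((wt >> pos) & emask) << pos;          /* d[2i]   = t[2i] */
        d |= ((ws >> pos) & emask) << (pos + bits); /* d[2i+1] = s[2i] */
    }
    return d;
}

/* Compare both element sizes against the reference for one input pair. */
static void ilvev_selfcheck(uint64_t ws, uint64_t wt)
{
    assert(ilvev_half(ws, wt, 0x00ff00ff00ff00ffULL, 8)
           == ilvev_ref(ws, wt, 8));
    assert(ilvev_half(ws, wt, 0x0000ffff0000ffffULL, 16)
           == ilvev_ref(ws, wt, 16));
}
```

Only the mask and the shift differ between the byte and halfword cases,
which is what makes a single parameterized generator possible.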


> +
> +/*
> + * [MSA] ILVEV.W wd, ws, wt
> + *
> + *   Vector Interleave Even (word data elements)
> + *
> + */
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> +                        msa_wr_d[ws * 2], 32, 32);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> +                        msa_wr_d[ws * 2 + 1], 32, 32);
> +}
> +
> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Doubleword data elements)
> + *
> + */
> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVEV_df:
> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvev_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvev_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvev_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvev_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSR_df:
>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> 


* Re: [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions
  2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
@ 2019-04-04 13:47   ` Philippe Mathieu-Daudé
  2019-04-13 16:09     ` Aleksandar Markovic
  1 sibling, 0 replies; 29+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-04-04 13:47 UTC (permalink / raw)
  To: Mateja Marjanovic, qemu-devel
  Cc: aurelien, richard.henderson, amarkovic, arikalo

On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> 
> Optimize set of MSA instructions ILVOD.<B|H|W|D>, using
> directly tcg registers and performing logic on them instead
> of using helpers.
> 
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.
> 
> Performance measurement is done by executing the
> instructions a large number of times on a computer
> with Intel Core i7-3770 CPU @ 3.40GHz×8.
> 
> ============================================================
> || instr    ||   before    || no-deposit || with-deposit  ||
> ============================================================
> || ilvod.b  ||  117.50 ms  ||  24.13 ms  ||   23.71 ms    ||
> || ilvod.h  ||   93.16 ms  ||  24.21 ms  ||   23.45 ms    ||
> || ilvod.w  ||  119.90 ms  ||  24.15 ms  ||   22.91 ms    ||
> || ilvod.d  ||   43.01 ms  ||  21.17 ms  ||   20.53 ms    ||
> ============================================================
> 
> No-deposit column and with-deposit column have the
> same statistical values in every row, except ILVOD.W,
> which is the only function which uses the deposit
> function.
> 
> No-deposit version of the ILVOD.W implementation:
> 
> static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     TCGv_i64 t1 = tcg_temp_new_i64();
>     TCGv_i64 t2 = tcg_temp_new_i64();
>     TCGv_i64 mask = tcg_const_i64(0xffffffff00000000ULL);
> 
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>     tcg_gen_shri_i64(t1, t1, 32);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> 
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>     tcg_gen_shri_i64(t1, t1, 32);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> 
>     tcg_temp_free_i64(mask);
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(t2);
> }
> 
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   7 ----
>  target/mips/translate.c  | 106 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 105 insertions(+), 9 deletions(-)
> 
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 2863f60..02e16c7 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -865,7 +865,6 @@ DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index 6c57281..a7ea6aa 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1206,13 +1206,6 @@ MSA_FN_DF(ilvr_df)
>  MSA_FN_DF(ilvev_df)
>  #undef MSA_DO
>  
> -#define MSA_DO(DF)                          \
> -    do {                                    \
> -        pwx->DF[2*i]   = pwt->DF[2*i+1];    \
> -        pwx->DF[2*i+1] = pws->DF[2*i+1];    \
> -    } while (0)
> -MSA_FN_DF(ilvod_df)
> -#undef MSA_DO
>  #undef MSA_LOOP_COND
>  
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index bba8b6c..df685e4 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28884,6 +28884,95 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
>      tcg_temp_free_i32(tws);
>  }
>  
> +/*
> + * [MSA] ILVOD.B wd, ws, wt
> + *
> + *   Vector Interleave Odd (byte data elements)
> + *
> + */
> +static inline void gen_ilvod_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0xff00ff00ff00ff00ULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 8);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 8);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVOD.H wd, ws, wt
> + *
> + *   Vector Interleave Odd (halfword data elements)
> + *
> + */
> +static inline void gen_ilvod_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0xffff0000ffff0000ULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}

The same comment as on patch #2 of this series applies here: refactoring
the b/h cases would ease code maintenance.
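
For the odd-element case the shared form would shift wt down instead of
shifting ws up. A minimal plain-C model of one 64-bit half is sketched
below. The names ilvod_half() and ilvod_ref() are hypothetical, the
functions use bare uint64_t values instead of TCG temporaries, and
element 0 is assumed to sit in the least significant bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mask/shift form for one 64-bit half of the vector: the odd elements
 * of wt move down into the even slots, the odd elements of ws stay put.
 */
static uint64_t ilvod_half(uint64_t ws, uint64_t wt,
                           uint64_t mask, unsigned shift)
{
    return ((wt & mask) >> shift) | (ws & mask);
}

/* Element-wise reference for elements of 'bits' width (8 or 16). */
static uint64_t ilvod_ref(uint64_t ws, uint64_t wt, unsigned bits)
{
    const uint64_t emask = (1ULL << bits) - 1;
    uint64_t d = 0;

    for (unsigned pos = 0; pos < 64; pos += 2 * bits) {
        /* d[2i] = t[2i+1] */
        d |= ((wt >> (pos + bits)) & emask) << pos;
        /* d[2i+1] = s[2i+1] */
        d |= ((ws >> (pos + bits)) & emask) << (pos + bits);
    }
    return d;
}

/* Compare both element sizes against the reference for one input pair. */
static void ilvod_selfcheck(uint64_t ws, uint64_t wt)
{
    assert(ilvod_half(ws, wt, 0xff00ff00ff00ff00ULL, 8)
           == ilvod_ref(ws, wt, 8));
    assert(ilvod_half(ws, wt, 0xffff0000ffff0000ULL, 16)
           == ilvod_ref(ws, wt, 16));
}
```

As with the even-element variant, the byte and halfword cases differ only
in the mask and shift values passed in.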

> +
> +/*
> + * [MSA] ILVOD.W wd, ws, wt
> + *
> + *   Vector Interleave Odd (word data elements)
> + *
> + */
> +static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +
> +    tcg_gen_shri_i64(t1, msa_wr_d[wt * 2], 32);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[ws * 2], t1, 0, 32);
> +
> +    tcg_gen_shri_i64(t1, msa_wr_d[wt * 2 + 1], 32);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1], t1, 0, 32);
> +
> +    tcg_temp_free_i64(t1);
> +}
> +
> +/*
> + * [MSA] ILVOD.D wd, ws, wt
> + *
> + *   Vector Interleave Odd (doubleword data elements)
> + *
> + */
> +static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2 + 1]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29055,7 +29144,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_mod_u_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVOD_df:
> -        gen_helper_msa_ilvod_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvod_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvod_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvod_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvod_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>  
>      case OPC_DOTP_S_df:
> 


* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-04 13:42   ` Philippe Mathieu-Daudé
@ 2019-04-04 18:19     ` Aleksandar Markovic
  2019-04-04 19:17       ` Philippe Mathieu-Daudé
  2019-04-17 12:45       ` Mateja Marjanovic
  1 sibling, 1 reply; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-04 18:19 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Mateja Marjanovic, qemu-devel
  Cc: aurelien, richard.henderson, Aleksandar Rikalo

> From: Philippe Mathieu-Daudé <philmd@redhat.com>
> Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
> 
> Hi Mateja,
> 
> On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> > From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> >
> > Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> > directly tcg registers and performing logic on them
> > instead of using helpers.
> >
> > In the following table, the first column is the performance
> > before this patch. The second represents the performance,
> > after converting from helpers to tcg, but without using
> > tcg_gen_deposit function. The third one is the solution
> > which is implemented in this patch.
> >
> > Performance measurement is done by executing the
> > instructions a large number of times on a computer
> > with Intel Core i7-3770 CPU @ 3.40GHz×8.
> >
> > ============================================================
> > || instr    ||   before    || no-deposit ||  with-deposit ||
> > ============================================================
> > || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> > || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> 
> I'm quite surprised there is not a single change here since your v5. Are
> you sure you used the correct results? I was expecting a slight
> improvement.
> 

Hello, Philippe.

First of all, thank you so much for taking the time to provide
Mateja with the source code below.

As for your idea, Mateja told me he DID implement it and
measured it, and that the improvement is noticeable,
but really small. I don't know why he did not include that change
- perhaps he didn't have enough time to integrate it.

I know he is on a long weekend now, so we will have to wait until
next week for Mateja to explain this to us. Mateja, could you
perhaps add a column "with-deposit-and-mask-as-tcg-constant"?

Yours,
Aleksandar


> > || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> > || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> > ============================================================
> >
> > No-deposit column and with-deposit column have the
> > same statistical values in every row, except ILVEV.W,
> > which is the only function which uses the deposit
> > function.
> >
> > No-deposit version of the ILVEV.W implementation:
> >
> > static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >                                uint32_t ws, uint32_t wt)
> > {
> >     TCGv_i64 t1 = tcg_temp_new_i64();
> >     TCGv_i64 t2 = tcg_temp_new_i64();
> >     uint64_t mask = 0x00000000ffffffffULL;
> >
> >     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> >     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> >     tcg_gen_shli_i64(t2, t2, 32);
> >     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >
> >     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >     tcg_gen_shli_i64(t2, t2, 32);
> >     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >
> >     tcg_temp_free_i64(t1);
> >     tcg_temp_free_i64(t2);
> > }
> >
> > Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> > Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> > ---
> >  target/mips/helper.h     |   1 -
> >  target/mips/msa_helper.c |   9 -----
> >  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 100 insertions(+), 11 deletions(-)
> >
> > diff --git a/target/mips/helper.h b/target/mips/helper.h
> > index 02e16c7..82f6a40 100644
> > --- a/target/mips/helper.h
> > +++ b/target/mips/helper.h
> > @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> > -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> > diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> > index a7ea6aa..d5c3842 100644
> > --- a/target/mips/msa_helper.c
> > +++ b/target/mips/msa_helper.c
> > @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
> >      } while (0)
> >  MSA_FN_DF(ilvr_df)
> >  #undef MSA_DO
> > -
> > -#define MSA_DO(DF)                      \
> > -    do {                                \
> > -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> > -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> > -    } while (0)
> > -MSA_FN_DF(ilvev_df)
> > -#undef MSA_DO
> > -
> >  #undef MSA_LOOP_COND
> >
> >  #define MSA_LOOP_COND(DF) \
> > diff --git a/target/mips/translate.c b/target/mips/translate.c
> > index df685e4..3057669 100644
> > --- a/target/mips/translate.c
> > +++ b/target/mips/translate.c
> > @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> >      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> >  }
> >
> > +/*
> > + * [MSA] ILVEV.B wd, ws, wt
> > + *
> > + *   Vector Interleave Even (byte data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    TCGv_i64 t2 = tcg_temp_new_i64();
> > +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> > +
> > +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> > +    tcg_gen_shli_i64(t2, t2, 8);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> > +
> > +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> > +    tcg_gen_shli_i64(t2, t2, 8);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> > +
> > +    tcg_temp_free_i64(mask);
> > +    tcg_temp_free_i64(t1);
> > +    tcg_temp_free_i64(t2);
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.H wd, ws, wt
> > + *
> > + *   Vector Interleave Even (halfword data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    TCGv_i64 t2 = tcg_temp_new_i64();
> > +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> > +
> > +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> > +    tcg_gen_shli_i64(t2, t2, 16);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> > +
> > +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> > +    tcg_gen_shli_i64(t2, t2, 16);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> > +
> > +    tcg_temp_free_i64(mask);
> > +    tcg_temp_free_i64(t1);
> > +    tcg_temp_free_i64(t2);
> > +}
> 
> Apparently you missed my comment about refactoring using mask/shift as
> arguments:
> 
> static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
>                                 uint32_t ws, uint32_t wt,
>                                 int64_t mask, int64_t shift)
> {
>     TCGv_i64 t1 = tcg_temp_new_i64();
>     TCGv_i64 t2 = tcg_temp_new_i64();
>     TCGv_i64 tm = tcg_const_i64(mask);
> 
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
>     tcg_gen_shli_i64(t2, t2, shift);
>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> 
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
>     tcg_gen_shli_i64(t2, t2, shift);
>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> 
>     tcg_temp_free_i64(tm);
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(t2);
> }
> 
> static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
> }
> 
> static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
> }
> 
> 
> > +
> > +/*
> > + * [MSA] ILVEV.W wd, ws, wt
> > + *
> > + *   Vector Interleave Even (word data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> > +                        msa_wr_d[ws * 2], 32, 32);
> > +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> > +                        msa_wr_d[ws * 2 + 1], 32, 32);
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.D wd, ws, wt
> > + *
> > + *   Vector Interleave Even (Doubleword data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> > +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> > +}
> > +
> >  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >  {
> >  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> > @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
> >          break;
> >      case OPC_ILVEV_df:
> > -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> > +        switch (df) {
> > +        case DF_BYTE:
> > +            gen_ilvev_b(env, wd, ws, wt);
> > +            break;
> > +        case DF_HALF:
> > +            gen_ilvev_h(env, wd, ws, wt);
> > +            break;
> > +        case DF_WORD:
> > +            gen_ilvev_w(env, wd, ws, wt);
> > +            break;
> > +        case DF_DOUBLE:
> > +            gen_ilvev_d(env, wd, ws, wt);
> > +            break;
> > +        default:
> > +            assert(0);
> > +        }
> >          break;
> >      case OPC_BINSR_df:
> >          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> >
> 
________________________________________
From: Philippe Mathieu-Daudé <philmd@redhat.com>
Sent: Thursday, April 4, 2019 3:42 PM
To: Mateja Marjanovic; qemu-devel@nongnu.org
Cc: aurelien@aurel32.net; richard.henderson@linaro.org; Aleksandar Markovic; Aleksandar Rikalo
Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions

Hi Mateja,

On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> directly tcg registers and performing logic on them
> instead of using helpers.
>
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.
>
> Performance measurement is done by executing the
> instructions a large number of times on a computer
> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>
> ============================================================
> || instr    ||   before    || no-deposit ||  with-deposit ||
> ============================================================
> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||

I'm quite surprised there is not a single change here since your v5. Are
you sure you used the correct results? I was expecting a slight improvement.

> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> ============================================================
>
> No-deposit column and with-deposit column have the
> same statistical values in every row, except ILVEV.W,
> which is the only function which uses the deposit
> function.
>
> No-deposit version of the ILVEV.W implementation:
>
> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     TCGv_i64 t1 = tcg_temp_new_i64();
>     TCGv_i64 t2 = tcg_temp_new_i64();
>     uint64_t mask = 0x00000000ffffffffULL;
>
>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>     tcg_gen_shli_i64(t2, t2, 32);
>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>
>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>     tcg_gen_shli_i64(t2, t2, 32);
>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(t2);
> }
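
For reference, a plain-C sketch of why the two ILVEV.W variants should
produce the same value on each 64-bit half. This is only a host-side
model, not code from the patch: deposit64_model() is a hypothetical
helper written here to mimic "replace bits [ofs, ofs+len) of the first
operand with the low bits of the second", and the two ilvev_w_*()
functions are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical host-side model of a 64-bit deposit: replace bits
 * [ofs, ofs + len) of 'base' with the low 'len' bits of 'field'.
 */
static uint64_t deposit64_model(uint64_t base, uint64_t field,
                                unsigned ofs, unsigned len)
{
    assert(len > 0 && len <= 64 && ofs <= 64 - len);
    uint64_t mask = (len == 64 ? ~0ULL : ((1ULL << len) - 1)) << ofs;
    return (base & ~mask) | ((field << ofs) & mask);
}

/* ILVEV.W on one 64-bit half, and/shift/or form (the no-deposit variant). */
static uint64_t ilvev_w_no_deposit(uint64_t ws, uint64_t wt)
{
    uint64_t mask = 0x00000000ffffffffULL;
    return (wt & mask) | ((ws & mask) << 32);
}

/* ILVEV.W on one 64-bit half, deposit form: keep wt's low word,
 * place ws's low word in the high word. */
static uint64_t ilvev_w_deposit(uint64_t ws, uint64_t wt)
{
    return deposit64_model(wt, ws, 32, 32);
}
```

With ws = 0x1111111122222222 and wt = 0x3333333344444444 both forms
give 0x2222222244444444, so any difference between the two columns
comes from the number of generated ops, not from the result.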
>
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   9 -----
>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 100 insertions(+), 11 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 02e16c7..82f6a40 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index a7ea6aa..d5c3842 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>      } while (0)
>  MSA_FN_DF(ilvr_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> -    } while (0)
> -MSA_FN_DF(ilvev_df)
> -#undef MSA_DO
> -
>  #undef MSA_LOOP_COND
>
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index df685e4..3057669 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>  }
>
> +/*
> + * [MSA] ILVEV.B wd, ws, wt
> + *
> + *   Vector Interleave Even (byte data elements)
> + *
> + */
> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVEV.H wd, ws, wt
> + *
> + *   Vector Interleave Even (halfword data elements)
> + *
> + */
> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}

Apparently you missed my comment about refactoring using mask/shift as
arguments:

static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
                                uint32_t ws, uint32_t wt,
                                int64_t mask, int64_t shift)
{
    TCGv_i64 t1 = tcg_temp_new_i64();
    TCGv_i64 t2 = tcg_temp_new_i64();
    TCGv_i64 tm = tcg_const_i64(mask);

    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
    tcg_gen_shli_i64(t2, t2, shift);
    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);

    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
    tcg_gen_shli_i64(t2, t2, shift);
    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);

    tcg_temp_free_i64(tm);
    tcg_temp_free_i64(t1);
    tcg_temp_free_i64(t2);
}

static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
                               uint32_t ws, uint32_t wt)
{
    gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
}

static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
                               uint32_t ws, uint32_t wt)
{
    gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
}
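
To double-check that one mask/shift pair per data format is enough,
here is a small host-side sketch (plain C, no TCG) comparing the
mask/shift form against an element-wise reference on one 64-bit half.
The function names are made up for this illustration, and the model
assumes element 0 sits in the least significant bits of the half.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Element-wise reference for ILVEV on one 64-bit half with elements of
 * 'bits' width (8 or 16): even result slots take wt's even elements,
 * odd result slots take ws's even elements.
 */
static uint64_t ilvev_ref(uint64_t ws, uint64_t wt, unsigned bits)
{
    assert(bits == 8 || bits == 16);
    uint64_t elem_mask = (1ULL << bits) - 1;
    uint64_t out = 0;

    for (unsigned i = 0; i < 64 / bits; i += 2) {
        uint64_t t = (wt >> (i * bits)) & elem_mask;
        uint64_t s = (ws >> (i * bits)) & elem_mask;
        out |= t << (i * bits);
        out |= s << ((i + 1) * bits);
    }
    return out;
}

/* Mask/shift form, with the mask and shift passed as arguments. */
static uint64_t ilvev_mask_shift(uint64_t ws, uint64_t wt,
                                 uint64_t mask, unsigned shift)
{
    return (wt & mask) | ((ws & mask) << shift);
}
```

Calling ilvev_mask_shift() with (0x00ff00ff00ff00ff, 8) should match
ilvev_ref(..., 8), and with (0x0000ffff0000ffff, 16) it should match
ilvev_ref(..., 16).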


> +
> +/*
> + * [MSA] ILVEV.W wd, ws, wt
> + *
> + *   Vector Interleave Even (word data elements)
> + *
> + */
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> +                        msa_wr_d[ws * 2], 32, 32);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> +                        msa_wr_d[ws * 2 + 1], 32, 32);
> +}
> +
> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Doubleword data elements)
> + *
> + */
> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
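
A note on the order of the two moves above, since the cover letter
mentions a v4 fix for the case where the destination is also a source.
The sketch below is a toy model in plain C (wr_model_t and both
functions are hypothetical names, not QEMU code). It shows that writing
the upper half first is safe when wd == ws, while the swapped order
would read an already overwritten value. Whether this is exactly the
bug fixed in v4 is my assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file entry: one MSA register as two 64-bit halves. */
typedef struct {
    uint64_t d[2];
} wr_model_t;

/* Order used in the patch: write the upper half first. */
static void ilvev_d_model(wr_model_t *regs, unsigned wd,
                          unsigned ws, unsigned wt)
{
    assert(wd < 32 && ws < 32 && wt < 32);
    regs[wd].d[1] = regs[ws].d[0];
    regs[wd].d[0] = regs[wt].d[0];
}

/* Swapped order, for comparison only: wrong when wd == ws. */
static void ilvev_d_swapped(wr_model_t *regs, unsigned wd,
                            unsigned ws, unsigned wt)
{
    assert(wd < 32 && ws < 32 && wt < 32);
    regs[wd].d[0] = regs[wt].d[0];
    regs[wd].d[1] = regs[ws].d[0]; /* ws.d[0] is clobbered if wd == ws */
}
```

With wd == wt the second move in ilvev_d_model() copies a half onto
itself, so that case is fine as well.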
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVEV_df:
> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvev_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvev_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvev_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvev_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSR_df:
>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-04 18:19     ` Aleksandar Markovic
@ 2019-04-04 19:17       ` Philippe Mathieu-Daudé
  2019-04-05  0:26           ` Aleksandar Markovic
  0 siblings, 1 reply; 29+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-04-04 19:17 UTC (permalink / raw)
  To: Aleksandar Markovic, Mateja Marjanovic, qemu-devel
  Cc: aurelien, richard.henderson, Aleksandar Rikalo

Hi Aleksandar,

On 4/4/19 8:19 PM, Aleksandar Markovic wrote:
>> From: Philippe Mathieu-Daudé <philmd@redhat.com>
>> Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
>>
>> Hi Mateja,
>>
>> On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
>>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>>
>>> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
>>> directly tcg registers and performing logic on them
>>> instead of using helpers.
>>>
>>> In the following table, the first column is the performance
>>> before this patch. The second represents the performance,
>>> after converting from helpers to tcg, but without using
>>> tcg_gen_deposit function. The third one is the solution
>>> which is implemented in this patch.
>>>
>>> Performance measurement is done by executing the
>>> instructions a large number of times on a computer
>>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>>
>>> ============================================================
>>> || instr    ||   before    || no-deposit ||  with-deposit ||
>>> ============================================================
>>> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
>>> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
>>
>> I'm quite surprised there is not a single change here since your v5. Are
>> you sure you used the correct results? I was expecting a slight improvement.
>>
> 
> Hello, Philippe.
> 
> First of all, thank you so much for taking the time to provide
> Mateja with the source code below.
> 
> Regarding your idea, Mateja told me he DID implement it,
> and did the measurements, and that the improvement is noticeable,
> but really small. I don't know why he did not include that change
> - perhaps he didn't have enough time to integrate it.
> 
> I know he is on a long weekend now, so we will have to wait
> until next week for Mateja to explain this to us. Mateja, could you
> perhaps add a column "with-deposit-and-mask-as-tcg-constant"?

Ah OK, no worries, since this series will go into the 4.1 dev cycle and
there are still 2 or 3 weeks to go.
I am simply curious to see the difference using the register approach :)
I recently noticed I can run those tests myself (I previously mixed up
threads and thought Mateja was running this on an FPGA).

I try to take a quick look at your work, and I'm pleased to see how
Mateja improves the quality/accuracy of each series.

Regards,

Phil.

> Yours,
> Aleksandar
> 
> 
>>> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
>>> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
>>> ============================================================
>>>
>>> No-deposit column and with-deposit column have the
>>> same statistical values in every row, except ILVEV.W,
>>> which is the only function which uses the deposit
>>> function.
>>>
>>> No-deposit version of the ILVEV.W implementation:
>>>
>>> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>>                                uint32_t ws, uint32_t wt)
>>> {
>>>     TCGv_i64 t1 = tcg_temp_new_i64();
>>>     TCGv_i64 t2 = tcg_temp_new_i64();
>>>     uint64_t mask = 0x00000000ffffffffULL;
>>>
>>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>>     tcg_gen_shli_i64(t2, t2, 32);
>>>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>>
>>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>>     tcg_gen_shli_i64(t2, t2, 32);
>>>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>>
>>>     tcg_temp_free_i64(t1);
>>>     tcg_temp_free_i64(t2);
>>> }
>>>
>>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>>> ---
>>>  target/mips/helper.h     |   1 -
>>>  target/mips/msa_helper.c |   9 -----
>>>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>  3 files changed, 100 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>>> index 02e16c7..82f6a40 100644
>>> --- a/target/mips/helper.h
>>> +++ b/target/mips/helper.h
>>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>>> index a7ea6aa..d5c3842 100644
>>> --- a/target/mips/msa_helper.c
>>> +++ b/target/mips/msa_helper.c
>>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>>      } while (0)
>>>  MSA_FN_DF(ilvr_df)
>>>  #undef MSA_DO
>>> -
>>> -#define MSA_DO(DF)                      \
>>> -    do {                                \
>>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>>> -    } while (0)
>>> -MSA_FN_DF(ilvev_df)
>>> -#undef MSA_DO
>>> -
>>>  #undef MSA_LOOP_COND
>>>
>>>  #define MSA_LOOP_COND(DF) \
>>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>>> index df685e4..3057669 100644
>>> --- a/target/mips/translate.c
>>> +++ b/target/mips/translate.c
>>> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>>  }
>>>
>>> +/*
>>> + * [MSA] ILVEV.B wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (byte data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>>> +
>>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>> +
>>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>> +
>>> +    tcg_temp_free_i64(mask);
>>> +    tcg_temp_free_i64(t1);
>>> +    tcg_temp_free_i64(t2);
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.H wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (halfword data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
>>> +
>>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>> +
>>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>> +
>>> +    tcg_temp_free_i64(mask);
>>> +    tcg_temp_free_i64(t1);
>>> +    tcg_temp_free_i64(t2);
>>> +}
>>
>> Apparently you missed my comment about refactoring using mask/shift as
>> arguments:
>>
>> static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
>>                                 uint32_t ws, uint32_t wt,
>>                                 int64_t mask, int64_t shift)
>> {
>>     TCGv_i64 t1 = tcg_temp_new_i64();
>>     TCGv_i64 t2 = tcg_temp_new_i64();
>>     TCGv_i64 tm = tcg_const_i64(mask);
>>
>>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
>>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
>>     tcg_gen_shli_i64(t2, t2, shift);
>>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>
>>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
>>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
>>     tcg_gen_shli_i64(t2, t2, shift);
>>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>
>>     tcg_temp_free_i64(tm);
>>     tcg_temp_free_i64(t1);
>>     tcg_temp_free_i64(t2);
>> }
>>
>> static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>>                                uint32_t ws, uint32_t wt)
>> {
>>     gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
>> }
>>
>> static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>>                                uint32_t ws, uint32_t wt)
>> {
>>     gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
>> }
>>
>>
>>> +
>>> +/*
>>> + * [MSA] ILVEV.W wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (word data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
>>> +                        msa_wr_d[ws * 2], 32, 32);
>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
>>> +                        msa_wr_d[ws * 2 + 1], 32, 32);
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.D wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (Doubleword data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>>> +}
>>> +
>>>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>  {
>>>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>>> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>>          break;
>>>      case OPC_ILVEV_df:
>>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>>> +        switch (df) {
>>> +        case DF_BYTE:
>>> +            gen_ilvev_b(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_HALF:
>>> +            gen_ilvev_h(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_WORD:
>>> +            gen_ilvev_w(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_DOUBLE:
>>> +            gen_ilvev_d(env, wd, ws, wt);
>>> +            break;
>>> +        default:
>>> +            assert(0);
>>> +        }
>>>          break;
>>>      case OPC_BINSR_df:
>>>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>>>
>>
> ________________________________________
> From: Philippe Mathieu-Daudé <philmd@redhat.com>
> Sent: Thursday, April 4, 2019 3:42 PM
> To: Mateja Marjanovic; qemu-devel@nongnu.org
> Cc: aurelien@aurel32.net; richard.henderson@linaro.org; Aleksandar Markovic; Aleksandar Rikalo
> Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
> 
> Hi Mateja,
> 
> On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>
>> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
>> directly tcg registers and performing logic on them
>> instead of using helpers.
>>
>> In the following table, the first column is the performance
>> before this patch. The second represents the performance,
>> after converting from helpers to tcg, but without using
>> tcg_gen_deposit function. The third one is the solution
>> which is implemented in this patch.
>>
>> Performance measurement is done by executing the
>> instructions a large number of times on a computer
>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>
>> ============================================================
>> || instr    ||   before    || no-deposit ||  with-deposit ||
>> ============================================================
>> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
>> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> 
> I'm quite surprised there is not a single change here since your v5. Are
> you sure you used the correct results? I was expecting a slight improvement.
> 
>> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
>> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
>> ============================================================
>>
>> No-deposit column and with-deposit column have the
>> same statistical values in every row, except ILVEV.W,
>> which is the only function which uses the deposit
>> function.
>>
>> No-deposit version of the ILVEV.W implementation:
>>
>> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>                                uint32_t ws, uint32_t wt)
>> {
>>     TCGv_i64 t1 = tcg_temp_new_i64();
>>     TCGv_i64 t2 = tcg_temp_new_i64();
>>     uint64_t mask = 0x00000000ffffffffULL;
>>
>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>     tcg_gen_shli_i64(t2, t2, 32);
>>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>
>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>     tcg_gen_shli_i64(t2, t2, 32);
>>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>
>>     tcg_temp_free_i64(t1);
>>     tcg_temp_free_i64(t2);
>> }
>>
>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>> ---
>>  target/mips/helper.h     |   1 -
>>  target/mips/msa_helper.c |   9 -----
>>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 100 insertions(+), 11 deletions(-)
>>
>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>> index 02e16c7..82f6a40 100644
>> --- a/target/mips/helper.h
>> +++ b/target/mips/helper.h
>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>> index a7ea6aa..d5c3842 100644
>> --- a/target/mips/msa_helper.c
>> +++ b/target/mips/msa_helper.c
>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>      } while (0)
>>  MSA_FN_DF(ilvr_df)
>>  #undef MSA_DO
>> -
>> -#define MSA_DO(DF)                      \
>> -    do {                                \
>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>> -    } while (0)
>> -MSA_FN_DF(ilvev_df)
>> -#undef MSA_DO
>> -
>>  #undef MSA_LOOP_COND
>>
>>  #define MSA_LOOP_COND(DF) \
>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>> index df685e4..3057669 100644
>> --- a/target/mips/translate.c
>> +++ b/target/mips/translate.c
>> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>  }
>>
>> +/*
>> + * [MSA] ILVEV.B wd, ws, wt
>> + *
>> + *   Vector Interleave Even (byte data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.H wd, ws, wt
>> + *
>> + *   Vector Interleave Even (halfword data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
> 
> Apparently you missed my comment about refactoring using mask/shift as
> arguments:
> 
> static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
>                                 uint32_t ws, uint32_t wt,
>                                 int64_t mask, int64_t shift)
> {
>     TCGv_i64 t1 = tcg_temp_new_i64();
>     TCGv_i64 t2 = tcg_temp_new_i64();
>     TCGv_i64 tm = tcg_const_i64(mask);
> 
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
>     tcg_gen_shli_i64(t2, t2, shift);
>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> 
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
>     tcg_gen_shli_i64(t2, t2, shift);
>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> 
>     tcg_temp_free_i64(tm);
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(t2);
> }
> 
> static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
> }
> 
> static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
> }
> 
> 
>> +
>> +/*
>> + * [MSA] ILVEV.W wd, ws, wt
>> + *
>> + *   Vector Interleave Even (word data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
>> +                        msa_wr_d[ws * 2], 32, 32);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
>> +                        msa_wr_d[ws * 2 + 1], 32, 32);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.D wd, ws, wt
>> + *
>> + *   Vector Interleave Even (Doubleword data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>> +}
>> +
>>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>  {
>>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>          break;
>>      case OPC_ILVEV_df:
>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>> +        switch (df) {
>> +        case DF_BYTE:
>> +            gen_ilvev_b(env, wd, ws, wt);
>> +            break;
>> +        case DF_HALF:
>> +            gen_ilvev_h(env, wd, ws, wt);
>> +            break;
>> +        case DF_WORD:
>> +            gen_ilvev_w(env, wd, ws, wt);
>> +            break;
>> +        case DF_DOUBLE:
>> +            gen_ilvev_d(env, wd, ws, wt);
>> +            break;
>> +        default:
>> +            assert(0);
>> +        }
>>          break;
>>      case OPC_BINSR_df:
>>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
@ 2019-04-05  0:26           ` Aleksandar Markovic
  0 siblings, 0 replies; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-05  0:26 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Aleksandar Markovic, Aleksandar Rikalo, Mateja Marjanovic,
	aurelien, richard.henderson, qemu-devel

On Apr 4, 2019 9:17 PM, "Philippe Mathieu-Daudé" <philmd@redhat.com> wrote:
>
> Hi Aleksandar,
>
> On 4/4/19 8:19 PM, Aleksandar Markovic wrote:
> >> From: Philippe Mathieu-Daudé <philmd@redhat.com>
> >> Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
> >>
> >> Hi Mateja,
> >>
> >> On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> >>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> >>>
> >>> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> >>> directly tcg registers and performing logic on them
> >>> instead of using helpers.
> >>>
> >>> In the following table, the first column is the performance
> >>> before this patch. The second represents the performance,
> >>> after converting from helpers to tcg, but without using
> >>> tcg_gen_deposit function. The third one is the solution
> >>> which is implemented in this patch.
> >>>
> >>> Performance measurement is done by executing the
> >>> instructions a large number of times on a computer
> >>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
> >>>
> >>> ============================================================
> >>> || instr    ||   before    || no-deposit ||  with-deposit ||
> >>> ============================================================
> >>> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> >>> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> >>
> >> I'm quite surprised there is not a single change here since your v5, are
> >> you sure you used the correct result? I was expecting a slight
> >> improvement.
> >>
> >
> > Hello, Philippe.
> >
> > First of all, thank you so much for taking the time to provide
> > Mateja with the source code below.
> >
> > Speaking about your idea, Mateja told me he DID implement it,
> > and did measurements, and that the improvement is noticeable,
> > but really small. I don't know why he did not include that change
> > - perhaps he didn't have enough time to integrate it.
> >
> > I know he is on a long weekend now, so we will have to wait for
> > the next week for Mateja to explain this to us. Mateja, could you
> > perhaps add a column "with-deposit-and-mask-as-tcg-constant"?
>
> Ah OK, no worries since this series will enter in the 4.1 dev cycle and
> there is still 2/3 weeks to go.
> I am simply curious to see the difference using the register approach :)
> I recently noticed I can run those tests myself (I previously mixed
> threads and thought Mateja was running this on a FPGA).
>

The problem with the test procedure and results presented here is that
their accuracy is very low when the emulation of the instruction under
measurement is fast. The source code is just 10,000,000 executions of a
single MIPS instruction in a C for loop, and that is all fine and dandy if
the emulation of that instruction dominates the for loop overhead in
terms of emulation time. But for instructions that are emulated fast, we
are measuring more loop overhead than the instruction itself.

Actually, the numbers in that area (under, let's say, 25 ms) are plainly
misleading.

I think we need to devise a better test and update the table in the commit
message with new results that are closer to reality, and hence more
informative and valuable.
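
One possible shape for such a test, as a rough sketch only: time a loop
whose body holds many copies of the instruction, time an identical loop
with an empty body, and report the difference divided by the number of
instruction instances. All names below (now_ns, empty_body, time_loop,
net_ns_per_insn) are hypothetical, and the empty asm statement used as a
compiler barrier assumes GCC or Clang. A real body would contain the
unrolled MSA instruction as inline asm.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Baseline loop body: does nothing, but the empty asm statement
 * (GCC/Clang extension) keeps the compiler from deleting the call.
 */
static void empty_body(void)
{
    __asm__ volatile("" ::: "memory");
}

/* Call body() `iterations` times and return the elapsed nanoseconds. */
static uint64_t time_loop(void (*body)(void), uint64_t iterations)
{
    assert(body);
    uint64_t start = now_ns();
    for (uint64_t i = 0; i < iterations; i++) {
        body();
    }
    return now_ns() - start;
}

/*
 * Net cost of one instruction in nanoseconds: subtract the time of the
 * empty loop, then divide by the number of instruction instances
 * (iterations * unroll). Returns 0 when the difference is in the noise
 * or the arguments make no sense.
 */
static double net_ns_per_insn(uint64_t loop_ns, uint64_t empty_ns,
                              uint64_t iterations, uint64_t unroll)
{
    if (loop_ns <= empty_ns || iterations == 0 || unroll == 0) {
        return 0.0;
    }
    return (double)(loop_ns - empty_ns) /
           ((double)iterations * (double)unroll);
}
```

The intended use is to call time_loop() once with the unrolled test body
and once with empty_body, using the same iteration count, and pass both
results to net_ns_per_insn(). The larger the unroll factor, the smaller
the share of loop overhead left in the number.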

BTW, I came to the conclusion that there is no faster emulation for special
cases when two or three operands are the same, so those cases will not be
singled out.

Aleksandar

> I try to have a quick look at your work, and I'm pleased to see how
> Mateja improves the quality/accuracy of each series.
>
> Regards,
>
> Phil.
>
> > Yours,
> > Aleksandar
> >
> >
> >>> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> >>> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> >>> ============================================================
> >>>
> >>> No-deposit column and with-deposit column have the
> >>> same statistical values in every row, except ILVEV.W,
> >>> which is the only function which uses the deposit
> >>> function.
> >>>
> >>> No-deposit version of the ILVEV.W implementation:
> >>>
> >>> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >>>                                uint32_t ws, uint32_t wt)
> >>> {
> >>>     TCGv_i64 t1 = tcg_temp_new_i64();
> >>>     TCGv_i64 t2 = tcg_temp_new_i64();
> >>>     uint64_t mask = 0x00000000ffffffffULL;
> >>>
> >>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> >>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> >>>     tcg_gen_shli_i64(t2, t2, 32);
> >>>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>>
> >>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>>     tcg_gen_shli_i64(t2, t2, 32);
> >>>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>>
> >>>     tcg_temp_free_i64(t1);
> >>>     tcg_temp_free_i64(t2);
> >>> }
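
For reference, the two ILVEV.W variants should compute the same value
for each 64-bit half of the register. Below is a small host-side model
in plain C, as a sketch only: model_deposit64 and the two ilvev_w_*
functions are hypothetical names, and the deposit model assumes the
usual semantics (replace len bits of the base value, starting at bit
ofs, with the low bits of the field value).

```c
#include <stdint.h>

/*
 * Model of a 64-bit deposit: return `base` with `len` bits starting at
 * bit `ofs` replaced by the low `len` bits of `field`.
 * Assumes 0 < len <= 64 and ofs + len <= 64.
 */
static uint64_t model_deposit64(uint64_t base, unsigned ofs, unsigned len,
                                uint64_t field)
{
    uint64_t mask = (len == 64 ? ~0ULL : ((1ULL << len) - 1)) << ofs;
    return (base & ~mask) | ((field << ofs) & mask);
}

/* No-deposit variant: two masks, one shift, one or. */
static uint64_t ilvev_w_no_deposit(uint64_t ws, uint64_t wt)
{
    const uint64_t mask = 0x00000000ffffffffULL;
    return (wt & mask) | ((ws & mask) << 32);
}

/* With-deposit variant: low word of ws goes into the high word of wt. */
static uint64_t ilvev_w_with_deposit(uint64_t ws, uint64_t wt)
{
    return model_deposit64(wt, 32, 32, ws);
}
```

Both functions keep the even (low) word of wt in bits 31..0 and place
the even word of ws in bits 63..32, so any speed difference comes from
the number of generated ops, not from the result.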
> >>>
> >>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> >>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> >>> ---
> >>>  target/mips/helper.h     |   1 -
> >>>  target/mips/msa_helper.c |   9 -----
> >>>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
> >>>  3 files changed, 100 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/target/mips/helper.h b/target/mips/helper.h
> >>> index 02e16c7..82f6a40 100644
> >>> --- a/target/mips/helper.h
> >>> +++ b/target/mips/helper.h
> >>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> >>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> >>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> >>> index a7ea6aa..d5c3842 100644
> >>> --- a/target/mips/msa_helper.c
> >>> +++ b/target/mips/msa_helper.c
> >>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
> >>>      } while (0)
> >>>  MSA_FN_DF(ilvr_df)
> >>>  #undef MSA_DO
> >>> -
> >>> -#define MSA_DO(DF)                      \
> >>> -    do {                                \
> >>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> >>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> >>> -    } while (0)
> >>> -MSA_FN_DF(ilvev_df)
> >>> -#undef MSA_DO
> >>> -
> >>>  #undef MSA_LOOP_COND
> >>>
> >>>  #define MSA_LOOP_COND(DF) \
> >>> diff --git a/target/mips/translate.c b/target/mips/translate.c
> >>> index df685e4..3057669 100644
> >>> --- a/target/mips/translate.c
> >>> +++ b/target/mips/translate.c
> >>> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> >>>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> >>>  }
> >>>
> >>> +/*
> >>> + * [MSA] ILVEV.B wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (byte data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >>> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >>> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 8);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 8);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>> +
> >>> +    tcg_temp_free_i64(mask);
> >>> +    tcg_temp_free_i64(t1);
> >>> +    tcg_temp_free_i64(t2);
> >>> +}
> >>> +
> >>> +/*
> >>> + * [MSA] ILVEV.H wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (halfword data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >>> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >>> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 16);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 16);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>> +
> >>> +    tcg_temp_free_i64(mask);
> >>> +    tcg_temp_free_i64(t1);
> >>> +    tcg_temp_free_i64(t2);
> >>> +}
> >>
> >> Apparently you missed my comment about refactoring using mask/shift as
> >> arguments:
> >>
> >> static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
> >>                                 uint32_t ws, uint32_t wt,
> >>                                 int64_t mask, int64_t shift)
> >> {
> >>     TCGv_i64 t1 = tcg_temp_new_i64();
> >>     TCGv_i64 t2 = tcg_temp_new_i64();
> >>     TCGv_i64 tm = tcg_const_i64(mask);
> >>
> >>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
> >>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
> >>     tcg_gen_shli_i64(t2, t2, shift);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>
> >>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
> >>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
> >>     tcg_gen_shli_i64(t2, t2, shift);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>
> >>     tcg_temp_free_i64(tm);
> >>     tcg_temp_free_i64(t1);
> >>     tcg_temp_free_i64(t2);
> >> }
> >>
> >> static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >>                                uint32_t ws, uint32_t wt)
> >> {
> >>     gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
> >> }
> >>
> >> static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >>                                uint32_t ws, uint32_t wt)
> >> {
> >>     gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
> >> }
> >>
> >>
> >>> +
> >>> +/*
> >>> + * [MSA] ILVEV.W wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (word data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> >>> +                        msa_wr_d[ws * 2], 32, 32);
> >>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> >>> +                        msa_wr_d[ws * 2 + 1], 32, 32);
> >>> +}
> >>> +
> >>> +/*
> >>> + * [MSA] ILVEV.D wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (Doubleword data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> >>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> >>> +}
> >>> +
> >>>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>>  {
> >>>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> >>> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
> >>>          break;
> >>>      case OPC_ILVEV_df:
> >>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> >>> +        switch (df) {
> >>> +        case DF_BYTE:
> >>> +            gen_ilvev_b(env, wd, ws, wt);
> >>> +            break;
> >>> +        case DF_HALF:
> >>> +            gen_ilvev_h(env, wd, ws, wt);
> >>> +            break;
> >>> +        case DF_WORD:
> >>> +            gen_ilvev_w(env, wd, ws, wt);
> >>> +            break;
> >>> +        case DF_DOUBLE:
> >>> +            gen_ilvev_d(env, wd, ws, wt);
> >>> +            break;
> >>> +        default:
> >>> +            assert(0);
> >>> +        }
> >>>          break;
> >>>      case OPC_BINSR_df:
> >>>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> >>>
> >>
> > ________________________________________
> > From: Philippe Mathieu-Daudé <philmd@redhat.com>
> > Sent: Thursday, April 4, 2019 3:42 PM
> > To: Mateja Marjanovic; qemu-devel@nongnu.org
> > Cc: aurelien@aurel32.net; richard.henderson@linaro.org;
> >     Aleksandar Markovic; Aleksandar Rikalo
> > Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
> >
> > Hi Mateja,
> >
> > On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> >> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> >>
> >> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> >> directly tcg registers and performing logic on them
> >> instead of using helpers.
> >>
> >> In the following table, the first column is the performance
> >> before this patch. The second represents the performance,
> >> after converting from helpers to tcg, but without using
> >> tcg_gen_deposit function. The third one is the solution
> >> which is implemented in this patch.
> >>
> >> Performance measurement is done by executing the
> >> instructions a large number of times on a computer
> >> with Intel Core i7-3770 CPU @ 3.40GHz×8.
> >>
> >> ============================================================
> >> || instr    ||   before    || no-deposit ||  with-deposit ||
> >> ============================================================
> >> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> >> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> >
> > I'm quite surprised there is not a single change here since your v5, are
> > you sure you used the correct result? I was expecting a slight
> > improvement.
> >
> >> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> >> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> >> ============================================================
> >>
> >> No-deposit column and with-deposit column have the
> >> same statistical values in every row, except ILVEV.W,
> >> which is the only function which uses the deposit
> >> function.
> >>
> >> No-deposit version of the ILVEV.W implementation:
> >>
> >> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >>                                uint32_t ws, uint32_t wt)
> >> {
> >>     TCGv_i64 t1 = tcg_temp_new_i64();
> >>     TCGv_i64 t2 = tcg_temp_new_i64();
> >>     uint64_t mask = 0x00000000ffffffffULL;
> >>
> >>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> >>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> >>     tcg_gen_shli_i64(t2, t2, 32);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>
> >>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>     tcg_gen_shli_i64(t2, t2, 32);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>
> >>     tcg_temp_free_i64(t1);
> >>     tcg_temp_free_i64(t2);
> >> }
> >>
> >> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> >> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> >> ---
> >>  target/mips/helper.h     |   1 -
> >>  target/mips/msa_helper.c |   9 -----
> >>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
> >>  3 files changed, 100 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/target/mips/helper.h b/target/mips/helper.h
> >> index 02e16c7..82f6a40 100644
> >> --- a/target/mips/helper.h
> >> +++ b/target/mips/helper.h
> >> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> >> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> >> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> >> index a7ea6aa..d5c3842 100644
> >> --- a/target/mips/msa_helper.c
> >> +++ b/target/mips/msa_helper.c
> >> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
> >>      } while (0)
> >>  MSA_FN_DF(ilvr_df)
> >>  #undef MSA_DO
> >> -
> >> -#define MSA_DO(DF)                      \
> >> -    do {                                \
> >> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> >> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> >> -    } while (0)
> >> -MSA_FN_DF(ilvev_df)
> >> -#undef MSA_DO
> >> -
> >>  #undef MSA_LOOP_COND
> >>
> >>  #define MSA_LOOP_COND(DF) \
> >> diff --git a/target/mips/translate.c b/target/mips/translate.c
> >> index df685e4..3057669 100644
> >> --- a/target/mips/translate.c
> >> +++ b/target/mips/translate.c
> >> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> >>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> >>  }
> >>
> >> +/*
> >> + * [MSA] ILVEV.B wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (byte data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 8);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 8);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >> +
> >> +    tcg_temp_free_i64(mask);
> >> +    tcg_temp_free_i64(t1);
> >> +    tcg_temp_free_i64(t2);
> >> +}
> >> +
> >> +/*
> >> + * [MSA] ILVEV.H wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (halfword data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 16);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 16);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >> +
> >> +    tcg_temp_free_i64(mask);
> >> +    tcg_temp_free_i64(t1);
> >> +    tcg_temp_free_i64(t2);
> >> +}
> >
> > Apparently you missed my comment about refactoring using mask/shift as
> > arguments:
> >
> > static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
> >                                 uint32_t ws, uint32_t wt,
> >                                 int64_t mask, int64_t shift)
> > {
> >     TCGv_i64 t1 = tcg_temp_new_i64();
> >     TCGv_i64 t2 = tcg_temp_new_i64();
> >     TCGv_i64 tm = tcg_const_i64(mask);
> >
> >     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
> >     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
> >     tcg_gen_shli_i64(t2, t2, shift);
> >     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >
> >     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
> >     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
> >     tcg_gen_shli_i64(t2, t2, shift);
> >     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >
> >     tcg_temp_free_i64(tm);
> >     tcg_temp_free_i64(t1);
> >     tcg_temp_free_i64(t2);
> > }
> >
> > static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >                                uint32_t ws, uint32_t wt)
> > {
> >     gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
> > }
> >
> > static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >                                uint32_t ws, uint32_t wt)
> > {
> >     gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
> > }
> >
> >
> >> +
> >> +/*
> >> + * [MSA] ILVEV.W wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (word data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> >> +                        msa_wr_d[ws * 2], 32, 32);
> >> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> >> +                        msa_wr_d[ws * 2 + 1], 32, 32);
> >> +}
> >> +
> >> +/*
> >> + * [MSA] ILVEV.D wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (Doubleword data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> >> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> >> +}
> >> +
> >>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>  {
> >>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> >> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
> >>          break;
> >>      case OPC_ILVEV_df:
> >> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> >> +        switch (df) {
> >> +        case DF_BYTE:
> >> +            gen_ilvev_b(env, wd, ws, wt);
> >> +            break;
> >> +        case DF_HALF:
> >> +            gen_ilvev_h(env, wd, ws, wt);
> >> +            break;
> >> +        case DF_WORD:
> >> +            gen_ilvev_w(env, wd, ws, wt);
> >> +            break;
> >> +        case DF_DOUBLE:
> >> +            gen_ilvev_d(env, wd, ws, wt);
> >> +            break;
> >> +        default:
> >> +            assert(0);
> >> +        }
> >>          break;
> >>      case OPC_BINSR_df:
> >>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> >>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
@ 2019-04-05  0:26           ` Aleksandar Markovic
  0 siblings, 0 replies; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-05  0:26 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Aleksandar Rikalo, richard.henderson, qemu-devel,
	Mateja Marjanovic, Aleksandar Markovic, aurelien

On Apr 4, 2019 9:17 PM, "Philippe Mathieu-Daudé" <philmd@redhat.com> wrote:
>
> Hi Aleksandar,
>
> On 4/4/19 8:19 PM, Aleksandar Markovic wrote:
> >> From: Philippe Mathieu-Daudé <philmd@redhat.com>
> >> Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
> >>
> >> Hi Mateja,
> >>
> >> On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> >>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> >>>
> >>> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> >>> directly tcg registers and performing logic on them
> >>> instead of using helpers.
> >>>
> >>> In the following table, the first column is the performance
> >>> before this patch. The second represents the performance,
> >>> after converting from helpers to tcg, but without using
> >>> tcg_gen_deposit function. The third one is the solution
> >>> which is implemented in this patch.
> >>>
> >>> Performance measurement is done by executing the
> >>> instructions a large number of times on a computer
> >>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
> >>>
> >>> ============================================================
> >>> || instr    ||   before    || no-deposit ||  with-deposit ||
> >>> ============================================================
> >>> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> >>> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> >>
> >> I'm quite surprised there is not a single change here since your v5, are
> >> you sure you used the correct result? I was expecting a slight
> >> improvement.
> >>
> >
> > Hello, Philippe.
> >
> > First of all, thank you so much for taking the time to provide
> > Mateja with the source code below.
> >
> > Speaking about your idea, Mateja told me he DID implement it,
> > and did measurements, and that the improvement is noticeable,
> > but really small. I don't know why he did not include that change
> > - perhaps he didn't have enough time to integrate it.
> >
> > I know he is on a long weekend now, so we will have to wait for
> > the next week for Mateja to explain this to us. Mateja, could you
> > perhaps add a column "with-deposit-and-mask-as-tcg-constant"?
>
> Ah OK, no worries since this series will enter in the 4.1 dev cycle and
> there is still 2/3 weeks to go.
> I am simply curious to see the difference using the register approach :)
> I recently noticed I can run those tests myself (I previously mixed
> threads and thought Mateja was running this on a FPGA).
>

The problem with the test procedure and results presented here is that
their accuracy is very low when the emulation of the instruction under
measurement is fast. The source code is just 10,000,000 executions of a
single MIPS instruction in a C for loop, and that is all fine and dandy if
the emulation of that instruction dominates the for loop overhead in
terms of emulation time. But for instructions that are emulated fast, we
are measuring more loop overhead than the instruction itself.

Actually, the numbers in that area (under, let's say, 25 ms) are plainly
misleading.

I think we need to devise a better test and update the table in the commit
message with new results that are closer to reality, and hence more
informative and valuable.

BTW, I came to the conclusion that there is no faster emulation for special
cases when two or three operands are the same, so those cases will not be
singled out.

Aleksandar

> I try to have a quick look at your work, and I'm pleased to see how
> Mateja improves the quality/accuracy of each series.
>
> Regards,
>
> Phil.
>
> > Yours,
> > Aleksandar
> >
> >
> >>> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> >>> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> >>> ============================================================
> >>>
> >>> No-deposit column and with-deposit column have the
> >>> same statistical values in every row, except ILVEV.W,
> >>> which is the only function which uses the deposit
> >>> function.
> >>>
> >>> No-deposit version of the ILVEV.W implementation:
> >>>
> >>> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >>>                                uint32_t ws, uint32_t wt)
> >>> {
> >>>     TCGv_i64 t1 = tcg_temp_new_i64();
> >>>     TCGv_i64 t2 = tcg_temp_new_i64();
> >>>     uint64_t mask = 0x00000000ffffffffULL;
> >>>
> >>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> >>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> >>>     tcg_gen_shli_i64(t2, t2, 32);
> >>>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>>
> >>>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>>     tcg_gen_shli_i64(t2, t2, 32);
> >>>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>>
> >>>     tcg_temp_free_i64(t1);
> >>>     tcg_temp_free_i64(t2);
> >>> }
> >>>
> >>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> >>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> >>> ---
> >>>  target/mips/helper.h     |   1 -
> >>>  target/mips/msa_helper.c |   9 -----
> >>>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
> >>>  3 files changed, 100 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/target/mips/helper.h b/target/mips/helper.h
> >>> index 02e16c7..82f6a40 100644
> >>> --- a/target/mips/helper.h
> >>> +++ b/target/mips/helper.h
> >>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> >>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> >>>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> >>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> >>> index a7ea6aa..d5c3842 100644
> >>> --- a/target/mips/msa_helper.c
> >>> +++ b/target/mips/msa_helper.c
> >>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
> >>>      } while (0)
> >>>  MSA_FN_DF(ilvr_df)
> >>>  #undef MSA_DO
> >>> -
> >>> -#define MSA_DO(DF)                      \
> >>> -    do {                                \
> >>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> >>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> >>> -    } while (0)
> >>> -MSA_FN_DF(ilvev_df)
> >>> -#undef MSA_DO
> >>> -
> >>>  #undef MSA_LOOP_COND
> >>>
> >>>  #define MSA_LOOP_COND(DF) \
> >>> diff --git a/target/mips/translate.c b/target/mips/translate.c
> >>> index df685e4..3057669 100644
> >>> --- a/target/mips/translate.c
> >>> +++ b/target/mips/translate.c
> >>> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> >>>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> >>>  }
> >>>
> >>> +/*
> >>> + * [MSA] ILVEV.B wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (byte data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >>> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >>> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 8);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 8);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>> +
> >>> +    tcg_temp_free_i64(mask);
> >>> +    tcg_temp_free_i64(t1);
> >>> +    tcg_temp_free_i64(t2);
> >>> +}
> >>> +
> >>> +/*
> >>> + * [MSA] ILVEV.H wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (halfword data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >>> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >>> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 16);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>> +
> >>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>> +    tcg_gen_shli_i64(t2, t2, 16);
> >>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>> +
> >>> +    tcg_temp_free_i64(mask);
> >>> +    tcg_temp_free_i64(t1);
> >>> +    tcg_temp_free_i64(t2);
> >>> +}
> >>
> >> Apparently you missed my comment about refactoring using mask/shift as
> >> arguments:
> >>
> >> static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
> >>                                 uint32_t ws, uint32_t wt,
> >>                                 int64_t mask, int64_t shift)
> >> {
> >>     TCGv_i64 t1 = tcg_temp_new_i64();
> >>     TCGv_i64 t2 = tcg_temp_new_i64();
> >>     TCGv_i64 tm = tcg_const_i64(mask);
> >>
> >>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
> >>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
> >>     tcg_gen_shli_i64(t2, t2, shift);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>
> >>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
> >>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
> >>     tcg_gen_shli_i64(t2, t2, shift);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>
> >>     tcg_temp_free_i64(tm);
> >>     tcg_temp_free_i64(t1);
> >>     tcg_temp_free_i64(t2);
> >> }
> >>
> >> static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >>                                uint32_t ws, uint32_t wt)
> >> {
> >>     gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
> >> }
> >>
> >> static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >>                                uint32_t ws, uint32_t wt)
> >> {
> >>     gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
> >> }
> >>
> >>
> >>> +
> >>> +/*
> >>> + * [MSA] ILVEV.W wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (word data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> >>> +                        msa_wr_d[ws * 2], 32, 32);
> >>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> >>> +                        msa_wr_d[ws * 2 + 1], 32, 32);
> >>> +}
> >>> +
> >>> +/*
> >>> + * [MSA] ILVEV.D wd, ws, wt
> >>> + *
> >>> + *   Vector Interleave Even (Doubleword data elements)
> >>> + *
> >>> + */
> >>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> >>> +                               uint32_t ws, uint32_t wt)
> >>> +{
> >>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> >>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> >>> +}
> >>> +
> >>>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>>  {
> >>>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> >>> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
> >>>          break;
> >>>      case OPC_ILVEV_df:
> >>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> >>> +        switch (df) {
> >>> +        case DF_BYTE:
> >>> +            gen_ilvev_b(env, wd, ws, wt);
> >>> +            break;
> >>> +        case DF_HALF:
> >>> +            gen_ilvev_h(env, wd, ws, wt);
> >>> +            break;
> >>> +        case DF_WORD:
> >>> +            gen_ilvev_w(env, wd, ws, wt);
> >>> +            break;
> >>> +        case DF_DOUBLE:
> >>> +            gen_ilvev_d(env, wd, ws, wt);
> >>> +            break;
> >>> +        default:
> >>> +            assert(0);
> >>> +        }
> >>>          break;
> >>>      case OPC_BINSR_df:
> >>>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> >>>
> >>
> > ________________________________________
> > From: Philippe Mathieu-Daudé <philmd@redhat.com>
> > Sent: Thursday, April 4, 2019 3:42 PM
> > To: Mateja Marjanovic; qemu-devel@nongnu.org
> > Cc: aurelien@aurel32.net; richard.henderson@linaro.org; Aleksandar Markovic; Aleksandar Rikalo
> > Subject: Re: [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
> >
> > Hi Mateja,
> >
> > On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
> >> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> >>
> >> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> >> directly tcg registers and performing logic on them
> >> instead of using helpers.
> >>
> >> In the following table, the first column is the performance
> >> before this patch. The second represents the performance,
> >> after converting from helpers to tcg, but without using
> >> tcg_gen_deposit function. The third one is the solution
> >> which is implemented in this patch.
> >>
> >> Performance measurement is done by executing the
> >> instructions a large number of times on a computer
> >> with Intel Core i7-3770 CPU @ 3.40GHz×8.
> >>
> >> ============================================================
> >> || instr    ||   before    || no-deposit ||  with-deposit ||
> >> ============================================================
> >> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> >> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> >
> > I'm quite surprised there is not a single change here since your v5, are
> > you sure you used the correct result? I was expecting a slight
> > improvement.
> >
> >> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> >> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> >> ============================================================
> >>
> >> No-deposit column and with-deposit column have the
> >> same statistical values in every row, except ILVEV.W,
> >> which is the only function which uses the deposit
> >> function.
> >>
> >> No-deposit version of the ILVEV.W implementation:
> >>
> >> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >>                                uint32_t ws, uint32_t wt)
> >> {
> >>     TCGv_i64 t1 = tcg_temp_new_i64();
> >>     TCGv_i64 t2 = tcg_temp_new_i64();
> >>     uint64_t mask = 0x00000000ffffffffULL;
> >>
> >>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> >>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> >>     tcg_gen_shli_i64(t2, t2, 32);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >>
> >>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >>     tcg_gen_shli_i64(t2, t2, 32);
> >>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >>
> >>     tcg_temp_free_i64(t1);
> >>     tcg_temp_free_i64(t2);
> >> }
> >>
> >> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> >> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> >> ---
> >>  target/mips/helper.h     |   1 -
> >>  target/mips/msa_helper.c |   9 -----
> >>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
> >>  3 files changed, 100 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/target/mips/helper.h b/target/mips/helper.h
> >> index 02e16c7..82f6a40 100644
> >> --- a/target/mips/helper.h
> >> +++ b/target/mips/helper.h
> >> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> >> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> >>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> >> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> >> index a7ea6aa..d5c3842 100644
> >> --- a/target/mips/msa_helper.c
> >> +++ b/target/mips/msa_helper.c
> >> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
> >>      } while (0)
> >>  MSA_FN_DF(ilvr_df)
> >>  #undef MSA_DO
> >> -
> >> -#define MSA_DO(DF)                      \
> >> -    do {                                \
> >> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> >> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> >> -    } while (0)
> >> -MSA_FN_DF(ilvev_df)
> >> -#undef MSA_DO
> >> -
> >>  #undef MSA_LOOP_COND
> >>
> >>  #define MSA_LOOP_COND(DF) \
> >> diff --git a/target/mips/translate.c b/target/mips/translate.c
> >> index df685e4..3057669 100644
> >> --- a/target/mips/translate.c
> >> +++ b/target/mips/translate.c
> >> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> >>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> >>  }
> >>
> >> +/*
> >> + * [MSA] ILVEV.B wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (byte data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 8);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 8);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >> +
> >> +    tcg_temp_free_i64(mask);
> >> +    tcg_temp_free_i64(t1);
> >> +    tcg_temp_free_i64(t2);
> >> +}
> >> +
> >> +/*
> >> + * [MSA] ILVEV.H wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (halfword data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    TCGv_i64 t1 = tcg_temp_new_i64();
> >> +    TCGv_i64 t2 = tcg_temp_new_i64();
> >> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 16);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >> +
> >> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> >> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> >> +    tcg_gen_shli_i64(t2, t2, 16);
> >> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >> +
> >> +    tcg_temp_free_i64(mask);
> >> +    tcg_temp_free_i64(t1);
> >> +    tcg_temp_free_i64(t2);
> >> +}
> >
> > Apparently you missed my comment about refactoring using mask/shift as
> > arguments:
> >
> > static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
> >                                 uint32_t ws, uint32_t wt,
> >                                 int64_t mask, int64_t shift)
> > {
> >     TCGv_i64 t1 = tcg_temp_new_i64();
> >     TCGv_i64 t2 = tcg_temp_new_i64();
> >     TCGv_i64 tm = tcg_const_i64(mask);
> >
> >     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
> >     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
> >     tcg_gen_shli_i64(t2, t2, shift);
> >     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> >
> >     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
> >     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
> >     tcg_gen_shli_i64(t2, t2, shift);
> >     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> >
> >     tcg_temp_free_i64(tm);
> >     tcg_temp_free_i64(t1);
> >     tcg_temp_free_i64(t2);
> > }
> >
> > static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> >                                uint32_t ws, uint32_t wt)
> > {
> >     gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
> > }
> >
> > static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> >                                uint32_t ws, uint32_t wt)
> > {
> >     gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
> > }
> >
> >
> >> +
> >> +/*
> >> + * [MSA] ILVEV.W wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (word data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> >> +                        msa_wr_d[ws * 2], 32, 32);
> >> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> >> +                        msa_wr_d[ws * 2 + 1], 32, 32);
> >> +}
> >> +
> >> +/*
> >> + * [MSA] ILVEV.D wd, ws, wt
> >> + *
> >> + *   Vector Interleave Even (Doubleword data elements)
> >> + *
> >> + */
> >> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> >> +                               uint32_t ws, uint32_t wt)
> >> +{
> >> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> >> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> >> +}
> >> +
> >>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>  {
> >>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> >> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
> >>          break;
> >>      case OPC_ILVEV_df:
> >> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> >> +        switch (df) {
> >> +        case DF_BYTE:
> >> +            gen_ilvev_b(env, wd, ws, wt);
> >> +            break;
> >> +        case DF_HALF:
> >> +            gen_ilvev_h(env, wd, ws, wt);
> >> +            break;
> >> +        case DF_WORD:
> >> +            gen_ilvev_w(env, wd, ws, wt);
> >> +            break;
> >> +        case DF_DOUBLE:
> >> +            gen_ilvev_d(env, wd, ws, wt);
> >> +            break;
> >> +        default:
> >> +            assert(0);
> >> +        }
> >>          break;
> >>      case OPC_BINSR_df:
> >>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> >>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> MSA instructions
@ 2019-04-13 16:05     ` Aleksandar Markovic
  0 siblings, 0 replies; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-13 16:05 UTC (permalink / raw)
  To: Mateja Marjanovic
  Cc: QEMU Developers, Aleksandar Rikalo, Richard Henderson,
	Philippe Mathieu-Daudé,
	Aleksandar Markovic, Aurelien Jarno

On Thu, Apr 4, 2019 at 3:16 PM Mateja Marjanovic
<mateja.marjanovic@rt-rk.com> wrote:
>
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimized ILVR.<B|H|W|D> instructions, using a hybrid

Optimized -> Optimize

> approach. For byte data elements, use a helper with an
> unrolled loop (much better performance), for halfword,

(much better performance) -> (having much better performance
than direct tcg translation)

> word and doubleword data elements use directly tcg
> registers and logic performed on them.
>
> Performance measurement is done by executing the
> instructions a large number of times on a computer
> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>
> ===================================================
> ||  instr  ||  helper  ||    tcg    ||   hybrid  ||
> ===================================================
> || ilvr.b: || 62.87 ms ||  74.76 ms ||  61.52 ms || <-- helper
> || ilvr.h: || 44.11 ms ||  33.00 ms ||  33.55 ms || <-- tcg
> || ilvr.w: || 34.97 ms ||  23.06 ms ||  22.67 ms || <-- tcg
> || ilvr.d: || 27.33 ms ||  19.87 ms ||  20.02 ms || <-- tcg
> ===================================================
>

instr -> instruction

||  61.52 ms || <-- helper  ->  ||  61.52 ms (helper) ||

and similarly for the other three rows.

> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   2 +-
>  target/mips/msa_helper.c |  33 +++++++++++----
>  target/mips/translate.c  | 107 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 132 insertions(+), 10 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index cd73723..d4755ef 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -862,7 +862,6 @@ DEF_HELPER_5(msa_sld_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_splat_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> @@ -946,6 +945,7 @@ DEF_HELPER_4(msa_insert_w, void, env, i32, i32, i32)
>  DEF_HELPER_4(msa_insert_d, void, env, i32, i32, i32)
>
>  DEF_HELPER_4(msa_ilvl_b, void, env, i32, i32, i32)
> +DEF_HELPER_4(msa_ilvr_b, void, env, i32, i32, i32)
>
>  DEF_HELPER_4(msa_fclass_df, void, env, i32, i32, i32)
>  DEF_HELPER_4(msa_ftrunc_s_df, void, env, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index 84bbe6f..2470cef 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1181,14 +1181,6 @@ MSA_FN_DF(pckev_df)
>      } while (0)
>  MSA_FN_DF(pckod_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = R##DF(pwt, i); \
> -        pwx->DF[2*i+1] = R##DF(pws, i); \
> -    } while (0)
> -MSA_FN_DF(ilvr_df)
> -#undef MSA_DO
>  #undef MSA_LOOP_COND
>
>  #define MSA_LOOP_COND(DF) \
> @@ -1249,6 +1241,31 @@ void helper_msa_ilvl_b(CPUMIPSState *env, uint32_t wd,
>      pwd->b[15] = pws->b[15];
>  }
>
> +void helper_msa_ilvr_b(CPUMIPSState *env, uint32_t wd,
> +                       uint32_t ws, uint32_t wt)
> +{
> +    wr_t *pwd = &(env->active_fpu.fpr[wd].wr);
> +    wr_t *pws = &(env->active_fpu.fpr[ws].wr);
> +    wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
> +

Why do we use env->active_fpu.fpr[wd].wr here, while for the other instructions in
this patch we access msa_wr_d[] directly?

> +    pwd->b[15] = pws->b[7];
> +    pwd->b[14] = pwt->b[7];
> +    pwd->b[13] = pws->b[6];
> +    pwd->b[12] = pwt->b[6];
> +    pwd->b[11] = pws->b[5];
> +    pwd->b[10] = pwt->b[5];
> +    pwd->b[9]  = pws->b[4];
> +    pwd->b[8]  = pwt->b[4];
> +    pwd->b[7]  = pws->b[3];
> +    pwd->b[6]  = pwt->b[3];
> +    pwd->b[5]  = pws->b[2];
> +    pwd->b[4]  = pwt->b[2];
> +    pwd->b[3]  = pws->b[1];
> +    pwd->b[2]  = pwt->b[1];
> +    pwd->b[1]  = pws->b[0];
> +    pwd->b[0]  = pwt->b[0];
> +}
> +
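
A side note on the unrolled helper above, as a minimal sketch in plain C rather than QEMU code. It assumes that byte element i is simply array index i (host-endianness handling is ignored), and the names ilvr_b_ref, ilvr_b_inplace and ilvr_b_selfcheck are hypothetical. The sketch suggests that storing from byte 15 down to byte 0, as the helper does, gives the expected result even when wd is the same register as ws or wt, because every source byte is read before its slot is overwritten.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference ILVR.B: copy both sources first, then interleave the low 8 bytes. */
static void ilvr_b_ref(uint8_t *wd, const uint8_t *ws, const uint8_t *wt)
{
    uint8_t s[16], t[16];

    memcpy(s, ws, 16);
    memcpy(t, wt, 16);
    for (int i = 0; i < 8; i++) {
        wd[2 * i]     = t[i];
        wd[2 * i + 1] = s[i];
    }
}

/*
 * Same stores as the unrolled helper, in the same order (15, 14, ..., 0),
 * written as a descending loop and without any temporary copies.
 */
static void ilvr_b_inplace(uint8_t *wd, const uint8_t *ws, const uint8_t *wt)
{
    for (int i = 7; i >= 0; i--) {
        wd[2 * i + 1] = ws[i];
        wd[2 * i]     = wt[i];
    }
}

/* Compare both versions for wd == ws, wd == wt and wd == ws == wt. */
static void ilvr_b_selfcheck(void)
{
    uint8_t a[16], b[16], expect[16], got[16];

    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(0x10 + i);
        b[i] = (uint8_t)(0xa0 + i);
    }

    ilvr_b_ref(expect, a, b);          /* ws = a, wt = b */

    memcpy(got, a, 16);                /* wd aliases ws */
    ilvr_b_inplace(got, got, b);
    assert(memcmp(got, expect, 16) == 0);

    memcpy(got, b, 16);                /* wd aliases wt */
    ilvr_b_inplace(got, a, got);
    assert(memcmp(got, expect, 16) == 0);

    ilvr_b_ref(expect, a, a);          /* ws = wt = a */
    memcpy(got, a, 16);                /* wd aliases both sources */
    ilvr_b_inplace(got, got, got);
    assert(memcmp(got, expect, 16) == 0);
}
```

Calling ilvr_b_selfcheck() from a test program exercises the three aliasing cases. The same reasoning does not carry over automatically to an ascending store order.
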
>  void helper_msa_copy_s_b(CPUMIPSState *env, uint32_t rd,
>                           uint32_t ws, uint32_t n)
>  {
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index 6c6811e..90332fb 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28885,6 +28885,96 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
>  }
>
>  /*
> + * [MSA] ILVR.H wd, ws, wt
> + *
> + *   Vector Interleave Right (halfword data elements)
> + *
> + */
> +static inline void gen_ilvr_h(CPUMIPSState *env, uint32_t wd,
> +                              uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    uint64_t mask = 0x000000000000ffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 16;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shli_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t1, t1, 32);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
> +
> +    mask <<= 16;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 32);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 16;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVR.W wd, ws, wt
> + *
> + *   Vector Interleave Right (word data elements)
> + *
> + */
> +static inline void gen_ilvr_w(CPUMIPSState *env, uint32_t wd,
> +                              uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    uint64_t mask = 0x00000000ffffffffULL;

Use tcg_const_i64(). The same for the previous function.

> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t1, t1, 32);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
> +
> +    mask <<= 32;

Just assign the constant value to the mask; there is no need for a shift operation.
The same applies to the other similar cases in this patch.
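
To illustrate the comment about the mask, here is a minimal sketch in plain C (not TCG code) of ILVR.W with both masks written out as fixed constants instead of being derived with "mask <<= 32". The names wr128, MASK_LO32, MASK_HI32, ilvr_w_masks, ilvr_w_ref and ilvr_w_selfcheck are hypothetical and exist only for this illustration; the model assumes d[0] holds the low 64 bits of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed masks, spelled out once instead of shifting a variable. */
#define MASK_LO32 0x00000000ffffffffULL
#define MASK_HI32 0xffffffff00000000ULL

/* A 128-bit MSA register modelled as two 64-bit halves (d[0] = low half). */
typedef struct {
    uint64_t d[2];
} wr128;

/* Mask-and-shift ILVR.W, mirroring the operation sequence of gen_ilvr_w(). */
static wr128 ilvr_w_masks(wr128 ws, wr128 wt)
{
    wr128 wd;

    wd.d[0] = (wt.d[0] & MASK_LO32) | ((ws.d[0] & MASK_LO32) << 32);
    wd.d[1] = ((wt.d[0] & MASK_HI32) >> 32) | (ws.d[0] & MASK_HI32);
    return wd;
}

/* Element-wise reference: wd.w[2i] = wt.w[i], wd.w[2i+1] = ws.w[i], i = 0..1. */
static wr128 ilvr_w_ref(wr128 ws, wr128 wt)
{
    uint32_t s[2] = { (uint32_t)ws.d[0], (uint32_t)(ws.d[0] >> 32) };
    uint32_t t[2] = { (uint32_t)wt.d[0], (uint32_t)(wt.d[0] >> 32) };
    wr128 wd;

    wd.d[0] = (uint64_t)t[0] | ((uint64_t)s[0] << 32);
    wd.d[1] = (uint64_t)t[1] | ((uint64_t)s[1] << 32);
    return wd;
}

/* Both versions must agree on a sample input. */
static void ilvr_w_selfcheck(void)
{
    wr128 ws = { { 0xaaaaaaaabbbbbbbbULL, 0x0123456789abcdefULL } };
    wr128 wt = { { 0xccccccccddddddddULL, 0xfedcba9876543210ULL } };
    wr128 got = ilvr_w_masks(ws, wt);
    wr128 ref = ilvr_w_ref(ws, wt);

    assert(got.d[0] == ref.d[0]);
    assert(got.d[1] == ref.d[1]);
}
```

Since the arguments are passed by value, the model sidesteps the question of what happens when wd is the same register as ws or wt; the TCG version would need to be checked for that case separately.
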

> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 32);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVR.D wd, ws, wt
> + *
> + *   Vector Interleave Right (doubleword data elements)
> + *
> + */
> +static inline void gen_ilvr_d(CPUMIPSState *env, uint32_t wd,
> +                              uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
> +

This function seems to be identical to gen_ilvev_d(). If that is the case,
please rename gen_ilvev_d() to gen_ilvev_ilvr_d() in this patch,
and use it for handling both ILVEV.D and ILVR.D.
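
For reference, a minimal sketch in plain C of why the two should coincide for doubleword elements. It follows the element-wise definitions from the removed helper macros and assumes that the "right" element i of a register is simply d[i]; the names wr128, D_ELEMS, ilvev_d_model, ilvr_d_model and ilv_d_selfcheck are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit MSA register as two doubleword elements (d[0] = low half). */
typedef struct {
    uint64_t d[2];
} wr128;

#define D_ELEMS 2   /* number of doubleword elements per register */

/* ILVEV.D: wd.d[2i] = wt.d[2i], wd.d[2i+1] = ws.d[2i], for i < D_ELEMS / 2. */
static wr128 ilvev_d_model(wr128 ws, wr128 wt)
{
    wr128 wd;

    for (int i = 0; i < D_ELEMS / 2; i++) {
        wd.d[2 * i]     = wt.d[2 * i];
        wd.d[2 * i + 1] = ws.d[2 * i];
    }
    return wd;
}

/* ILVR.D: wd.d[2i] = wt.d[i], wd.d[2i+1] = ws.d[i], for i < D_ELEMS / 2. */
static wr128 ilvr_d_model(wr128 ws, wr128 wt)
{
    wr128 wd;

    for (int i = 0; i < D_ELEMS / 2; i++) {
        wd.d[2 * i]     = wt.d[i];
        wd.d[2 * i + 1] = ws.d[i];
    }
    return wd;
}

/* With a single loop iteration (i = 0), 2i and i are the same index. */
static void ilv_d_selfcheck(void)
{
    wr128 ws = { { 0x1111111111111111ULL, 0x2222222222222222ULL } };
    wr128 wt = { { 0x3333333333333333ULL, 0x4444444444444444ULL } };
    wr128 ev = ilvev_d_model(ws, wt);
    wr128 r  = ilvr_d_model(ws, wt);

    assert(ev.d[0] == r.d[0]);
    assert(ev.d[1] == r.d[1]);
}
```

The two loops differ only in the source index (2i versus i), and for doublewords the loop body runs once with i = 0, so both reduce to wd.d[0] = wt.d[0] and wd.d[1] = ws.d[0].
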

> +
> +/*
>   * [MSA] ILVL.B wd, ws, wt
>   *
>   *   Vector Interleave Left (byte data elements)
> @@ -29380,7 +29470,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_div_u_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVR_df:
> -        gen_helper_msa_ilvr_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_helper_msa_ilvr_b(cpu_env, twd, tws, twt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvr_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvr_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvr_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSL_df:
>          gen_helper_msa_binsl_df(cpu_env, tdf, twd, tws, twt);
> --
> 2.7.4
>
>

Thanks,
Aleksandar

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
@ 2019-04-13 16:05     ` Aleksandar Markovic
  0 siblings, 0 replies; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-13 16:05 UTC (permalink / raw)
  To: Mateja Marjanovic
  Cc: QEMU Developers, Aleksandar Rikalo, Richard Henderson,
	Philippe Mathieu-Daudé,
	Aleksandar Markovic, Aurelien Jarno

On Thu, Apr 4, 2019 at 3:18 PM Mateja Marjanovic
<mateja.marjanovic@rt-rk.com> wrote:
>
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
> directly tcg registers and performing logic on them
> instead of using helpers.
>
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.
>
> Performance measurement is done by executing the
> instructions a large number of times on a computer

What is the exact number of times?

> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>
> ============================================================
> || instr    ||   before    || no-deposit ||  with-deposit ||
> ============================================================
> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
> ============================================================
>

"With-deposit" for ilvev.w can't be the same as in the previous
version (22.17 ms), since you eliminated one tcg_gen_andi_i64() in
this version compared to the previous one. "With-deposit" for ilvev.b
and ilvev.h also can't be the same. It looks like you just copy-pasted
the numbers. Please retest the performance and attach the accurate
numbers.

Also, there should be five columns and their meanings should be:

  - instruction
  - before
  - no-deposit-no-mask-as-tcg-constant
  - with-deposit-no-mask-as-tcg-constant
  - with-deposit-with-mask-as-tcg-constant (final)

> No-deposit column and with-deposit column have the
> same statistical values in every row, except ILVEV.W,
> which is the only function which uses the deposit
> function.
>
> No-deposit version of the ILVEV.W implementation:
>
> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     TCGv_i64 t1 = tcg_temp_new_i64();
>     TCGv_i64 t2 = tcg_temp_new_i64();
>     uint64_t mask = 0x00000000ffffffffULL;
>
>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>     tcg_gen_shli_i64(t2, t2, 32);
>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>
>     tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>     tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>     tcg_gen_shli_i64(t2, t2, 32);
>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(t2);
> }
>
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>

You forgot Philippe.

> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   9 -----
>  target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 100 insertions(+), 11 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 02e16c7..82f6a40 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index a7ea6aa..d5c3842 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>      } while (0)
>  MSA_FN_DF(ilvr_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> -    } while (0)
> -MSA_FN_DF(ilvev_df)
> -#undef MSA_DO
> -
>  #undef MSA_LOOP_COND
>
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index df685e4..3057669 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>  }
>
> +/*
> + * [MSA] ILVEV.B wd, ws, wt
> + *
> + *   Vector Interleave Even (byte data elements)
> + *
> + */
> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVEV.H wd, ws, wt
> + *
> + *   Vector Interleave Even (halfword data elements)
> + *
> + */
> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}

Please apply Philippe's refactoring for the preceding two functions.
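
The refactoring itself is not restated in this message. Purely as an
illustration of the kind of sharing that is possible (and not necessarily
what Philippe proposed), the two functions differ only in the mask and the
shift amount, so a plain-C model of one 64-bit half could look like the
sketch below. All names in it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Interleave the even elements of one 64-bit half.
 * mask selects the even elements; shift is the element width in bits.
 */
static uint64_t ilvev_half_model(uint64_t ws_half, uint64_t wt_half,
                                 uint64_t mask, unsigned shift)
{
    assert(shift > 0 && shift < 64);
    /* Even elements of wt stay put; even elements of ws move one slot up. */
    return (wt_half & mask) | ((ws_half & mask) << shift);
}

/* ILVEV.B on one half: 8-bit elements. */
static uint64_t ilvev_b_half(uint64_t ws_half, uint64_t wt_half)
{
    return ilvev_half_model(ws_half, wt_half, 0x00ff00ff00ff00ffULL, 8);
}

/* ILVEV.H on one half: 16-bit elements. */
static uint64_t ilvev_h_half(uint64_t ws_half, uint64_t wt_half)
{
    return ilvev_half_model(ws_half, wt_half, 0x0000ffff0000ffffULL, 16);
}
```

In the translator the same idea would presumably be one helper that takes
the mask and the shift, called twice (once per doubleword) from each of the
byte and halfword variants.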

> +
> +/*
> + * [MSA] ILVEV.W wd, ws, wt
> + *
> + *   Vector Interleave Even (word data elements)
> + *
> + */
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
> +                        msa_wr_d[ws * 2], 32, 32);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
> +                        msa_wr_d[ws * 2 + 1], 32, 32);
> +}
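
For reference, a plain-C sketch of why the deposit form above is equivalent
to the no-deposit version from the commit message. deposit64_model() is a
hypothetical stand-in written for this note; it models what I understand
tcg_gen_deposit_i64(ret, arg1, arg2, ofs, len) to compute, namely arg1 with
len bits starting at ofs replaced by the low len bits of arg2.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Replace len bits of base, starting at bit ofs, with the low len bits
 * of field. (Hypothetical model, not a QEMU function.)
 */
static uint64_t deposit64_model(uint64_t base, uint64_t field,
                                unsigned ofs, unsigned len)
{
    assert(len > 0 && len <= 64 && ofs <= 64 - len);
    uint64_t mask = (len == 64) ? ~0ULL : ((1ULL << len) - 1);
    return (base & ~(mask << ofs)) | ((field & mask) << ofs);
}

/* One half of ILVEV.W, no-deposit form: two ANDs, a shift and an OR. */
static uint64_t ilvev_w_half_mask(uint64_t ws_half, uint64_t wt_half)
{
    const uint64_t mask = 0x00000000ffffffffULL;
    return (wt_half & mask) | ((ws_half & mask) << 32);
}

/* One half of ILVEV.W, deposit form: a single deposit at bit 32. */
static uint64_t ilvev_w_half_deposit(uint64_t ws_half, uint64_t wt_half)
{
    return deposit64_model(wt_half, ws_half, 32, 32);
}
```

Both forms keep the low word of wt in place and put the low word of ws in
the upper 32 bits; the deposit form simply expresses that as one operation
per doubleword.
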
> +
> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Doubleword data elements)

Doubleword -> doubleword

> + *
> + */
> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVEV_df:
> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvev_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvev_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvev_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvev_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSR_df:
>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> --
> 2.7.4
>
>

Thanks,
Aleksandar

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions
@ 2019-04-13 16:09     ` Aleksandar Markovic
  0 siblings, 0 replies; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-13 16:09 UTC (permalink / raw)
  To: Mateja Marjanovic
  Cc: Aleksandar Rikalo, Richard Henderson, QEMU Developers,
	Aleksandar Markovic, Philippe Mathieu-Daudé,
	Aurelien Jarno

On Thu, Apr 4, 2019 at 3:16 PM Mateja Marjanovic
<mateja.marjanovic@rt-rk.com> wrote:
>
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimize set of MSA instructions ILVOD.<B|H|W|D>, using
> directly tcg registers and performing logic on them instead
> of using helpers.
>

Please see my comments for ILVEV.D.

Thanks,
Aleksandar

> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.
>
> Performance measurement is done by executing the
> instructions a large number of times on a computer
> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>
> ============================================================
> || instr    ||   before    || no-deposit || with-deposit  ||
> ============================================================
> || ilvod.b  ||  117.50 ms  ||  24.13 ms  ||   23.71 ms    ||
> || ilvod.h  ||   93.16 ms  ||  24.21 ms  ||   23.45 ms    ||
> || ilvod.w  ||  119.90 ms  ||  24.15 ms  ||   22.91 ms    ||
> || ilvod.d  ||   43.01 ms  ||  21.17 ms  ||   20.53 ms    ||
> ============================================================
>
> No-deposit column and with-deposit column have the
> same statistical values in every row, except ILVOD.W,
> which is the only function which uses the deposit
> function.
>
> No-deposit version of the ILVOD.W implementation:
>
> static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
>                                uint32_t ws, uint32_t wt)
> {
>     TCGv_i64 t1 = tcg_temp_new_i64();
>     TCGv_i64 t2 = tcg_temp_new_i64();
>     TCGv_i64 mask = tcg_const_i64(0xffffffff00000000ULL);
>
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>     tcg_gen_shri_i64(t1, t1, 32);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>     tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>
>     tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>     tcg_gen_shri_i64(t1, t1, 32);
>     tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>     tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>
>     tcg_temp_free_i64(mask);
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(t2);
> }
>
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   7 ----
>  target/mips/translate.c  | 106 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 105 insertions(+), 9 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 2863f60..02e16c7 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -865,7 +865,6 @@ DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index 6c57281..a7ea6aa 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1206,13 +1206,6 @@ MSA_FN_DF(ilvr_df)
>  MSA_FN_DF(ilvev_df)
>  #undef MSA_DO
>
> -#define MSA_DO(DF)                          \
> -    do {                                    \
> -        pwx->DF[2*i]   = pwt->DF[2*i+1];    \
> -        pwx->DF[2*i+1] = pws->DF[2*i+1];    \
> -    } while (0)
> -MSA_FN_DF(ilvod_df)
> -#undef MSA_DO
>  #undef MSA_LOOP_COND
>
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index bba8b6c..df685e4 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28884,6 +28884,95 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
>      tcg_temp_free_i32(tws);
>  }
>
> +/*
> + * [MSA] ILVOD.B wd, ws, wt
> + *
> + *   Vector Interleave Odd (byte data elements)
> + *
> + */
> +static inline void gen_ilvod_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0xff00ff00ff00ff00ULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 8);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 8);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVOD.H wd, ws, wt
> + *
> + *   Vector Interleave Odd (halfword data elements)
> + *
> + */
> +static inline void gen_ilvod_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0xffff0000ffff0000ULL);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(mask);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVOD.W wd, ws, wt
> + *
> + *   Vector Interleave Odd (word data elements)
> + *
> + */
> +static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +
> +    tcg_gen_shri_i64(t1, msa_wr_d[wt * 2], 32);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[ws * 2], t1, 0, 32);
> +
> +    tcg_gen_shri_i64(t1, msa_wr_d[wt * 2 + 1], 32);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1], t1, 0, 32);
> +
> +    tcg_temp_free_i64(t1);
> +}
> +
> +/*
> + * [MSA] ILVOD.D wd, ws, wt
> + *
> + *   Vector Interleave Odd (doubleword data elements)
> + *
> + */
> +static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2 + 1]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29055,7 +29144,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_mod_u_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVOD_df:
> -        gen_helper_msa_ilvod_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvod_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvod_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvod_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvod_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>
>      case OPC_DOTP_S_df:
> --
> 2.7.4
>
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 3/4] target/mips: Optimize ILVL.<B|H|W|D> MSA instructions
@ 2019-04-13 16:15     ` Aleksandar Markovic
  0 siblings, 0 replies; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-13 16:15 UTC (permalink / raw)
  To: Mateja Marjanovic
  Cc: QEMU Developers, Aleksandar Rikalo, Richard Henderson,
	Philippe Mathieu-Daudé,
	Aleksandar Markovic, Aurelien Jarno

On Thu, Apr 4, 2019 at 3:16 PM Mateja Marjanovic
<mateja.marjanovic@rt-rk.com> wrote:
>
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimized ILVL.<B|H|W|D> instructions, using a hybrid
> approach. For byte data elements, use a helper with an
> unrolled loop (much better performance), for halfword,
> word and doubleword data elements use directly tcg
> registers and logic performed on them.
>
> Performance measurement is done by executing the
> instructions a large number of times on a computer
> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>
> ==================================================
> ||  instr  ||  helper  ||   tcg    ||  hybrid   ||
> ==================================================
> || ilvl.b: || 59.91 ms || 74.41 ms ||  59.24 ms || <-- helper
> || ilvl.h: || 41.33 ms || 33.08 ms ||  32.96 ms || <-- tcg
> || ilvl.w: || 30.99 ms || 22.87 ms ||  22.81 ms || <-- tcg
> || ilvl.d: || 26.40 ms || 19.64 ms ||  19.45 ms || <-- tcg
> ==================================================
>
> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   3 +-
>  target/mips/msa_helper.c |  33 ++++++---
>  target/mips/translate.c  | 184 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 210 insertions(+), 10 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 82f6a40..cd73723 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -862,7 +862,6 @@ DEF_HELPER_5(msa_sld_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_splat_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> @@ -946,6 +945,8 @@ DEF_HELPER_4(msa_insert_h, void, env, i32, i32, i32)
>  DEF_HELPER_4(msa_insert_w, void, env, i32, i32, i32)
>  DEF_HELPER_4(msa_insert_d, void, env, i32, i32, i32)
>
> +DEF_HELPER_4(msa_ilvl_b, void, env, i32, i32, i32)
> +
>  DEF_HELPER_4(msa_fclass_df, void, env, i32, i32, i32)
>  DEF_HELPER_4(msa_ftrunc_s_df, void, env, i32, i32, i32)
>  DEF_HELPER_4(msa_ftrunc_u_df, void, env, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index d5c3842..84bbe6f 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1184,14 +1184,6 @@ MSA_FN_DF(pckod_df)
>
>  #define MSA_DO(DF)                      \
>      do {                                \
> -        pwx->DF[2*i]   = L##DF(pwt, i); \
> -        pwx->DF[2*i+1] = L##DF(pws, i); \
> -    } while (0)
> -MSA_FN_DF(ilvl_df)
> -#undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
>          pwx->DF[2*i]   = R##DF(pwt, i); \
>          pwx->DF[2*i+1] = R##DF(pws, i); \
>      } while (0)
> @@ -1232,6 +1224,31 @@ void helper_msa_splati_df(CPUMIPSState *env, uint32_t df, uint32_t wd,
>      msa_splat_df(df, pwd, pws, n);
>  }
>
> +void helper_msa_ilvl_b(CPUMIPSState *env, uint32_t wd,
> +                       uint32_t ws, uint32_t wt)
> +{
> +    wr_t *pwd = &(env->active_fpu.fpr[wd].wr);
> +    wr_t *pws = &(env->active_fpu.fpr[ws].wr);
> +    wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
> +
> +    pwd->b[0]  = pwt->b[8];
> +    pwd->b[1]  = pws->b[8];
> +    pwd->b[2]  = pwt->b[9];
> +    pwd->b[3]  = pws->b[9];
> +    pwd->b[4]  = pwt->b[10];
> +    pwd->b[5]  = pws->b[10];
> +    pwd->b[6]  = pwt->b[11];
> +    pwd->b[7]  = pws->b[11];
> +    pwd->b[8]  = pwt->b[12];
> +    pwd->b[9]  = pws->b[12];
> +    pwd->b[10] = pwt->b[13];
> +    pwd->b[11] = pws->b[13];
> +    pwd->b[12] = pwt->b[14];
> +    pwd->b[13] = pws->b[14];
> +    pwd->b[14] = pwt->b[15];
> +    pwd->b[15] = pws->b[15];
> +}
> +
>  void helper_msa_copy_s_b(CPUMIPSState *env, uint32_t rd,
>                           uint32_t ws, uint32_t n)
>  {
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index 3057669..6c6811e 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28885,6 +28885,173 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
>  }
>
>  /*
> + * [MSA] ILVL.B wd, ws, wt
> + *
> + *   Vector Interleave Left (byte data elements)
> + *
> + */
> +static inline void gen_ilvl_b(CPUMIPSState *env, uint32_t wd,
> +                              uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    uint64_t mask = 0x00000000000000ffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 8);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 8;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 8);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 8;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 24);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 8;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 24);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 32);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
> +
> +    mask <<= 8;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 32);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 24);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 8;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 24);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 8;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 8);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 8;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 8);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVL.H wd, ws, wt
> + *
> + *   Vector Interleave Left (halfword data elements)
> + *
> + */
> +static inline void gen_ilvl_h(CPUMIPSState *env, uint32_t wd,
> +                              uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    uint64_t mask = 0x000000000000ffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 16;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 32);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
> +
> +    mask <<= 16;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 32);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +
> +    mask <<= 16;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 16);
> +    tcg_gen_or_i64(t2, t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVL.W wd, ws, wt
> + *
> + *   Vector Interleave Left (word data elements)
> + *
> + */
> +static inline void gen_ilvl_w(CPUMIPSState *env, uint32_t wd,
> +                              uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    uint64_t mask = 0x00000000ffffffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t1, t1, 32);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
> +
> +    mask <<= 32;
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_shri_i64(t1, t1, 32);
> +    tcg_gen_mov_i64(t2, t1);
> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVL.D wd, ws, wt
> + *
> + *   Vector Interleave Left (doubleword data elements)
> + *
> + */
> +static inline void gen_ilvl_d(CPUMIPSState *env, uint32_t wd,
> +                              uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2 + 1]);

This code introduces a bug for the case wd == wt: the first move overwrites
msa_wr_d[wd * 2 + 1] before the second move reads msa_wr_d[wt * 2 + 1].
You keep repeating the same mistake over and over.

Please see also my comments for ILVR.D.
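
To make the wd == wt case concrete, here is a minimal plain-C sketch. It is
not QEMU code: wr_d[], set_reg() and run_aliased() are hypothetical stand-ins
that only mimic the msa_wr_d[reg * 2 + half] indexing, and the reordered
variant is just one possible fix, not necessarily the one that should go
into v7.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the MSA register file: 32 registers, two
 * 64-bit halves each, indexed like msa_wr_d[] (reg * 2 + half). */
static uint64_t wr_d[32 * 2];

static void set_reg(unsigned r, uint64_t lo, uint64_t hi)
{
    assert(r < 32);
    wr_d[r * 2]     = lo;
    wr_d[r * 2 + 1] = hi;
}

/* Same order of moves as gen_ilvl_d() in the quoted patch. */
static void ilvl_d_as_posted(unsigned wd, unsigned ws, unsigned wt)
{
    wr_d[wd * 2 + 1] = wr_d[ws * 2 + 1];
    wr_d[wd * 2]     = wr_d[wt * 2 + 1];   /* stale value if wd == wt */
}

/* Assumed fix: write the lower half first. Only upper halves are read,
 * so nothing that is still needed gets overwritten. */
static void ilvl_d_reordered(unsigned wd, unsigned ws, unsigned wt)
{
    wr_d[wd * 2]     = wr_d[wt * 2 + 1];
    wr_d[wd * 2 + 1] = wr_d[ws * 2 + 1];
}

/* Models "ILVL.D w1, w2, w1" and returns the lower half of w1.
 * The architecturally expected result is the old upper half of w1. */
static uint64_t run_aliased(void (*op)(unsigned, unsigned, unsigned))
{
    set_reg(1, 0x1111, 0x2222);   /* wt, which is also wd */
    set_reg(2, 0x3333, 0x4444);   /* ws */
    op(1, 2, 1);
    return wr_d[1 * 2];
}
```

Calling run_aliased(ilvl_d_as_posted) yields the upper half of ws instead of
the old upper half of wt, while run_aliased(ilvl_d_reordered) yields the
expected value.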

Thanks,
Aleksandar

> +}
> +
> +/*
>   * [MSA] ILVOD.B wd, ws, wt
>   *
>   *   Vector Interleave Odd (byte data elements)
> @@ -29177,7 +29344,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_div_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVL_df:
> -        gen_helper_msa_ilvl_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_helper_msa_ilvl_b(cpu_env, twd, tws, twt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvl_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvl_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvl_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BNEG_df:
>          gen_helper_msa_bneg_df(cpu_env, tdf, twd, tws, twt);
> --
> 2.7.4
>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> MSA instructions
@ 2019-04-15 11:24       ` Mateja Marjanovic
  0 siblings, 0 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-15 11:24 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: QEMU Developers, Aleksandar Rikalo, Richard Henderson,
	Philippe Mathieu-Daudé,
	Aleksandar Markovic, Aurelien Jarno


On 13.4.19. 18:05, Aleksandar Markovic wrote:
> On Thu, Apr 4, 2019 at 3:16 PM Mateja Marjanovic
> <mateja.marjanovic@rt-rk.com> wrote:
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>
>> Optimized ILVR.<B|H|W|D> instructions, using a hybrid
> Optimized -> Optimize
>
>> approach. For byte data elements, use a helper with an
>> unrolled loop (much better performance), for halfword,
> (much better performance) -> (having much better performance
> than direct tcg translation)
>
>> word and doubleword data elements use directly tcg
>> registers and logic performed on them.
>>
>> Performance measurement is done by executing the
>> instructions a large number of times on a computer
>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>
>> ===================================================
>> ||  instr  ||  helper  ||    tcg    ||   hybrid  ||
>> ===================================================
>> || ilvr.b: || 62.87 ms ||  74.76 ms ||  61.52 ms || <-- helper
>> || ilvr.h: || 44.11 ms ||  33.00 ms ||  33.55 ms || <-- tcg
>> || ilvr.w: || 34.97 ms ||  23.06 ms ||  22.67 ms || <-- tcg
>> || ilvr.d: || 27.33 ms ||  19.87 ms ||  20.02 ms || <-- tcg
>> ===================================================
>>
> instr -> instruction
>
> ||  61.52 ms || <-- helper  ->  ||  61.52 ms (helper) ||
>
> and similarly for the other three rows.
I will change those three in v7.
>
>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>> ---
>>   target/mips/helper.h     |   2 +-
>>   target/mips/msa_helper.c |  33 +++++++++++----
>>   target/mips/translate.c  | 107 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 132 insertions(+), 10 deletions(-)
>>
>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>> index cd73723..d4755ef 100644
>> --- a/target/mips/helper.h
>> +++ b/target/mips/helper.h
>> @@ -862,7 +862,6 @@ DEF_HELPER_5(msa_sld_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_splat_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>> -DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>> @@ -946,6 +945,7 @@ DEF_HELPER_4(msa_insert_w, void, env, i32, i32, i32)
>>   DEF_HELPER_4(msa_insert_d, void, env, i32, i32, i32)
>>
>>   DEF_HELPER_4(msa_ilvl_b, void, env, i32, i32, i32)
>> +DEF_HELPER_4(msa_ilvr_b, void, env, i32, i32, i32)
>>
>>   DEF_HELPER_4(msa_fclass_df, void, env, i32, i32, i32)
>>   DEF_HELPER_4(msa_ftrunc_s_df, void, env, i32, i32, i32)
>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>> index 84bbe6f..2470cef 100644
>> --- a/target/mips/msa_helper.c
>> +++ b/target/mips/msa_helper.c
>> @@ -1181,14 +1181,6 @@ MSA_FN_DF(pckev_df)
>>       } while (0)
>>   MSA_FN_DF(pckod_df)
>>   #undef MSA_DO
>> -
>> -#define MSA_DO(DF)                      \
>> -    do {                                \
>> -        pwx->DF[2*i]   = R##DF(pwt, i); \
>> -        pwx->DF[2*i+1] = R##DF(pws, i); \
>> -    } while (0)
>> -MSA_FN_DF(ilvr_df)
>> -#undef MSA_DO
>>   #undef MSA_LOOP_COND
>>
>>   #define MSA_LOOP_COND(DF) \
>> @@ -1249,6 +1241,31 @@ void helper_msa_ilvl_b(CPUMIPSState *env, uint32_t wd,
>>       pwd->b[15] = pws->b[15];
>>   }
>>
>> +void helper_msa_ilvr_b(CPUMIPSState *env, uint32_t wd,
>> +                       uint32_t ws, uint32_t wt)
>> +{
>> +    wr_t *pwd = &(env->active_fpu.fpr[wd].wr);
>> +    wr_t *pws = &(env->active_fpu.fpr[ws].wr);
>> +    wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
>> +
> Why do we use env->active_fpu.fpr[wd].wr here, while for the other
> instructions in this patch we access msa_wr_d[] directly?
With a pointer to wr_t we have an array of bytes, halfwords, words or
doublewords, and can read and modify the elements like an ordinary array.
In the other cases we use a variable of type TCGv_i64 and would have to use
tcg_gen functions to modify the value of the register. Before my changes, the
ilvr instruction helpers used env->active_fpu.fpr[wd].wr, so I just
copy-pasted that.
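
As an illustration of the "ordinary array" view, here is a small plain-C
sketch. vec128_t is a simplified, hypothetical stand-in for wr_t, and both
functions are models written for this example, not QEMU helpers. The first
follows the shape of the old loop-based helper, which built the result in a
temporary; the second walks the elements in the same descending order as
the unrolled helper_msa_ilvr_b() in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for wr_t: one 128-bit register viewed as arrays
 * of different element widths. */
typedef union {
    int8_t  b[16];
    int16_t h[8];
    int32_t w[4];
    int64_t d[2];
} vec128_t;

/* Reference model of ILVR.B: build the result in a temporary first. */
static void ilvr_b_ref(vec128_t *wd, const vec128_t *ws, const vec128_t *wt)
{
    vec128_t tmp;
    for (int i = 0; i < 8; i++) {
        tmp.b[2 * i]     = wt->b[i];
        tmp.b[2 * i + 1] = ws->b[i];
    }
    *wd = tmp;
}

/* In-place model, descending order as in the unrolled helper: each
 * source byte is read before its slot can be overwritten. */
static void ilvr_b_in_place(vec128_t *wd, const vec128_t *ws,
                            const vec128_t *wt)
{
    for (int i = 7; i >= 0; i--) {
        wd->b[2 * i + 1] = ws->b[i];
        wd->b[2 * i]     = wt->b[i];
    }
}

/* Returns 1 if both models agree when wd aliases wt and when wd
 * aliases ws. */
static int ilvr_b_variants_agree(void)
{
    vec128_t a, b, s, t;
    assert(sizeof(vec128_t) == 16);
    for (int i = 0; i < 16; i++) {
        a.b[i] = (int8_t)(i + 1);
        s.b[i] = (int8_t)(0x40 + i);
    }
    b = a;
    ilvr_b_ref(&a, &s, &a);            /* wd == wt */
    ilvr_b_in_place(&b, &s, &b);
    if (memcmp(&a, &b, sizeof(a)) != 0) {
        return 0;
    }
    t = s;
    b = a;
    ilvr_b_ref(&a, &a, &t);            /* wd == ws */
    ilvr_b_in_place(&b, &b, &t);
    return memcmp(&a, &b, sizeof(a)) == 0;
}
```

The point of the sketch is only that a helper runs as ordinary host C code
on the register contents, whereas the TCGv_i64 path has to express every
step as a generated operation.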
>
>> +    pwd->b[15] = pws->b[7];
>> +    pwd->b[14] = pwt->b[7];
>> +    pwd->b[13] = pws->b[6];
>> +    pwd->b[12] = pwt->b[6];
>> +    pwd->b[11] = pws->b[5];
>> +    pwd->b[10] = pwt->b[5];
>> +    pwd->b[9]  = pws->b[4];
>> +    pwd->b[8]  = pwt->b[4];
>> +    pwd->b[7]  = pws->b[3];
>> +    pwd->b[6]  = pwt->b[3];
>> +    pwd->b[5]  = pws->b[2];
>> +    pwd->b[4]  = pwt->b[2];
>> +    pwd->b[3]  = pws->b[1];
>> +    pwd->b[2]  = pwt->b[1];
>> +    pwd->b[1]  = pws->b[0];
>> +    pwd->b[0]  = pwt->b[0];
>> +}
>> +
>>   void helper_msa_copy_s_b(CPUMIPSState *env, uint32_t rd,
>>                            uint32_t ws, uint32_t n)
>>   {
>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>> index 6c6811e..90332fb 100644
>> --- a/target/mips/translate.c
>> +++ b/target/mips/translate.c
>> @@ -28885,6 +28885,96 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
>>   }
>>
>>   /*
>> + * [MSA] ILVR.H wd, ws, wt
>> + *
>> + *   Vector Interleave Right (halfword data elements)
>> + *
>> + */
>> +static inline void gen_ilvr_h(CPUMIPSState *env, uint32_t wd,
>> +                              uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    uint64_t mask = 0x000000000000ffffULL;
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_mov_i64(t2, t1);
>> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t1, t1, 16);
>> +    tcg_gen_or_i64(t2, t2, t1);
>> +
>> +    mask <<= 16;
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_shli_i64(t1, t1, 16);
>> +    tcg_gen_or_i64(t2, t2, t1);
>> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t1, t1, 32);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
>> +
>> +    mask <<= 16;
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_shri_i64(t1, t1, 32);
>> +    tcg_gen_mov_i64(t2, t1);
>> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shri_i64(t1, t1, 16);
>> +    tcg_gen_or_i64(t2, t2, t1);
>> +
>> +    mask <<= 16;
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_shri_i64(t1, t1, 16);
>> +    tcg_gen_or_i64(t2, t2, t1);
>> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
>> +
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
>> +
>> +/*
>> + * [MSA] ILVR.W wd, ws, wt
>> + *
>> + *   Vector Interleave Right (word data elements)
>> + *
>> + */
>> +static inline void gen_ilvr_w(CPUMIPSState *env, uint32_t wd,
>> +                              uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    uint64_t mask = 0x00000000ffffffffULL;
> Use tcg_const_i64(). The same for the previous function.
Will do in v7.
>
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_mov_i64(t2, t1);
>> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t1, t1, 32);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t2, t1);
>> +
>> +    mask <<= 32;
> Just assign the constant value to the mask, no need for shift operation.
> The same applies for other similar cases in this patch.
I was not sure which would perform better, so I went with the shift.
I will add a variant that assigns a constant to a register and test
the performance.
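
For reference, the mask handling in gen_ilvr_h() can be modelled in plain
C. In the v6 code as quoted, mask is a host-side uint64_t, so
"mask <<= 16" runs in the translator and each tcg_gen_andi_i64() call
receives an immediate either way. The sketch below (hypothetical names,
not QEMU code) checks that shifted masks and literal masks are the same
values and that the op sequence yields the expected interleave.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Plain-C model of the gen_ilvr_h() op sequence. lo and hi are the two
 * 64-bit halves of wd, ws0 and wt0 are the low halves of ws and wt, and
 * m[] holds the four halfword masks. Sources are passed by value, so the
 * model has no aliasing concerns.
 */
static void ilvr_h_model(uint64_t *lo, uint64_t *hi, uint64_t ws0,
                         uint64_t wt0, const uint64_t m[4])
{
    *lo = (wt0 & m[0]) | ((ws0 & m[0]) << 16) |
          ((wt0 & m[1]) << 16) | ((ws0 & m[1]) << 32);
    *hi = ((wt0 & m[2]) >> 32) | ((ws0 & m[2]) >> 16) |
          ((wt0 & m[3]) >> 16) | (ws0 & m[3]);
}

/* Masks derived by shifting one host-side variable, as in the v6 patch. */
static void masks_by_shift(uint64_t m[4])
{
    uint64_t mask = 0x000000000000ffffULL;

    for (int i = 0; i < 4; i++) {
        m[i] = mask;
        mask <<= 16;
    }
}

/* Masks written out as literal constants, as the review suggests. */
static void masks_by_literal(uint64_t m[4])
{
    m[0] = 0x000000000000ffffULL;
    m[1] = 0x00000000ffff0000ULL;
    m[2] = 0x0000ffff00000000ULL;
    m[3] = 0xffff000000000000ULL;
}
```

If the mask is later moved into a TCG constant, as requested above, the
shift could no longer be done on a host variable, so literal constants
would then be the simpler option.
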
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_shri_i64(t1, t1, 32);
>> +    tcg_gen_mov_i64(t2, t1);
>> +    tcg_gen_andi_i64(t1, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t2, t1);
>> +
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
>> +
>> +/*
>> + * [MSA] ILVR.D wd, ws, wt
>> + *
>> + *   Vector Interleave Right (doubleword data elements)
>> + *
>> + */
>> +static inline void gen_ilvr_d(CPUMIPSState *env, uint32_t wd,
>> +                              uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>> +}
>> +
> This function seems to be identical to gen_ilvev_d(). Please,
> if that is the case, in this patch rename gen_ilvev_d() to gen_ilvev_ilvr_d(),
> and use it for handling both ILVEV.D and ILVR.D.
I didn't notice that. I will check, and if you are right, I will do that 
in v7.
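
The overlap can be checked against the element-wise definitions taken
from the removed MSA_DO macros. The following is a small model under the
assumption that a register is an array of n equally sized elements; the
function names are hypothetical. With two doubleword elements both
instructions reduce to wd[0] = wt[0], wd[1] = ws[0]; for more elements
they differ.

```c
#include <assert.h>
#include <stdint.h>

/* Element-wise ILVR for n elements: uses the right (low-index) half. */
static void ilvr_model(uint64_t *wd, const uint64_t *ws,
                       const uint64_t *wt, int n)
{
    for (int i = 0; i < n / 2; i++) {
        wd[2 * i]     = wt[i];
        wd[2 * i + 1] = ws[i];
    }
}

/* Element-wise ILVEV for n elements: uses the even-indexed elements. */
static void ilvev_model(uint64_t *wd, const uint64_t *ws,
                        const uint64_t *wt, int n)
{
    for (int i = 0; i < n / 2; i++) {
        wd[2 * i]     = wt[2 * i];
        wd[2 * i + 1] = ws[2 * i];
    }
}

/* Returns 1 if both models agree for n elements (n <= 16) on a test input. */
static int ilvr_equals_ilvev(int n)
{
    uint64_t ws[16], wt[16], r[16], e[16];

    for (int i = 0; i < n; i++) {
        ws[i] = 0x100 + i;
        wt[i] = 0x200 + i;
    }
    ilvr_model(r, ws, wt, n);
    ilvev_model(e, ws, wt, n);
    for (int i = 0; i < n; i++) {
        if (r[i] != e[i]) {
            return 0;
        }
    }
    return 1;
}
```
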
>> +
>> +/*
>>    * [MSA] ILVL.B wd, ws, wt
>>    *
>>    *   Vector Interleave Left (byte data elements)
>> @@ -29380,7 +29470,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>           gen_helper_msa_div_u_df(cpu_env, tdf, twd, tws, twt);
>>           break;
>>       case OPC_ILVR_df:
>> -        gen_helper_msa_ilvr_df(cpu_env, tdf, twd, tws, twt);
>> +        switch (df) {
>> +        case DF_BYTE:
>> +            gen_helper_msa_ilvr_b(cpu_env, twd, tws, twt);
>> +            break;
>> +        case DF_HALF:
>> +            gen_ilvr_h(env, wd, ws, wt);
>> +            break;
>> +        case DF_WORD:
>> +            gen_ilvr_w(env, wd, ws, wt);
>> +            break;
>> +        case DF_DOUBLE:
>> +            gen_ilvr_d(env, wd, ws, wt);
>> +            break;
>> +        default:
>> +            assert(0);
>> +        }
>>           break;
>>       case OPC_BINSL_df:
>>           gen_helper_msa_binsl_df(cpu_env, tdf, twd, tws, twt);
>> --
>> 2.7.4
>>
>>
> Thanks,
> Aleksandar
Thanks,
Mateja

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
@ 2019-04-15 13:48       ` Mateja Marjanovic
  0 siblings, 0 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-15 13:48 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: QEMU Developers, Aleksandar Rikalo, Richard Henderson,
	Philippe Mathieu-Daudé,
	Aleksandar Markovic, Aurelien Jarno


On 13.4.19. 18:05, Aleksandar Markovic wrote:
> On Thu, Apr 4, 2019 at 3:18 PM Mateja Marjanovic
> <mateja.marjanovic@rt-rk.com> wrote:
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>
>> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
>> directly tcg registers and performing logic on them
>> instead of using helpers.
>>
>> In the following table, the first column is the performance
>> before this patch. The second represents the performance,
>> after converting from helpers to tcg, but without using
>> tcg_gen_deposit function. The third one is the solution
>> which is implemented in this patch.
>>
>> Performance measurement is done by executing the
>> instructions a large number of times on a computer
> What is the exact number of times?
I will add that from now on.
>
>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>
>> ============================================================
>> || instr    ||   before    || no-deposit ||  with-deposit ||
>> ============================================================
>> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
>> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
>> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
>> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
>> ============================================================
>>
> "With-deposit" for ilvev.w can't be the same as in the previous
> version (22.17 ms), since you eliminated one tcg_gen_andi_i64() in
> this version compared to the previous one. "With-deposit" for ilvev.b
> and ilvev.h also can't be the same. It looks like you just copy-pasted the
> numbers. Please retest the performance and attach the accurate
> numbers.
My mistake, I will add it in v7.
>
> Also, there should be five columns and their meanings should be:
>
>    - instruction
>    - before
>    - no-deposit-no-mask-as-tcg-constant
>    - with-deposit-no-mask-as-tcg-constant
>    - with-deposit-with-mask-as-tcg-constant (final)
Alright, but the deposit function and the mask as a tcg constant
are optimizations for two different problems. The deposit
function is used only for the word case, and the mask as a tcg
constant only for the halfword and byte cases.
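
To make the word case concrete, here is a plain-C sketch of what a
deposit with offset 32 and length 32 computes, compared with the
mask/shift/or form of the no-deposit version. deposit64_model is a
hypothetical model written for this comparison, not the TCG
implementation, and each function handles one 64-bit half of the
register.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a 64-bit deposit: return base with bits [ofs, ofs + len)
 * replaced by the low len bits of field. Requires len > 0 and
 * ofs + len <= 64.
 */
static uint64_t deposit64_model(uint64_t base, uint64_t field,
                                unsigned ofs, unsigned len)
{
    uint64_t mask = (len == 64 ? ~0ULL : (1ULL << len) - 1) << ofs;

    return (base & ~mask) | ((field << ofs) & mask);
}

/* One half of ILVEV.W in mask/shift/or form (the no-deposit version). */
static uint64_t ilvev_w_masked(uint64_t ws, uint64_t wt)
{
    const uint64_t mask = 0x00000000ffffffffULL;

    return (wt & mask) | ((ws & mask) << 32);
}

/* The same half expressed as a single deposit (the with-deposit version). */
static uint64_t ilvev_w_deposit(uint64_t ws, uint64_t wt)
{
    return deposit64_model(wt, ws, 32, 32);
}
```

Both forms keep the low word of wt and place the low word of ws above
it, so the deposit replaces two ANDs, a shift and an OR per half.
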
>> No-deposit column and with-deposit column have the
>> same statistical values in every row, except ILVEV.W,
>> which is the only function which uses the deposit
>> function.
>>
>> No-deposit version of the ILVEV.W implementation:
>>
>> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>                                 uint32_t ws, uint32_t wt)
>> {
>>      TCGv_i64 t1 = tcg_temp_new_i64();
>>      TCGv_i64 t2 = tcg_temp_new_i64();
>>      uint64_t mask = 0x00000000ffffffffULL;
>>
>>      tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>      tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>      tcg_gen_shli_i64(t2, t2, 32);
>>      tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>
>>      tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>      tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>      tcg_gen_shli_i64(t2, t2, 32);
>>      tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>
>>      tcg_temp_free_i64(t1);
>>      tcg_temp_free_i64(t2);
>> }
>>
>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> You forgot Philippe.
I will add Philippe, of course.
>
>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>> ---
>>   target/mips/helper.h     |   1 -
>>   target/mips/msa_helper.c |   9 -----
>>   target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 100 insertions(+), 11 deletions(-)
>>
>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>> index 02e16c7..82f6a40 100644
>> --- a/target/mips/helper.h
>> +++ b/target/mips/helper.h
>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>> index a7ea6aa..d5c3842 100644
>> --- a/target/mips/msa_helper.c
>> +++ b/target/mips/msa_helper.c
>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>       } while (0)
>>   MSA_FN_DF(ilvr_df)
>>   #undef MSA_DO
>> -
>> -#define MSA_DO(DF)                      \
>> -    do {                                \
>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>> -    } while (0)
>> -MSA_FN_DF(ilvev_df)
>> -#undef MSA_DO
>> -
>>   #undef MSA_LOOP_COND
>>
>>   #define MSA_LOOP_COND(DF) \
>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>> index df685e4..3057669 100644
>> --- a/target/mips/translate.c
>> +++ b/target/mips/translate.c
>> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>       tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>   }
>>
>> +/*
>> + * [MSA] ILVEV.B wd, ws, wt
>> + *
>> + *   Vector Interleave Even (byte data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.H wd, ws, wt
>> + *
>> + *   Vector Interleave Even (halfword data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
> Please apply Philippe's refactoring for the preceding two functions.
I will in v7.
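
The two functions above differ only in the mask and the shift amount,
which a plain-C model makes visible. This is a sketch with hypothetical
names (ilvev_half, ILVEV_MASK_B, ILVEV_MASK_H, ilvev_b_reference); it is
not taken from any posted refactoring and handles one 64-bit half of the
register.

```c
#include <assert.h>
#include <stdint.h>

#define ILVEV_MASK_B 0x00ff00ff00ff00ffULL  /* even bytes of a 64-bit half */
#define ILVEV_MASK_H 0x0000ffff0000ffffULL  /* even halfwords of a half */

/*
 * One half of ILVEV.B or ILVEV.H: keep the even elements of wt in place
 * and move the even elements of ws into the odd slots.
 */
static uint64_t ilvev_half(uint64_t ws, uint64_t wt, uint64_t mask,
                           unsigned shift)
{
    return (wt & mask) | ((ws & mask) << shift);
}

/* Byte-by-byte reference for ILVEV.B on one half, used as a cross-check. */
static uint64_t ilvev_b_reference(uint64_t ws, uint64_t wt)
{
    uint64_t r = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t even_t = (wt >> (16 * i)) & 0xff;  /* wt byte 2i */
        uint64_t even_s = (ws >> (16 * i)) & 0xff;  /* ws byte 2i */

        r |= even_t << (16 * i);        /* result byte 2i */
        r |= even_s << (16 * i + 8);    /* result byte 2i + 1 */
    }
    return r;
}
```
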
>
>> +
>> +/*
>> + * [MSA] ILVEV.W wd, ws, wt
>> + *
>> + *   Vector Interleave Even (word data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
>> +                        msa_wr_d[ws * 2], 32, 32);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
>> +                        msa_wr_d[ws * 2 + 1], 32, 32);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.D wd, ws, wt
>> + *
>> + *   Vector Interleave Even (Doubleword data elements)
> Doubleword -> doubleword
It will be changed in v7.
>
>> + *
>> + */
>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>> +}
>> +
>>   static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>   {
>>   #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>           gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>           break;
>>       case OPC_ILVEV_df:
>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>> +        switch (df) {
>> +        case DF_BYTE:
>> +            gen_ilvev_b(env, wd, ws, wt);
>> +            break;
>> +        case DF_HALF:
>> +            gen_ilvev_h(env, wd, ws, wt);
>> +            break;
>> +        case DF_WORD:
>> +            gen_ilvev_w(env, wd, ws, wt);
>> +            break;
>> +        case DF_DOUBLE:
>> +            gen_ilvev_d(env, wd, ws, wt);
>> +            break;
>> +        default:
>> +            assert(0);
>> +        }
>>           break;
>>       case OPC_BINSR_df:
>>           gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>> --
>> 2.7.4
>>
>>
> Thanks,
> Aleksandar
Thanks,
Mateja

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
@ 2019-04-15 13:48       ` Mateja Marjanovic
  0 siblings, 0 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-15 13:48 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: Aleksandar Rikalo, Richard Henderson, QEMU Developers,
	Aleksandar Markovic, Philippe Mathieu-Daudé,
	Aurelien Jarno


On 13.4.19. 18:05, Aleksandar Markovic wrote:
> On Thu, Apr 4, 2019 at 3:18 PM Mateja Marjanovic
> <mateja.marjanovic@rt-rk.com> wrote:
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>
>> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
>> directly tcg registers and performing logic on them
>> instead of using helpers.
>>
>> In the following table, the first column is the performance
>> before this patch. The second represents the performance,
>> after converting from helpers to tcg, but without using
>> tcg_gen_deposit function. The third one is the solution
>> which is implemented in this patch.
>>
>> Performance measurement is done by executing the
>> instructions a large number of times on a computer
> What is the exact number of times?
I will add that from now on.
>
>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>
>> ============================================================
>> || instr    ||   before    || no-deposit ||  with-deposit ||
>> ============================================================
>> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
>> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
>> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
>> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
>> ============================================================
>>
> "With-deposit" for ilvev.w can't be the same as in the previous
> version (22.17 ms), since you eliminated one tcg_gen_andi_i64() in
> this version compared to the previous one. "With-deposit" for ilvev.wb
> and ilvev.h also can't be the same. It looks you just copy-pasted the
> numbers. Please retest the performance and attach the accurate
> numbers.
My mistake, I will add it in v7.
>
> Also, there should be five columns and their meanings should be:
>
>    - instruction
>    - before
>    - no-deposit-no-mask-as-tcg-constant
>    - with-deposit-no-mask-as-tcg-constant
>    - with-deposit-with-mask-as-tcg-constant (final)
Alright, but the deposit function and mask as a tcg constant
are optimizations for two different problems. The deposit
function is used only in case of word, and mask as a tcg
constant in halfword and byte.
>> No-deposit column and with-deposit column have the
>> same statistical values in every row, except ILVEV.W,
>> which is the only function which uses the deposit
>> function.
>>
>> No-deposit version of the ILVEV.W implementation:
>>
>> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>                                 uint32_t ws, uint32_t wt)
>> {
>>      TCGv_i64 t1 = tcg_temp_new_i64();
>>      TCGv_i64 t2 = tcg_temp_new_i64();
>>      uint64_t mask = 0x00000000ffffffffULL;
>>
>>      tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>      tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>      tcg_gen_shli_i64(t2, t2, 32);
>>      tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>
>>      tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>      tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>      tcg_gen_shli_i64(t2, t2, 32);
>>      tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>
>>      tcg_temp_free_i64(t1);
>>      tcg_temp_free_i64(t2);
>> }
>>
>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> You forgot Philippe.
I will add Philippe, ofcourse.
>
>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>> ---
>>   target/mips/helper.h     |   1 -
>>   target/mips/msa_helper.c |   9 -----
>>   target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 100 insertions(+), 11 deletions(-)
>>
>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>> index 02e16c7..82f6a40 100644
>> --- a/target/mips/helper.h
>> +++ b/target/mips/helper.h
>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>> index a7ea6aa..d5c3842 100644
>> --- a/target/mips/msa_helper.c
>> +++ b/target/mips/msa_helper.c
>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>       } while (0)
>>   MSA_FN_DF(ilvr_df)
>>   #undef MSA_DO
>> -
>> -#define MSA_DO(DF)                      \
>> -    do {                                \
>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>> -    } while (0)
>> -MSA_FN_DF(ilvev_df)
>> -#undef MSA_DO
>> -
>>   #undef MSA_LOOP_COND
>>
>>   #define MSA_LOOP_COND(DF) \
>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>> index df685e4..3057669 100644
>> --- a/target/mips/translate.c
>> +++ b/target/mips/translate.c
>> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>       tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>   }
>>
>> +/*
>> + * [MSA] ILVEV.B wd, ws, wt
>> + *
>> + *   Vector Interleave Even (byte data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.H wd, ws, wt
>> + *
>> + *   Vector Interleave Even (halfword data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
> Please apply Philippe's refactoring for the preceding two functions.
I will in v7.
>
>> +
>> +/*
>> + * [MSA] ILVEV.W wd, ws, wt
>> + *
>> + *   Vector Interleave Even (word data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
>> +                        msa_wr_d[ws * 2], 32, 32);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
>> +                        msa_wr_d[ws * 2 + 1], 32, 32);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.D wd, ws, wt
>> + *
>> + *   Vector Interleave Even (Doubleword data elements)
> Doubleword -> doubleword
It will be changed in v7.
>
>> + *
>> + */
>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>> +}
>> +
>>   static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>   {
>>   #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>           gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>           break;
>>       case OPC_ILVEV_df:
>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>> +        switch (df) {
>> +        case DF_BYTE:
>> +            gen_ilvev_b(env, wd, ws, wt);
>> +            break;
>> +        case DF_HALF:
>> +            gen_ilvev_h(env, wd, ws, wt);
>> +            break;
>> +        case DF_WORD:
>> +            gen_ilvev_w(env, wd, ws, wt);
>> +            break;
>> +        case DF_DOUBLE:
>> +            gen_ilvev_d(env, wd, ws, wt);
>> +            break;
>> +        default:
>> +            assert(0);
>> +        }
>>           break;
>>       case OPC_BINSR_df:
>>           gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>> --
>> 2.7.4
>>
>>
> Thanks,
> Aleksandar
Thanks,
Mateja
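
A note on the order of the two moves in gen_ilvev_d() quoted above: the
v4 changelog mentions a fix for the case where the destination and one
of the sources are the same register. The sketch below is a plain C
model, not QEMU code, of why the order matters when wd == ws. The array
r[] stands in for msa_wr_d[], and both function names are hypothetical.
That the move order is what the v4 fix changed is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of ILVEV.D on a flat array of 64-bit halves, indexed like
 * msa_wr_d[]: r[n * 2] is the low half of register n, r[n * 2 + 1]
 * the high half.
 *
 * Order used in the patch: write the high half first. If wd == ws,
 * r[ws * 2] is read before r[wd * 2] is overwritten.
 */
void ilvev_d_model(uint64_t *r, unsigned wd, unsigned ws, unsigned wt)
{
    assert(r != 0);
    r[wd * 2 + 1] = r[ws * 2];
    r[wd * 2]     = r[wt * 2];
}

/*
 * Swapped order, shown only for contrast: if wd == ws, the first
 * move destroys r[ws * 2] before the second move reads it.
 */
void ilvev_d_model_swapped(uint64_t *r, unsigned wd, unsigned ws,
                           unsigned wt)
{
    assert(r != 0);
    r[wd * 2]     = r[wt * 2];
    r[wd * 2 + 1] = r[ws * 2];
}
```

With wd == ws, the first variant leaves the old low half of ws in the
high half of wd, as the instruction requires. The swapped variant
copies the low half of wt into both halves.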


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> MSA instructions
@ 2019-04-16 21:20         ` Aleksandar Markovic
  0 siblings, 0 replies; 29+ messages in thread
From: Aleksandar Markovic @ 2019-04-16 21:20 UTC (permalink / raw)
  To: Mateja Marjanovic, Aleksandar Markovic
  Cc: QEMU Developers, Aleksandar Rikalo, Richard Henderson,
	Philippe Mathieu-Daudé,
	Aurelien Jarno

> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> >>
> >> +void helper_msa_ilvr_b(CPUMIPSState *env, uint32_t wd,
> >> +                       uint32_t ws, uint32_t wt)
> >> +{
> >> +    wr_t *pwd = &(env->active_fpu.fpr[wd].wr);
> >> +    wr_t *pws = &(env->active_fpu.fpr[ws].wr);
> >> +    wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
> >> +
> > Why do we use here env->active_fpu.fpr[wd].wr, while for other instructions in
> > this patch, we access msa_wr_d<b|h|w|d[] directly?
> With a pointer to wr_t we have an array of bytes, halfwords, words or
> doublewords
> and can read from them and change them like an ordinary array. In other
> cases
> we use a variable that is TCGv_i64 and would have to use tcg_gen
> functions to
> modify the value of the register. Before my changes in ilvr instruction
> helpers
> env->active_fpu.fpr[wd].wr was used, so I just copy-pasted that.
>

Your answer only scratches the surface and doesn't fully answer my question.
I would like you to show a deeper understanding of the code you are working
with. You can't just copy/paste without thinking.

Why do the majority of MSA helpers use env->active_fpu.fpr[<index>].wr, while
your code mostly references the MSA registers directly? Is this the same
thing? If yes, why doesn't all MSA code use the registers directly, which
would certainly be simpler than referencing active_fpu? What is the role
of "active_fpu"? Can it be changed? Can you analyze the underlying
reasons for referencing "active_fpu", and can you claim that it is safe
to circumvent it and reference the MSA registers directly?

Thanks,
Aleksandar
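
For readers following the quoted explanation about wr_t: the sketch
below is a stand-alone C model of the "array of elements" view that a
helper works with. The union wr_model_t and the function
ilvev_w_model() are hypothetical names, not QEMU definitions. The loop
body follows the ILVEV macro removed in patch 2/4. It does not address
the open question about the role of active_fpu.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for wr_t: one 128-bit MSA register that can
 * be indexed as bytes, halfwords, words or doublewords.
 */
typedef union {
    int8_t  b[16];
    int16_t h[8];
    int32_t w[4];
    int64_t d[2];
} wr_model_t;

/*
 * Helper-style ILVEV.W on word elements:
 *   wx.w[2 * i]     = wt.w[2 * i]
 *   wx.w[2 * i + 1] = ws.w[2 * i]
 * The result is built in a temporary so that pwd may alias pws or pwt.
 */
void ilvev_w_model(wr_model_t *pwd, const wr_model_t *pws,
                   const wr_model_t *pwt)
{
    wr_model_t wx;

    assert(pwd && pws && pwt);
    for (int i = 0; i < 2; i++) {
        wx.w[2 * i]     = pwt->w[2 * i];
        wx.w[2 * i + 1] = pws->w[2 * i];
    }
    *pwd = wx;
}
```

In a helper the elements are ordinary C array members, so the
interleave is a plain loop. The translate.c versions in this series
instead see each register as two TCGv_i64 values and have to express
the same result with tcg_gen_* operations.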

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> MSA instructions
@ 2019-04-17  8:16           ` Mateja Marjanovic
  0 siblings, 0 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-17  8:16 UTC (permalink / raw)
  To: Aleksandar Markovic, Aleksandar Markovic
  Cc: QEMU Developers, Aleksandar Rikalo, Richard Henderson,
	Philippe Mathieu-Daudé,
	Aurelien Jarno


On 16.4.19. 23:20, Aleksandar Markovic wrote:
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>>> +void helper_msa_ilvr_b(CPUMIPSState *env, uint32_t wd,
>>>> +                       uint32_t ws, uint32_t wt)
>>>> +{
>>>> +    wr_t *pwd = &(env->active_fpu.fpr[wd].wr);
>>>> +    wr_t *pws = &(env->active_fpu.fpr[ws].wr);
>>>> +    wr_t *pwt = &(env->active_fpu.fpr[wt].wr);
>>>> +
>>> Why do we use here env->active_fpu.fpr[wd].wr, while for other instructions in
>>> this patch, we access msa_wr_d<b|h|w|d[] directly?
>> With a pointer to wr_t we have an array of bytes, halfwords, words or
>> doublewords
>> and can read from them and change them like an ordinary array. In other
>> cases
>> we use a variable that is TCGv_i64 and would have to use tcg_gen
>> functions to
>> modify the value of the register. Before my changes in ilvr instruction
>> helpers
>> env->active_fpu.fpr[wd].wr was used, so I just copy-pasted that.
>>
> Your answer only scratches the surface and doesn't fully answer my question.
> I would like you to show a deeper understanding of the code you are working
> with. You can't just copy/paste without thinking.
>
> Why do the majority of MSA helpers use env->active_fpu.fpr[<index>].wr, while
> your code mostly references the MSA registers directly? Is this the same
> thing? If yes, why doesn't all MSA code use the registers directly, which
> would certainly be simpler than referencing active_fpu? What is the role
> of "active_fpu"? Can it be changed? Can you analyze the underlying
> reasons for referencing "active_fpu", and can you claim that it is safe
> to circumvent it and reference the MSA registers directly?
I will look into that, and try to analyze it and understand it.
Thanks,
Mateja
> Thanks,
> Aleksandar

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
@ 2019-04-17 12:45       ` Mateja Marjanovic
  0 siblings, 0 replies; 29+ messages in thread
From: Mateja Marjanovic @ 2019-04-17 12:45 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: aurelien, richard.henderson, amarkovic, arikalo

Hello Philippe,
Sorry for replying to you so late.

On 4.4.19. 15:42, Philippe Mathieu-Daudé wrote:
> Hi Mateja,
>
> On 4/4/19 3:14 PM, Mateja Marjanovic wrote:
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>
>> Optimize set of MSA instructions ILVEV.<B|H|W|D>, using
>> directly tcg registers and performing logic on them
>> instead of using helpers.
>>
>> In the following table, the first column is the performance
>> before this patch. The second represents the performance,
>> after converting from helpers to tcg, but without using
>> tcg_gen_deposit function. The third one is the solution
>> which is implemented in this patch.
>>
>> Performance measurement is done by executing the
>> instructions a large number of times on a computer
>> with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>
>> ============================================================
>> || instr    ||   before    || no-deposit ||  with-deposit ||
>> ============================================================
>> || ilvev.b  ||  126.92 ms  ||  24.52 ms  ||   24.43 ms    ||
>> || ilvev.h  ||   93.67 ms  ||  23.92 ms  ||   23.86 ms    ||
> I'm quite surprised there is not a single change here since your v5, are
> you sure you used the correct result? I was expecting a slight improvement.
There is a slight improvement when using tcg constants instead of
int constant variables, but by mistake I didn't update the performance
table. It will be clearer in v7.
>
>> || ilvev.w  ||  117.86 ms  ||  23.83 ms  ||   22.17 ms    ||
>> || ilvev.d  ||   45.49 ms  ||  19.74 ms  ||   19.71 ms    ||
>> ============================================================
>>
>> No-deposit column and with-deposit column have the
>> same statistical values in every row, except ILVEV.W,
>> which is the only function which uses the deposit
>> function.
>>
>> No-deposit version of the ILVEV.W implementation:
>>
>> static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>                                 uint32_t ws, uint32_t wt)
>> {
>>      TCGv_i64 t1 = tcg_temp_new_i64();
>>      TCGv_i64 t2 = tcg_temp_new_i64();
>>      uint64_t mask = 0x00000000ffffffffULL;
>>
>>      tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>      tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>      tcg_gen_shli_i64(t2, t2, 32);
>>      tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>
>>      tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>      tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>      tcg_gen_shli_i64(t2, t2, 32);
>>      tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>
>>      tcg_temp_free_i64(t1);
>>      tcg_temp_free_i64(t2);
>> }
>>
>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>> ---
>>   target/mips/helper.h     |   1 -
>>   target/mips/msa_helper.c |   9 -----
>>   target/mips/translate.c  | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 100 insertions(+), 11 deletions(-)
>>
>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>> index 02e16c7..82f6a40 100644
>> --- a/target/mips/helper.h
>> +++ b/target/mips/helper.h
>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>   DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>> index a7ea6aa..d5c3842 100644
>> --- a/target/mips/msa_helper.c
>> +++ b/target/mips/msa_helper.c
>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>       } while (0)
>>   MSA_FN_DF(ilvr_df)
>>   #undef MSA_DO
>> -
>> -#define MSA_DO(DF)                      \
>> -    do {                                \
>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>> -    } while (0)
>> -MSA_FN_DF(ilvev_df)
>> -#undef MSA_DO
>> -
>>   #undef MSA_LOOP_COND
>>   
>>   #define MSA_LOOP_COND(DF) \
>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>> index df685e4..3057669 100644
>> --- a/target/mips/translate.c
>> +++ b/target/mips/translate.c
>> @@ -28973,6 +28973,90 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>       tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>   }
>>   
>> +/*
>> + * [MSA] ILVEV.B wd, ws, wt
>> + *
>> + *   Vector Interleave Even (byte data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.H wd, ws, wt
>> + *
>> + *   Vector Interleave Even (halfword data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    TCGv_i64 mask = tcg_const_i64(0x0000ffff0000ffffULL);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(mask);
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
>> +}
> Apparently you missed my comment about refactoring using mask/shift as
> arguments:
>
> static inline void gen_ilvev_hb(CPUMIPSState *env, uint32_t wd,
>                                  uint32_t ws, uint32_t wt,
>                                  int64_t mask, int64_t shift)
> {
>      TCGv_i64 t1 = tcg_temp_new_i64();
>      TCGv_i64 t2 = tcg_temp_new_i64();
>      TCGv_i64 tm = tcg_const_i64(mask);
>
>      tcg_gen_and_i64(t1, msa_wr_d[wt * 2], tm);
>      tcg_gen_and_i64(t2, msa_wr_d[ws * 2], tm);
>      tcg_gen_shli_i64(t2, t2, shift);
>      tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>
>      tcg_gen_and_i64(t1, msa_wr_d[wt * 2 + 1], tm);
>      tcg_gen_and_i64(t2, msa_wr_d[ws * 2 + 1], tm);
>      tcg_gen_shli_i64(t2, t2, shift);
>      tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>
>      tcg_temp_free_i64(tm);
>      tcg_temp_free_i64(t1);
>      tcg_temp_free_i64(t2);
> }
>
> static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>                                 uint32_t ws, uint32_t wt)
> {
>      gen_ilvev_hb(env, wd, ws, wt, 0x00ff00ff00ff00ffLL, 8);
> }
>
> static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>                                 uint32_t ws, uint32_t wt)
> {
>      gen_ilvev_hb(env, wd, ws, wt, 0x0000ffff0000ffffLL, 16);
> }
Yes, I did miss it. I will do what you suggested in v7.
It represents a nice code reorganisation.
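
As a quick check that one mask/shift pair really covers both the byte
and the halfword case, here is a plain C model of what the suggested
gen_ilvev_hb() computes on a single 64-bit half of a register. It is a
sketch only: ilvev_lane(), ilvev_b_lane() and ilvev_h_lane() are
hypothetical names operating on host integers, not TCG values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One 64-bit lane of ILVEV for element sizes smaller than the lane:
 * keep the even elements of wt in place and move the even elements
 * of ws up by one element width into the odd positions.
 */
uint64_t ilvev_lane(uint64_t ws, uint64_t wt, uint64_t mask,
                    unsigned shift)
{
    assert(shift < 64);
    return (wt & mask) | ((ws & mask) << shift);
}

/* Byte elements: the mask selects bytes 0, 2, 4, 6; shift by 8 bits. */
uint64_t ilvev_b_lane(uint64_t ws, uint64_t wt)
{
    return ilvev_lane(ws, wt, 0x00ff00ff00ff00ffULL, 8);
}

/* Halfword elements: the mask selects halfwords 0, 2; shift by 16 bits. */
uint64_t ilvev_h_lane(uint64_t ws, uint64_t wt)
{
    return ilvev_lane(ws, wt, 0x0000ffff0000ffffULL, 16);
}
```

For example, with ws = 0xaabbccddeeff0099 and wt = 0x1122334455667788,
the byte variant gives 0xbb22dd44ff669988 and the halfword variant
gives 0xccdd334400997788.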
>
>
>> +
>> +/*
>> + * [MSA] ILVEV.W wd, ws, wt
>> + *
>> + *   Vector Interleave Even (word data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
>> +                        msa_wr_d[ws * 2], 32, 32);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[wt * 2 + 1],
>> +                        msa_wr_d[ws * 2 + 1], 32, 32);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.D wd, ws, wt
>> + *
>> + *   Vector Interleave Even (Doubleword data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>> +}
>> +
>>   static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>   {
>>   #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>> @@ -29129,7 +29213,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>           gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>           break;
>>       case OPC_ILVEV_df:
>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>> +        switch (df) {
>> +        case DF_BYTE:
>> +            gen_ilvev_b(env, wd, ws, wt);
>> +            break;
>> +        case DF_HALF:
>> +            gen_ilvev_h(env, wd, ws, wt);
>> +            break;
>> +        case DF_WORD:
>> +            gen_ilvev_w(env, wd, ws, wt);
>> +            break;
>> +        case DF_DOUBLE:
>> +            gen_ilvev_d(env, wd, ws, wt);
>> +            break;
>> +        default:
>> +            assert(0);
>> +        }
>>           break;
>>       case OPC_BINSR_df:
>>           gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
Thank you for looking so deeply into my code.
Regards,
Mateja
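
For reference, the no-deposit and with-deposit versions of ILVEV.W
discussed above should produce the same value. The sketch below models
this on host integers. deposit64_model() is a hypothetical helper
written for this note. It assumes the usual deposit semantics: the
result is the first operand with `len` bits starting at bit `ofs`
replaced by the low `len` bits of the second operand.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a 64-bit deposit: return `base` with bits
 * [ofs, ofs + len) replaced by the low `len` bits of `field`.
 */
uint64_t deposit64_model(uint64_t base, uint64_t field,
                         unsigned ofs, unsigned len)
{
    assert(len >= 1 && len <= 64 && ofs <= 64 - len);
    uint64_t mask = (~0ULL >> (64 - len)) << ofs;
    return (base & ~mask) | ((field << ofs) & mask);
}

/* With-deposit variant: low word from wt, low word of ws on top. */
uint64_t ilvev_w_lane_deposit(uint64_t ws, uint64_t wt)
{
    return deposit64_model(wt, ws, 32, 32);
}

/* No-deposit variant: two masks, one shift and one OR. */
uint64_t ilvev_w_lane_mask(uint64_t ws, uint64_t wt)
{
    const uint64_t mask = 0x00000000ffffffffULL;
    return (wt & mask) | ((ws & mask) << 32);
}
```

Both functions return 0xeeff009955667788 for ws = 0xaabbccddeeff0099
and wt = 0x1122334455667788. The deposit form needs no temporaries,
and the commit message reports it as slightly faster for ILVEV.W.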

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-04-17 12:47 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-04 13:14 [Qemu-devel] [PATCH v6 0/4] target/mips: Optimize MSA interleave instructions Mateja Marjanovic
2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 1/4] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
2019-04-04 13:47   ` Philippe Mathieu-Daudé
2019-04-13 16:09   ` Aleksandar Markovic
2019-04-13 16:09     ` Aleksandar Markovic
2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 2/4] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
2019-04-04 13:42   ` Philippe Mathieu-Daudé
2019-04-04 18:19     ` Aleksandar Markovic
2019-04-04 19:17       ` Philippe Mathieu-Daudé
2019-04-05  0:26         ` Aleksandar Markovic
2019-04-05  0:26           ` Aleksandar Markovic
2019-04-17 12:45     ` Mateja Marjanovic
2019-04-17 12:45       ` Mateja Marjanovic
2019-04-13 16:05   ` Aleksandar Markovic
2019-04-13 16:05     ` Aleksandar Markovic
2019-04-15 13:48     ` Mateja Marjanovic
2019-04-15 13:48       ` Mateja Marjanovic
2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 3/4] target/mips: Optimize ILVL.<B|H|W|D> " Mateja Marjanovic
2019-04-13 16:15   ` Aleksandar Markovic
2019-04-13 16:15     ` Aleksandar Markovic
2019-04-04 13:14 ` [Qemu-devel] [PATCH v6 4/4] target/mips: Optimize ILVR.<B|H|W|D> " Mateja Marjanovic
2019-04-13 16:05   ` Aleksandar Markovic
2019-04-13 16:05     ` Aleksandar Markovic
2019-04-15 11:24     ` Mateja Marjanovic
2019-04-15 11:24       ` Mateja Marjanovic
2019-04-16 21:20       ` Aleksandar Markovic
2019-04-16 21:20         ` Aleksandar Markovic
2019-04-17  8:16         ` Mateja Marjanovic
2019-04-17  8:16           ` Mateja Marjanovic
