[Qemu-devel] [PATCH v5 0/2] target/mips: Optimize MSA <ILVEV|ILVOD>.<B|H|W|D> instructions

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH v5 0/2] target/mips: Optimize MSA <ILVEV|ILVOD>.<B|H|W|D> instructions
@ 2019-04-02 15:15 Mateja Marjanovic
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 1/2] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
  0 siblings, 2 replies; 19+ messages in thread
From: Mateja Marjanovic @ 2019-04-02 15:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize MSA instructions ILVEV.<B|H|W|D> and ILVOD.<B|H|W|D>,
using directly tcg registers and performing logic on
them insted of using helpers.

v5:
 - Use tcg_gen_deposit function.
 - Added performance number for no-deposit and
   with-deposit cases of ILVEV.W.
 - Minor changes in commit messages and cover letter.

v4:
 - Clean up typing errors.
 - Change the commit message and the cover letter.
 - Fix bug for ILVEV.D, in case where the destination
   and one of the sources are the same register.

v3:
 - Reduce the number of logic operations to a
   minimum.
 - Add comments.

v2:
 - Minor changes in commit messages and cover letter.

Mateja Marjanovic (2):
  target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions
  target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions

 target/mips/helper.h     |   2 -
 target/mips/msa_helper.c |  16 ----
 target/mips/translate.c  | 212 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 210 insertions(+), 20 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PATCH v5 1/2] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions
  2019-04-02 15:15 [Qemu-devel] [PATCH v5 0/2] target/mips: Optimize MSA <ILVEV|ILVOD>.<B|H|W|D> instructions Mateja Marjanovic
@ 2019-04-02 15:15 ` Mateja Marjanovic
  2019-04-03  7:46   ` Richard Henderson
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
  1 sibling, 1 reply; 19+ messages in thread
From: Mateja Marjanovic @ 2019-04-02 15:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize set of MSA instructions ILVOD, using directly
tcg registers and performing logic on them instead of
using helpers.

 instr    ||   before    ||   after
======================================
 ilvod.b  ||  117.50 ms  ||  24.99 ms
 ilvod.h  ||   93.16 ms  ||  24.27 ms
 ilvod.w  ||  119.90 ms  ||  24.81 ms
 ilvod.d  ||   43.01 ms  ||  20.69 ms

Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
---
 target/mips/helper.h     |   1 -
 target/mips/msa_helper.c |   7 ----
 target/mips/translate.c  | 107 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 106 insertions(+), 9 deletions(-)

diff --git a/target/mips/helper.h b/target/mips/helper.h
index 2863f60..02e16c7 100644
--- a/target/mips/helper.h
+++ b/target/mips/helper.h
@@ -865,7 +865,6 @@ DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
-DEF_HELPER_5(msa_ilvod_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
index 6c57281..a7ea6aa 100644
--- a/target/mips/msa_helper.c
+++ b/target/mips/msa_helper.c
@@ -1206,13 +1206,6 @@ MSA_FN_DF(ilvr_df)
 MSA_FN_DF(ilvev_df)
 #undef MSA_DO
 
-#define MSA_DO(DF)                          \
-    do {                                    \
-        pwx->DF[2*i]   = pwt->DF[2*i+1];    \
-        pwx->DF[2*i+1] = pws->DF[2*i+1];    \
-    } while (0)
-MSA_FN_DF(ilvod_df)
-#undef MSA_DO
 #undef MSA_LOOP_COND
 
 #define MSA_LOOP_COND(DF) \
diff --git a/target/mips/translate.c b/target/mips/translate.c
index bba8b6c..04406d6 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -28884,6 +28884,96 @@ static void gen_msa_bit(CPUMIPSState *env, DisasContext *ctx)
     tcg_temp_free_i32(tws);
 }
 
+/*
+ * [MSA] ILVOD.B wd, ws, wt
+ *
+ *   Vector Interleave Odd (byte data elements)
+ *
+ */
+static inline void gen_ilvod_b(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    const uint64_t mask = 0xff00ff00ff00ff00ULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 8);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 8);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVOD.H wd, ws, wt
+ *
+ *   Vector Interleave Odd (halfword data elements)
+ *
+ */
+static inline void gen_ilvod_h(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    const uint64_t mask = 0xffff0000ffff0000ULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 16);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVOD.W wd, ws, wt
+ *
+ *   Vector Interleave Odd (word data elements)
+ *
+ */
+static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    const uint64_t mask = 0xffffffff00000000ULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_shri_i64(t1, t1, 32);
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[ws * 2], t1, 0, 32);
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_shri_i64(t1, t1, 32);
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1], t1, 0, 32);
+
+    tcg_temp_free_i64(t1);
+}
+
+/*
+ * [MSA] ILVOD.D wd, ws, wt
+ *
+ *   Vector Interleave Odd (double data elements)
+ *
+ */
+static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2 + 1]);
+    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
+}
+
 static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
 {
 #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
@@ -29055,7 +29145,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
         gen_helper_msa_mod_u_df(cpu_env, tdf, twd, tws, twt);
         break;
     case OPC_ILVOD_df:
-        gen_helper_msa_ilvod_df(cpu_env, tdf, twd, tws, twt);
+        switch (df) {
+        case DF_BYTE:
+            gen_ilvod_b(env, wd, ws, wt);
+            break;
+        case DF_HALF:
+            gen_ilvod_h(env, wd, ws, wt);
+            break;
+        case DF_WORD:
+            gen_ilvod_w(env, wd, ws, wt);
+            break;
+        case DF_DOUBLE:
+            gen_ilvod_d(env, wd, ws, wt);
+            break;
+        default:
+            assert(0);
+        }
         break;
 
     case OPC_DOTP_S_df:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 15:15 [Qemu-devel] [PATCH v5 0/2] target/mips: Optimize MSA <ILVEV|ILVOD>.<B|H|W|D> instructions Mateja Marjanovic
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 1/2] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
@ 2019-04-02 15:15 ` Mateja Marjanovic
  2019-04-02 16:19   ` Philippe Mathieu-Daudé
                     ` (4 more replies)
  1 sibling, 5 replies; 19+ messages in thread
From: Mateja Marjanovic @ 2019-04-02 15:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: aurelien, richard.henderson, amarkovic, arikalo

From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>

Optimize set of MSA instructions ILVEV, using directly
tcg registers and performing logic on them instead of
using helpers.

In the following table, the first column is the performance
before this patch. The second represents the performance,
after converting from helpers to tcg, but without using
tcg_gen_deposit function. The third one is the solution
which is implemented in this patch.

 instr    ||   before    || no-deposit ||  with-deposit
========================================================
 ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
 ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
 ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
 ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms

The solution with deposit is suggested by Richard Henderson.

Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
---
 target/mips/helper.h     |   1 -
 target/mips/msa_helper.c |   9 ----
 target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 104 insertions(+), 11 deletions(-)

diff --git a/target/mips/helper.h b/target/mips/helper.h
index 02e16c7..82f6a40 100644
--- a/target/mips/helper.h
+++ b/target/mips/helper.h
@@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
-DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
 DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
index a7ea6aa..d5c3842 100644
--- a/target/mips/msa_helper.c
+++ b/target/mips/msa_helper.c
@@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
     } while (0)
 MSA_FN_DF(ilvr_df)
 #undef MSA_DO
-
-#define MSA_DO(DF)                      \
-    do {                                \
-        pwx->DF[2*i]   = pwt->DF[2*i];  \
-        pwx->DF[2*i+1] = pws->DF[2*i];  \
-    } while (0)
-MSA_FN_DF(ilvev_df)
-#undef MSA_DO
-
 #undef MSA_LOOP_COND
 
 #define MSA_LOOP_COND(DF) \
diff --git a/target/mips/translate.c b/target/mips/translate.c
index 04406d6..e26c6a6 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
     tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
 }
 
+/*
+ * [MSA] ILVEV.B wd, ws, wt
+ *
+ *   Vector Interleave Even (byte data elements)
+ *
+ */
+static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    const uint64_t mask = 0x00ff00ff00ff00ffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_shli_i64(t2, t2, 8);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t2, t2, 8);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVEV.H wd, ws, wt
+ *
+ *   Vector Interleave Even (halfword data elements)
+ *
+ */
+static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 t2 = tcg_temp_new_i64();
+    const uint64_t mask = 0x0000ffff0000ffffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
+    tcg_gen_shli_i64(t2, t2, 16);
+    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
+    tcg_gen_shli_i64(t2, t2, 16);
+    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
+
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
+}
+
+/*
+ * [MSA] ILVEV.W wd, ws, wt
+ *
+ *   Vector Interleave Even (word data elements)
+ *
+ */
+static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    const uint64_t mask = 0x00000000ffffffffULL;
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
+
+    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
+    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
+
+    tcg_temp_free_i64(t1);
+}
+
+/*
+ * [MSA] ILVEV.D wd, ws, wt
+ *
+ *   Vector Interleave Even (Double data elements)
+ *
+ */
+static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
+                               uint32_t ws, uint32_t wt)
+{
+    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
+    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
+}
+
 static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
 {
 #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
@@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
         gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
         break;
     case OPC_ILVEV_df:
-        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
+        switch (df) {
+        case DF_BYTE:
+            gen_ilvev_b(env, wd, ws, wt);
+            break;
+        case DF_HALF:
+            gen_ilvev_h(env, wd, ws, wt);
+            break;
+        case DF_WORD:
+            gen_ilvev_w(env, wd, ws, wt);
+            break;
+        case DF_DOUBLE:
+            gen_ilvev_d(env, wd, ws, wt);
+            break;
+        default:
+            assert(0);
+        }
         break;
     case OPC_BINSR_df:
         gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
@ 2019-04-02 16:19   ` Philippe Mathieu-Daudé
  2019-04-02 17:07     ` Aleksandar Markovic
  2019-04-03  7:49     ` Richard Henderson
  2019-04-02 18:51   ` Aleksandar Markovic
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-04-02 16:19 UTC (permalink / raw)
  To: Mateja Marjanovic, qemu-devel
  Cc: arikalo, richard.henderson, amarkovic, aurelien

Hi Mateja,

On 4/2/19 5:15 PM, Mateja Marjanovic wrote:
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> 
> Optimize set of MSA instructions ILVEV, using directly
> tcg registers and performing logic on them instead of
> using helpers.

Maybe you can still let this previous comment (if still valid):

  Performance measurement is done by executing the
  instructions large number of times on a computer
  with Intel Core i7-3770 CPU @ 3.40GHz×8.

> 
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.

You are describing the "no-deposit" which refers to a previous series
but won't be accessible in the git repository.

I think this table is useful in the cover of this series, and in the
commit message you should use || before || after || and drop the
"no-deposit" case.

> 
>  instr    ||   before    || no-deposit ||  with-deposit
> ========================================================
>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
> 
> The solution with deposit is suggested by Richard Henderson.
> 

The gitdm parsable form is:

Suggested-by: Richard Henderson <richard.henderson@linaro.org>

> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   9 ----
>  target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 104 insertions(+), 11 deletions(-)
> 
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 02e16c7..82f6a40 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index a7ea6aa..d5c3842 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>      } while (0)
>  MSA_FN_DF(ilvr_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> -    } while (0)
> -MSA_FN_DF(ilvev_df)
> -#undef MSA_DO
> -
>  #undef MSA_LOOP_COND
>  
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index 04406d6..e26c6a6 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>  }
>  
> +/*
> + * [MSA] ILVEV.B wd, ws, wt
> + *
> + *   Vector Interleave Even (byte data elements)
> + *
> + */
> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +

Richard, is it cheaper to use another register to keep the constant mask
(here reused 4x)?

Such:

       TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);

       tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
       tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
       tcg_gen_shli_i64(t2, t2, 8);
       tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);

> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);

Here use tcg_gen_and_i64() too.

> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +

       tcg_temp_free_i64(mask);

> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);

Mateja: Can you test for perf easily?

> +}
> +
> +/*
> + * [MSA] ILVEV.H wd, ws, wt
> + *
> + *   Vector Interleave Even (halfword data elements)
> + *
> + */
> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x0000ffff0000ffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);

Very similiar to gen_ilvev_b(), changing the mask and shift. Maybe worth
a refactor.

> +}
> +
> +/*
> + * [MSA] ILVEV.W wd, ws, wt
> + *
> + *   Vector Interleave Even (word data elements)
> + *
> + */
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00000000ffffffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
> +
> +    tcg_temp_free_i64(t1);
> +}
> +
> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Double data elements)
> + *
> + */
> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVEV_df:
> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvev_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvev_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvev_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvev_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSR_df:
>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> 

Thanks,

Phil.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 16:19   ` Philippe Mathieu-Daudé
@ 2019-04-02 17:07     ` Aleksandar Markovic
  2019-04-02 18:37       ` Philippe Mathieu-Daudé
  2019-04-03  7:49     ` Richard Henderson
  1 sibling, 1 reply; 19+ messages in thread
From: Aleksandar Markovic @ 2019-04-02 17:07 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Mateja Marjanovic, qemu-devel
  Cc: Aleksandar Rikalo, richard.henderson, aurelien

> From: Philippe Mathieu-Daudé <philmd@redhat.com>
> Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
>
> Hi Mateja,
>
> On 4/2/19 5:15 PM, Mateja Marjanovic wrote:
> > From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> >
> > Optimize set of MSA instructions ILVEV, using directly
> > tcg registers and performing logic on them instead of
> > using helpers.
>
> Maybe you can still let this previous comment (if still valid):
>
>   Performance measurement is done by executing the
>   instructions large number of times on a computer
>   with Intel Core i7-3770 CPU @ 3.40GHz×8.
>

Agreed.

> >
> > In the following table, the first column is the performance
> > before this patch. The second represents the performance,
> > after converting from helpers to tcg, but without using
> > tcg_gen_deposit function. The third one is the solution
> > which is implemented in this patch.
>
> You are describing the "no-deposit" which refers to a previous series
> but won't be accessible in the git repository.
>
> I think this table is useful in the cover of this series, and in the
> commit message you should use || before || after || and drop the
> "no-deposit" case.
>
> >
> >  instr    ||   before    || no-deposit ||  with-deposit
> > ========================================================
> >  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
> >  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
> >  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
> >  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
> >
> > The solution with deposit is suggested by Richard Henderson.
> >
>

I think the table should remain in the commit message, to keep it
visible in the git logs.

You could insert the "no-deposit" source code of gen_ilvev_w()
in the commit message, for reference reasons - it is not a too
large function.

Thanks,
Aleksandar

> The gitdm parsable form is:
>
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>
> > Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> > ---
> >  target/mips/helper.h     |   1 -
> >  target/mips/msa_helper.c |   9 ----
> >  target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 104 insertions(+), 11 deletions(-)
> >
> > diff --git a/target/mips/helper.h b/target/mips/helper.h
> > index 02e16c7..82f6a40 100644
> > --- a/target/mips/helper.h
> > +++ b/target/mips/helper.h
> > @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> > -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> > diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> > index a7ea6aa..d5c3842 100644
> > --- a/target/mips/msa_helper.c
> > +++ b/target/mips/msa_helper.c
> > @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
> >      } while (0)
> >  MSA_FN_DF(ilvr_df)
> >  #undef MSA_DO
> > -
> > -#define MSA_DO(DF)                      \
> > -    do {                                \
> > -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> > -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> > -    } while (0)
> > -MSA_FN_DF(ilvev_df)
> > -#undef MSA_DO
> > -
> >  #undef MSA_LOOP_COND
> >
> >  #define MSA_LOOP_COND(DF) \
> > diff --git a/target/mips/translate.c b/target/mips/translate.c
> > index 04406d6..e26c6a6 100644
> > --- a/target/mips/translate.c
> > +++ b/target/mips/translate.c
> > @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
> >      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> >  }
> >
> > +/*
> > + * [MSA] ILVEV.B wd, ws, wt
> > + *
> > + *   Vector Interleave Even (byte data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    TCGv_i64 t2 = tcg_temp_new_i64();
> > +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> > +    tcg_gen_shli_i64(t2, t2, 8);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> > +
>
> Richard, is it cheaper to use another register to keep the constant mask
> (here reused 4x)?
>
> Such:
>
>        TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>
>        tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>        tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>        tcg_gen_shli_i64(t2, t2, 8);
>        tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>
> Here use tcg_gen_and_i64() too.
>
> > +    tcg_gen_shli_i64(t2, t2, 8);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> > +
>
>        tcg_temp_free_i64(mask);
>
> > +    tcg_temp_free_i64(t1);
> > +    tcg_temp_free_i64(t2);
>
> Mateja: Can you test for perf easily?
>
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.H wd, ws, wt
> > + *
> > + *   Vector Interleave Even (halfword data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    TCGv_i64 t2 = tcg_temp_new_i64();
> > +    const uint64_t mask = 0x0000ffff0000ffffULL;
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> > +    tcg_gen_shli_i64(t2, t2, 16);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> > +    tcg_gen_shli_i64(t2, t2, 16);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> > +
> > +    tcg_temp_free_i64(t1);
> > +    tcg_temp_free_i64(t2);
>
> Very similiar to gen_ilvev_b(), changing the mask and shift. Maybe worth
> a refactor.
>
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.W wd, ws, wt
> > + *
> > + *   Vector Interleave Even (word data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    const uint64_t mask = 0x00000000ffffffffULL;
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
> > +
> > +    tcg_temp_free_i64(t1);
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.D wd, ws, wt
> > + *
> > + *   Vector Interleave Even (Double data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> > +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> > +}
> > +
> >  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >  {
> >  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> > @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
> >          break;
> >      case OPC_ILVEV_df:
> > -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> > +        switch (df) {
> > +        case DF_BYTE:
> > +            gen_ilvev_b(env, wd, ws, wt);
> > +            break;
> > +        case DF_HALF:
> > +            gen_ilvev_h(env, wd, ws, wt);
> > +            break;
> > +        case DF_WORD:
> > +            gen_ilvev_w(env, wd, ws, wt);
> > +            break;
> > +        case DF_DOUBLE:
> > +            gen_ilvev_d(env, wd, ws, wt);
> > +            break;
> > +        default:
> > +            assert(0);
> > +        }
> >          break;
> >      case OPC_BINSR_df:
> >          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> >
>
> Thanks,
>
> Phil.
>

________________________________________
From: Philippe Mathieu-Daudé <philmd@redhat.com>
Sent: Tuesday, April 2, 2019 6:19:19 PM
To: Mateja Marjanovic; qemu-devel@nongnu.org
Cc: Aleksandar Rikalo; richard.henderson@linaro.org; Aleksandar Markovic; aurelien@aurel32.net
Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions

Hi Mateja,

On 4/2/19 5:15 PM, Mateja Marjanovic wrote:
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimize set of MSA instructions ILVEV, using directly
> tcg registers and performing logic on them instead of
> using helpers.

Maybe you can still let this previous comment (if still valid):

  Performance measurement is done by executing the
  instructions large number of times on a computer
  with Intel Core i7-3770 CPU @ 3.40GHz×8.

>
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.

You are describing the "no-deposit" which refers to a previous series
but won't be accessible in the git repository.

I think this table is useful in the cover of this series, and in the
commit message you should use || before || after || and drop the
"no-deposit" case.

>
>  instr    ||   before    || no-deposit ||  with-deposit
> ========================================================
>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>
> The solution with deposit is suggested by Richard Henderson.
>

The gitdm parsable form is:

Suggested-by: Richard Henderson <richard.henderson@linaro.org>

> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   9 ----
>  target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 104 insertions(+), 11 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 02e16c7..82f6a40 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index a7ea6aa..d5c3842 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>      } while (0)
>  MSA_FN_DF(ilvr_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> -    } while (0)
> -MSA_FN_DF(ilvev_df)
> -#undef MSA_DO
> -
>  #undef MSA_LOOP_COND
>
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index 04406d6..e26c6a6 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>  }
>
> +/*
> + * [MSA] ILVEV.B wd, ws, wt
> + *
> + *   Vector Interleave Even (byte data elements)
> + *
> + */
> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +

Richard, is it cheaper to use another register to keep the constant mask
(here reused 4x)?

Such:

       TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);

       tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
       tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
       tcg_gen_shli_i64(t2, t2, 8);
       tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);

> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);

Here use tcg_gen_and_i64() too.

> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +

       tcg_temp_free_i64(mask);

> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);

Mateja: Can you test for perf easily?

> +}
> +
> +/*
> + * [MSA] ILVEV.H wd, ws, wt
> + *
> + *   Vector Interleave Even (halfword data elements)
> + *
> + */
> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x0000ffff0000ffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);

Very similiar to gen_ilvev_b(), changing the mask and shift. Maybe worth
a refactor.

> +}
> +
> +/*
> + * [MSA] ILVEV.W wd, ws, wt
> + *
> + *   Vector Interleave Even (word data elements)
> + *
> + */
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00000000ffffffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
> +
> +    tcg_temp_free_i64(t1);
> +}
> +
> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Double data elements)
> + *
> + */
> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVEV_df:
> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvev_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvev_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvev_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvev_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSR_df:
>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>

Thanks,

Phil.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 17:07     ` Aleksandar Markovic
@ 2019-04-02 18:37       ` Philippe Mathieu-Daudé
  2019-04-03  8:32         ` Mateja Marjanovic
  0 siblings, 1 reply; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-04-02 18:37 UTC (permalink / raw)
  To: Aleksandar Markovic, Mateja Marjanovic, qemu-devel
  Cc: Aleksandar Rikalo, richard.henderson, aurelien

On 4/2/19 7:07 PM, Aleksandar Markovic wrote:
>> From: Philippe Mathieu-Daudé <philmd@redhat.com>
>> Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
>>
>> Hi Mateja,
>>
>> On 4/2/19 5:15 PM, Mateja Marjanovic wrote:
>>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>>
>>> Optimize set of MSA instructions ILVEV, using directly
>>> tcg registers and performing logic on them instead of
>>> using helpers.
>>
>> Maybe you can still let this previous comment (if still valid):
>>
>>   Performance measurement is done by executing the
>>   instructions large number of times on a computer
>>   with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>
> 
> Agreed.
> 
>>>
>>> In the following table, the first column is the performance
>>> before this patch. The second represents the performance,
>>> after converting from helpers to tcg, but without using
>>> tcg_gen_deposit function. The third one is the solution
>>> which is implemented in this patch.
>>
>> You are describing the "no-deposit" which refers to a previous series
>> but won't be accessible in the git repository.
>>
>> I think this table is useful in the cover of this series, and in the
>> commit message you should use || before || after || and drop the
>> "no-deposit" case.
>>
>>>
>>>  instr    ||   before    || no-deposit ||  with-deposit
>>> ========================================================
>>>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>>>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>>>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>>>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>>>
>>> The solution with deposit is suggested by Richard Henderson.
>>>
>>
> 
> I think the table should remain in the commit message, to keep it
> visible in the git logs.
> 
> You could insert the "no-deposit" source code of gen_ilvev_w()
> in the commit message, for reference reasons - it is not a too
> large function.

Clever :)

> 
> Thanks,
> Aleksandar
> 
>> The gitdm parsable form is:
>>
>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>>
>>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>>> ---
>>>  target/mips/helper.h     |   1 -
>>>  target/mips/msa_helper.c |   9 ----
>>>  target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>  3 files changed, 104 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>>> index 02e16c7..82f6a40 100644
>>> --- a/target/mips/helper.h
>>> +++ b/target/mips/helper.h
>>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>>> index a7ea6aa..d5c3842 100644
>>> --- a/target/mips/msa_helper.c
>>> +++ b/target/mips/msa_helper.c
>>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>>      } while (0)
>>>  MSA_FN_DF(ilvr_df)
>>>  #undef MSA_DO
>>> -
>>> -#define MSA_DO(DF)                      \
>>> -    do {                                \
>>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>>> -    } while (0)
>>> -MSA_FN_DF(ilvev_df)
>>> -#undef MSA_DO
>>> -
>>>  #undef MSA_LOOP_COND
>>>
>>>  #define MSA_LOOP_COND(DF) \
>>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>>> index 04406d6..e26c6a6 100644
>>> --- a/target/mips/translate.c
>>> +++ b/target/mips/translate.c
>>> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>>  }
>>>
>>> +/*
>>> + * [MSA] ILVEV.B wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (byte data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>> +
>>
>> Richard, is it cheaper to use another register to keep the constant mask
>> (here reused 4x)?
>>
>> Such:
>>
>>        TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>>
>>        tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>>        tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>>        tcg_gen_shli_i64(t2, t2, 8);
>>        tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>
>> Here use tcg_gen_and_i64() too.
>>
>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>> +
>>
>>        tcg_temp_free_i64(mask);
>>
>>> +    tcg_temp_free_i64(t1);
>>> +    tcg_temp_free_i64(t2);
>>
>> Mateja: Can you test for perf easily?
>>
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.H wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (halfword data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>> +    const uint64_t mask = 0x0000ffff0000ffffULL;
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>> +
>>> +    tcg_temp_free_i64(t1);
>>> +    tcg_temp_free_i64(t2);
>>
>> Very similiar to gen_ilvev_b(), changing the mask and shift. Maybe worth
>> a refactor.
>>
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.W wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (word data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    const uint64_t mask = 0x00000000ffffffffULL;
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
>>> +
>>> +    tcg_temp_free_i64(t1);
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.D wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (Double data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>>> +}
>>> +
>>>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>  {
>>>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>>> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>>          break;
>>>      case OPC_ILVEV_df:
>>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>>> +        switch (df) {
>>> +        case DF_BYTE:
>>> +            gen_ilvev_b(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_HALF:
>>> +            gen_ilvev_h(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_WORD:
>>> +            gen_ilvev_w(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_DOUBLE:
>>> +            gen_ilvev_d(env, wd, ws, wt);
>>> +            break;
>>> +        default:
>>> +            assert(0);
>>> +        }
>>>          break;
>>>      case OPC_BINSR_df:
>>>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>>>
>>
>> Thanks,
>>
>> Phil.
>>
> 
> ________________________________________
> From: Philippe Mathieu-Daudé <philmd@redhat.com>
> Sent: Tuesday, April 2, 2019 6:19:19 PM
> To: Mateja Marjanovic; qemu-devel@nongnu.org
> Cc: Aleksandar Rikalo; richard.henderson@linaro.org; Aleksandar Markovic; aurelien@aurel32.net
> Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
> 
> Hi Mateja,
> 
> On 4/2/19 5:15 PM, Mateja Marjanovic wrote:
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>
>> Optimize set of MSA instructions ILVEV, using directly
>> tcg registers and performing logic on them instead of
>> using helpers.
> 
> Maybe you can still let this previous comment (if still valid):
> 
>   Performance measurement is done by executing the
>   instructions large number of times on a computer
>   with Intel Core i7-3770 CPU @ 3.40GHz×8.
> 
>>
>> In the following table, the first column is the performance
>> before this patch. The second represents the performance,
>> after converting from helpers to tcg, but without using
>> tcg_gen_deposit function. The third one is the solution
>> which is implemented in this patch.
> 
> You are describing the "no-deposit" which refers to a previous series
> but won't be accessible in the git repository.
> 
> I think this table is useful in the cover of this series, and in the
> commit message you should use || before || after || and drop the
> "no-deposit" case.
> 
>>
>>  instr    ||   before    || no-deposit ||  with-deposit
>> ========================================================
>>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>>
>> The solution with deposit is suggested by Richard Henderson.
>>
> 
> The gitdm parsable form is:
> 
> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
> 
>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>> ---
>>  target/mips/helper.h     |   1 -
>>  target/mips/msa_helper.c |   9 ----
>>  target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 104 insertions(+), 11 deletions(-)
>>
>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>> index 02e16c7..82f6a40 100644
>> --- a/target/mips/helper.h
>> +++ b/target/mips/helper.h
>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>> index a7ea6aa..d5c3842 100644
>> --- a/target/mips/msa_helper.c
>> +++ b/target/mips/msa_helper.c
>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>      } while (0)
>>  MSA_FN_DF(ilvr_df)
>>  #undef MSA_DO
>> -
>> -#define MSA_DO(DF)                      \
>> -    do {                                \
>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>> -    } while (0)
>> -MSA_FN_DF(ilvev_df)
>> -#undef MSA_DO
>> -
>>  #undef MSA_LOOP_COND
>>
>>  #define MSA_LOOP_COND(DF) \
>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>> index 04406d6..e26c6a6 100644
>> --- a/target/mips/translate.c
>> +++ b/target/mips/translate.c
>> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>  }
>>
>> +/*
>> + * [MSA] ILVEV.B wd, ws, wt
>> + *
>> + *   Vector Interleave Even (byte data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
> 
> Richard, is it cheaper to use another register to keep the constant mask
> (here reused 4x)?
> 
> Such:
> 
>        TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> 
>        tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>        tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>        tcg_gen_shli_i64(t2, t2, 8);
>        tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> 
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> 
> Here use tcg_gen_and_i64() too.
> 
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
> 
>        tcg_temp_free_i64(mask);
> 
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
> 
> Mateja: Can you test for perf easily?
> 
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.H wd, ws, wt
>> + *
>> + *   Vector Interleave Even (halfword data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    const uint64_t mask = 0x0000ffff0000ffffULL;
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> +    tcg_gen_shli_i64(t2, t2, 16);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>> +
>> +    tcg_temp_free_i64(t1);
>> +    tcg_temp_free_i64(t2);
> 
> Very similiar to gen_ilvev_b(), changing the mask and shift. Maybe worth
> a refactor.
> 
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.W wd, ws, wt
>> + *
>> + *   Vector Interleave Even (word data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    const uint64_t mask = 0x00000000ffffffffULL;
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
>> +
>> +    tcg_temp_free_i64(t1);
>> +}
>> +
>> +/*
>> + * [MSA] ILVEV.D wd, ws, wt
>> + *
>> + *   Vector Interleave Even (Double data elements)
>> + *
>> + */
>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>> +}
>> +
>>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>  {
>>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>          break;
>>      case OPC_ILVEV_df:
>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>> +        switch (df) {
>> +        case DF_BYTE:
>> +            gen_ilvev_b(env, wd, ws, wt);
>> +            break;
>> +        case DF_HALF:
>> +            gen_ilvev_h(env, wd, ws, wt);
>> +            break;
>> +        case DF_WORD:
>> +            gen_ilvev_w(env, wd, ws, wt);
>> +            break;
>> +        case DF_DOUBLE:
>> +            gen_ilvev_d(env, wd, ws, wt);
>> +            break;
>> +        default:
>> +            assert(0);
>> +        }
>>          break;
>>      case OPC_BINSR_df:
>>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>>
> 
> Thanks,
> 
> Phil.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
  2019-04-02 16:19   ` Philippe Mathieu-Daudé
@ 2019-04-02 18:51   ` Aleksandar Markovic
  2019-04-03  8:32     ` Mateja Marjanovic
  2019-04-02 22:28   ` Aleksandar Markovic
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Aleksandar Markovic @ 2019-04-02 18:51 UTC (permalink / raw)
  To: Mateja Marjanovic, qemu-devel
  Cc: aurelien, richard.henderson, Aleksandar Rikalo

> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Double data elements)
> + *
> + */

Double -> Doubleword

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
  2019-04-02 16:19   ` Philippe Mathieu-Daudé
  2019-04-02 18:51   ` Aleksandar Markovic
@ 2019-04-02 22:28   ` Aleksandar Markovic
  2019-04-03  7:53     ` Richard Henderson
  2019-04-02 23:25   ` Aleksandar Markovic
  2019-04-03  7:52   ` Richard Henderson
  4 siblings, 1 reply; 19+ messages in thread
From: Aleksandar Markovic @ 2019-04-02 22:28 UTC (permalink / raw)
  To: Mateja Marjanovic
  Cc: amarkovic, qemu-devel, richard.henderson, aurelien, arikalo

On Apr 2, 2019 5:20 PM, "Mateja Marjanovic" <mateja.marjanovic@rt-rk.com>
wrote:
>
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimize set of MSA instructions ILVEV, using directly
> tcg registers and performing logic on them instead of
> using helpers.
>
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.
>
>  instr    ||   before    || no-deposit ||  with-deposit
> ========================================================
>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>
> The solution with deposit is suggested by Richard Henderson.
>
> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---
>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   9 ----
>  target/mips/translate.c  | 105
++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 104 insertions(+), 11 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 02e16c7..82f6a40 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32,
i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index a7ea6aa..d5c3842 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>      } while (0)
>  MSA_FN_DF(ilvr_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> -    } while (0)
> -MSA_FN_DF(ilvev_df)
> -#undef MSA_DO
> -
>  #undef MSA_LOOP_COND
>
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index 04406d6..e26c6a6 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState
*env, uint32_t wd,
>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>  }
>
> +/*
> + * [MSA] ILVEV.B wd, ws, wt
> + *
> + *   Vector Interleave Even (byte data elements)
> + *
> + */
> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVEV.H wd, ws, wt
> + *
> + *   Vector Interleave Even (halfword data elements)
> + *
> + */
> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x0000ffff0000ffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVEV.W wd, ws, wt
> + *
> + *   Vector Interleave Even (word data elements)
> + *
> + */
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00000000ffffffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1],
32, 32);
> +
> +    tcg_temp_free_i64(t1);
> +}
> +

This can be further optimized this way: (doublecheck the accuracy, writing
from home)

gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
                                uint32_t ws, uint32_t wt)
{
    tcg_gen_shli_i64(msa_wr_d[wd * 2], msa_wr_d[ws * 2], 32);
    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[wt * 2], 0, 32);

tcg_gen_shli_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1], 32);
    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[wt * 2 + 1], 0,
32);

}

> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Double data elements)
> + *
> + */
> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env,
DisasContext *ctx)
>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVEV_df:
> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvev_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvev_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvev_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvev_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSR_df:
>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> --
> 2.7.4
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
                     ` (2 preceding siblings ...)
  2019-04-02 22:28   ` Aleksandar Markovic
@ 2019-04-02 23:25   ` Aleksandar Markovic
  2019-04-03  8:50     ` Mateja Marjanovic
  2019-04-03  9:49     ` Richard Henderson
  2019-04-03  7:52   ` Richard Henderson
  4 siblings, 2 replies; 19+ messages in thread
From: Aleksandar Markovic @ 2019-04-02 23:25 UTC (permalink / raw)
  To: Mateja Marjanovic
  Cc: amarkovic, qemu-devel, richard.henderson, aurelien, arikalo

On Apr 2, 2019 5:20 PM, "Mateja Marjanovic" <mateja.marjanovic@rt-rk.com>
wrote:
>
> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>
> Optimize set of MSA instructions ILVEV, using directly

Use full instruction names, with the only exception of possible Bachus-Naur
forms... again.

> tcg registers and performing logic on them instead of
> using helpers.
>
> In the following table, the first column is the performance
> before this patch. The second represents the performance,
> after converting from helpers to tcg, but without using
> tcg_gen_deposit function. The third one is the solution
> which is implemented in this patch.
>
>  instr    ||   before    || no-deposit ||  with-deposit
> ========================================================
>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>
> The solution with deposit is suggested by Richard Henderson.
>
> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
> ---

The byte and halfword cases of this patch most likely produce highly
unoptimized code for cases:

wd == wt == ws
wd == wt != ws
wd != ws == wt
wd == ws != wt

Please take these cases into account.

The same for patch 1/2.

Thanks,
Aleksandar

>  target/mips/helper.h     |   1 -
>  target/mips/msa_helper.c |   9 ----
>  target/mips/translate.c  | 105
++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 104 insertions(+), 11 deletions(-)
>
> diff --git a/target/mips/helper.h b/target/mips/helper.h
> index 02e16c7..82f6a40 100644
> --- a/target/mips/helper.h
> +++ b/target/mips/helper.h
> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32,
i32)
>  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> index a7ea6aa..d5c3842 100644
> --- a/target/mips/msa_helper.c
> +++ b/target/mips/msa_helper.c
> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>      } while (0)
>  MSA_FN_DF(ilvr_df)
>  #undef MSA_DO
> -
> -#define MSA_DO(DF)                      \
> -    do {                                \
> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> -    } while (0)
> -MSA_FN_DF(ilvev_df)
> -#undef MSA_DO
> -
>  #undef MSA_LOOP_COND
>
>  #define MSA_LOOP_COND(DF) \
> diff --git a/target/mips/translate.c b/target/mips/translate.c
> index 04406d6..e26c6a6 100644
> --- a/target/mips/translate.c
> +++ b/target/mips/translate.c
> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState
*env, uint32_t wd,
>      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>  }
>
> +/*
> + * [MSA] ILVEV.B wd, ws, wt
> + *
> + *   Vector Interleave Even (byte data elements)
> + *
> + */
> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 8);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVEV.H wd, ws, wt
> + *
> + *   Vector Interleave Even (halfword data elements)
> + *
> + */
> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    TCGv_i64 t2 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x0000ffff0000ffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> +    tcg_gen_shli_i64(t2, t2, 16);
> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> +
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
> +}
> +
> +/*
> + * [MSA] ILVEV.W wd, ws, wt
> + *
> + *   Vector Interleave Even (word data elements)
> + *
> + */
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00000000ffffffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1],
32, 32);
> +
> +    tcg_temp_free_i64(t1);
> +}
> +
> +/*
> + * [MSA] ILVEV.D wd, ws, wt
> + *
> + *   Vector Interleave Even (Double data elements)
> + *
> + */
> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> +}
> +
>  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>  {
>  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env,
DisasContext *ctx)
>          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>          break;
>      case OPC_ILVEV_df:
> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> +        switch (df) {
> +        case DF_BYTE:
> +            gen_ilvev_b(env, wd, ws, wt);
> +            break;
> +        case DF_HALF:
> +            gen_ilvev_h(env, wd, ws, wt);
> +            break;
> +        case DF_WORD:
> +            gen_ilvev_w(env, wd, ws, wt);
> +            break;
> +        case DF_DOUBLE:
> +            gen_ilvev_d(env, wd, ws, wt);
> +            break;
> +        default:
> +            assert(0);
> +        }
>          break;
>      case OPC_BINSR_df:
>          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> --
> 2.7.4
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 1/2] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 1/2] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
@ 2019-04-03  7:46   ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2019-04-03  7:46 UTC (permalink / raw)
  To: Mateja Marjanovic, qemu-devel; +Cc: aurelien, amarkovic, arikalo

On 4/2/19 10:15 PM, Mateja Marjanovic wrote:
> +static inline void gen_ilvod_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    const uint64_t mask = 0xffffffff00000000ULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_shri_i64(t1, t1, 32);

The andi is useless.  The bits that it discards are also discarded by the shift.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 16:19   ` Philippe Mathieu-Daudé
  2019-04-02 17:07     ` Aleksandar Markovic
@ 2019-04-03  7:49     ` Richard Henderson
  1 sibling, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2019-04-03  7:49 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Mateja Marjanovic, qemu-devel
  Cc: arikalo, amarkovic, aurelien

On 4/2/19 11:19 PM, Philippe Mathieu-Daudé wrote:
>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>> +                               uint32_t ws, uint32_t wt)
>> +{
>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>> +    tcg_gen_shli_i64(t2, t2, 8);
>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>> +
> 
> Richard, is it cheaper to use another register to keep the constant mask
> (here reused 4x)?
> 
> Such:
> 
>        TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
> 
>        tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>        tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>        tcg_gen_shli_i64(t2, t2, 8);
>        tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);

With the current state of the tcg optimizer, yes.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
                     ` (3 preceding siblings ...)
  2019-04-02 23:25   ` Aleksandar Markovic
@ 2019-04-03  7:52   ` Richard Henderson
  4 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2019-04-03  7:52 UTC (permalink / raw)
  To: Mateja Marjanovic, qemu-devel; +Cc: aurelien, amarkovic, arikalo

On 4/2/19 10:15 PM, Mateja Marjanovic wrote:
> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> +                               uint32_t ws, uint32_t wt)
> +{
> +    TCGv_i64 t1 = tcg_temp_new_i64();
> +    const uint64_t mask = 0x00000000ffffffffULL;
> +
> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);

The andi of mask is redundant with the deposit.  Remove it.
This should be just

    tcg_gen_deposit_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2],
                        msa_wr_d[ws * 2], 32, 32);


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 22:28   ` Aleksandar Markovic
@ 2019-04-03  7:53     ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2019-04-03  7:53 UTC (permalink / raw)
  To: Aleksandar Markovic, Mateja Marjanovic
  Cc: amarkovic, qemu-devel, aurelien, arikalo

On 4/3/19 5:28 AM, Aleksandar Markovic wrote:
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
>> +
>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
>> +
>> +    tcg_temp_free_i64(t1);
>> +}
>> +
> 
> This can be further optimized this way: (doublecheck the accuracy, writing from
> home)
> 
> gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>                                 uint32_t ws, uint32_t wt)
> {
>     tcg_gen_shli_i64(msa_wr_d[wd * 2], msa_wr_d[ws * 2], 32);
>     tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[wt * 2], 0, 32);
> 

No, the shift can be performed by the deposit.
See my other reply in this thread.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 18:37       ` Philippe Mathieu-Daudé
@ 2019-04-03  8:32         ` Mateja Marjanovic
  0 siblings, 0 replies; 19+ messages in thread
From: Mateja Marjanovic @ 2019-04-03  8:32 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Aleksandar Markovic, qemu-devel
  Cc: Aleksandar Rikalo, richard.henderson, aurelien


On 2.4.19. 20:37, Philippe Mathieu-Daudé wrote:
> On 4/2/19 7:07 PM, Aleksandar Markovic wrote:
>>> From: Philippe Mathieu-Daudé <philmd@redhat.com>
>>> Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
>>>
>>> Hi Mateja,
>>>
>>> On 4/2/19 5:15 PM, Mateja Marjanovic wrote:
>>>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>>>
>>>> Optimize set of MSA instructions ILVEV, using directly
>>>> tcg registers and performing logic on them instead of
>>>> using helpers.
>>> Maybe you can still let this previous comment (if still valid):
>>>
>>>    Performance measurement is done by executing the
>>>    instructions large number of times on a computer
>>>    with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>>
>> Agreed.
I will add that in v6.
>>
>>>> In the following table, the first column is the performance
>>>> before this patch. The second represents the performance,
>>>> after converting from helpers to tcg, but without using
>>>> tcg_gen_deposit function. The third one is the solution
>>>> which is implemented in this patch.
>>> You are describing the "no-deposit" which refers to a previous series
>>> but won't be accessible in the git repository.
>>>
>>> I think this table is useful in the cover of this series, and in the
>>> commit message you should use || before || after || and drop the
>>> "no-deposit" case.
>>>
>>>>   instr    ||   before    || no-deposit ||  with-deposit
>>>> ========================================================
>>>>   ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>>>>   ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>>>>   ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>>>>   ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>>>>
>>>> The solution with deposit is suggested by Richard Henderson.
>>>>
>> I think the table should remain in the commit message, to keep it
>> visible in the git logs.
>>
>> You could insert the "no-deposit" source code of gen_ilvev_w()
>> in the commit message, for reference reasons - it is not a too
>> large function.
> Clever :)
I agree, I will add the code in the commit message in v6.
>
>> Thanks,
>> Aleksandar
>>
>>> The gitdm parsable form is:
>>>
>>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>>>
>>>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>>>> ---
>>>>   target/mips/helper.h     |   1 -
>>>>   target/mips/msa_helper.c |   9 ----
>>>>   target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>>   3 files changed, 104 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>>>> index 02e16c7..82f6a40 100644
>>>> --- a/target/mips/helper.h
>>>> +++ b/target/mips/helper.h
>>>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>>>   DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>>>   DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>>>   DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>>>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>>>   DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>>>   DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>>>   DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>>>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>>>> index a7ea6aa..d5c3842 100644
>>>> --- a/target/mips/msa_helper.c
>>>> +++ b/target/mips/msa_helper.c
>>>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>>>       } while (0)
>>>>   MSA_FN_DF(ilvr_df)
>>>>   #undef MSA_DO
>>>> -
>>>> -#define MSA_DO(DF)                      \
>>>> -    do {                                \
>>>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>>>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>>>> -    } while (0)
>>>> -MSA_FN_DF(ilvev_df)
>>>> -#undef MSA_DO
>>>> -
>>>>   #undef MSA_LOOP_COND
>>>>
>>>>   #define MSA_LOOP_COND(DF) \
>>>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>>>> index 04406d6..e26c6a6 100644
>>>> --- a/target/mips/translate.c
>>>> +++ b/target/mips/translate.c
>>>> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>>>       tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>>>   }
>>>>
>>>> +/*
>>>> + * [MSA] ILVEV.B wd, ws, wt
>>>> + *
>>>> + *   Vector Interleave Even (byte data elements)
>>>> + *
>>>> + */
>>>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>>>> +                               uint32_t ws, uint32_t wt)
>>>> +{
>>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>>> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
>>>> +
>>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>>> +
>>> Richard, is it cheaper to use another register to keep the constant mask
>>> (here reused 4x)?
>>>
>>> Such:
>>>
>>>         TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>>>
>>>         tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>>>         tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>>>         tcg_gen_shli_i64(t2, t2, 8);
>>>         tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>>
>>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>> Here use tcg_gen_and_i64() too.
>>>
>>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>>> +
>>>         tcg_temp_free_i64(mask);
>>>
>>>> +    tcg_temp_free_i64(t1);
>>>> +    tcg_temp_free_i64(t2);
>>> Mateja: Can you test for perf easily?
I will test that, and notify you with the results.
>>>
>>>> +}
>>>> +
>>>> +/*
>>>> + * [MSA] ILVEV.H wd, ws, wt
>>>> + *
>>>> + *   Vector Interleave Even (halfword data elements)
>>>> + *
>>>> + */
>>>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>>>> +                               uint32_t ws, uint32_t wt)
>>>> +{
>>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>>> +    const uint64_t mask = 0x0000ffff0000ffffULL;
>>>> +
>>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>>> +
>>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>>> +
>>>> +    tcg_temp_free_i64(t1);
>>>> +    tcg_temp_free_i64(t2);
>>> Very similiar to gen_ilvev_b(), changing the mask and shift. Maybe worth
>>> a refactor.
>>>
>>>> +}
>>>> +
>>>> +/*
>>>> + * [MSA] ILVEV.W wd, ws, wt
>>>> + *
>>>> + *   Vector Interleave Even (word data elements)
>>>> + *
>>>> + */
>>>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>>> +                               uint32_t ws, uint32_t wt)
>>>> +{
>>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>>> +    const uint64_t mask = 0x00000000ffffffffULL;
>>>> +
>>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
>>>> +
>>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
>>>> +
>>>> +    tcg_temp_free_i64(t1);
>>>> +}
>>>> +
>>>> +/*
>>>> + * [MSA] ILVEV.D wd, ws, wt
>>>> + *
>>>> + *   Vector Interleave Even (Double data elements)
>>>> + *
>>>> + */
>>>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>>>> +                               uint32_t ws, uint32_t wt)
>>>> +{
>>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>>>> +}
>>>> +
>>>>   static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>>   {
>>>>   #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>>>> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>>           gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>>>           break;
>>>>       case OPC_ILVEV_df:
>>>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>>>> +        switch (df) {
>>>> +        case DF_BYTE:
>>>> +            gen_ilvev_b(env, wd, ws, wt);
>>>> +            break;
>>>> +        case DF_HALF:
>>>> +            gen_ilvev_h(env, wd, ws, wt);
>>>> +            break;
>>>> +        case DF_WORD:
>>>> +            gen_ilvev_w(env, wd, ws, wt);
>>>> +            break;
>>>> +        case DF_DOUBLE:
>>>> +            gen_ilvev_d(env, wd, ws, wt);
>>>> +            break;
>>>> +        default:
>>>> +            assert(0);
>>>> +        }
>>>>           break;
>>>>       case OPC_BINSR_df:
>>>>           gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>>>>
>>> Thanks,
>>>
>>> Phil.
>>>
Thanks,
Mateja
>> ________________________________________
>> From: Philippe Mathieu-Daudé <philmd@redhat.com>
>> Sent: Tuesday, April 2, 2019 6:19:19 PM
>> To: Mateja Marjanovic; qemu-devel@nongnu.org
>> Cc: Aleksandar Rikalo; richard.henderson@linaro.org; Aleksandar Markovic; aurelien@aurel32.net
>> Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
>>
>> Hi Mateja,
>>
>> On 4/2/19 5:15 PM, Mateja Marjanovic wrote:
>>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
>>>
>>> Optimize set of MSA instructions ILVEV, using directly
>>> tcg registers and performing logic on them instead of
>>> using helpers.
>> Maybe you can still let this previous comment (if still valid):
>>
>>    Performance measurement is done by executing the
>>    instructions large number of times on a computer
>>    with Intel Core i7-3770 CPU @ 3.40GHz×8.
>>
>>> In the following table, the first column is the performance
>>> before this patch. The second represents the performance,
>>> after converting from helpers to tcg, but without using
>>> tcg_gen_deposit function. The third one is the solution
>>> which is implemented in this patch.
>> You are describing the "no-deposit" which refers to a previous series
>> but won't be accessible in the git repository.
>>
>> I think this table is useful in the cover of this series, and in the
>> commit message you should use || before || after || and drop the
>> "no-deposit" case.
>>
>>>   instr    ||   before    || no-deposit ||  with-deposit
>>> ========================================================
>>>   ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>>>   ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>>>   ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>>>   ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>>>
>>> The solution with deposit is suggested by Richard Henderson.
>>>
>> The gitdm parsable form is:
>>
>> Suggested-by: Richard Henderson <richard.henderson@linaro.org>
>>
>>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
>>> ---
>>>   target/mips/helper.h     |   1 -
>>>   target/mips/msa_helper.c |   9 ----
>>>   target/mips/translate.c  | 105 ++++++++++++++++++++++++++++++++++++++++++++++-
>>>   3 files changed, 104 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/target/mips/helper.h b/target/mips/helper.h
>>> index 02e16c7..82f6a40 100644
>>> --- a/target/mips/helper.h
>>> +++ b/target/mips/helper.h
>>> @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, i32, i32)
>>>   DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
>>>   DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
>>>   DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
>>> -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
>>>   DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
>>>   DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
>>>   DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
>>> diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
>>> index a7ea6aa..d5c3842 100644
>>> --- a/target/mips/msa_helper.c
>>> +++ b/target/mips/msa_helper.c
>>> @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
>>>       } while (0)
>>>   MSA_FN_DF(ilvr_df)
>>>   #undef MSA_DO
>>> -
>>> -#define MSA_DO(DF)                      \
>>> -    do {                                \
>>> -        pwx->DF[2*i]   = pwt->DF[2*i];  \
>>> -        pwx->DF[2*i+1] = pws->DF[2*i];  \
>>> -    } while (0)
>>> -MSA_FN_DF(ilvev_df)
>>> -#undef MSA_DO
>>> -
>>>   #undef MSA_LOOP_COND
>>>
>>>   #define MSA_LOOP_COND(DF) \
>>> diff --git a/target/mips/translate.c b/target/mips/translate.c
>>> index 04406d6..e26c6a6 100644
>>> --- a/target/mips/translate.c
>>> +++ b/target/mips/translate.c
>>> @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState *env, uint32_t wd,
>>>       tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
>>>   }
>>>
>>> +/*
>>> + * [MSA] ILVEV.B wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (byte data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>> +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>> +
>> Richard, is it cheaper to use another register to keep the constant mask
>> (here reused 4x)?
>>
>> Such:
>>
>>         TCGv_i64 mask = tcg_const_i64(0x00ff00ff00ff00ffULL);
>>
>>         tcg_gen_and_i64(t1, msa_wr_d[wt * 2], mask);
>>         tcg_gen_and_i64(t2, msa_wr_d[ws * 2], mask);
>>         tcg_gen_shli_i64(t2, t2, 8);
>>         tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>> Here use tcg_gen_and_i64() too.
>>
>>> +    tcg_gen_shli_i64(t2, t2, 8);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>> +
>>         tcg_temp_free_i64(mask);
>>
>>> +    tcg_temp_free_i64(t1);
>>> +    tcg_temp_free_i64(t2);
>> Mateja: Can you test for perf easily?
>>
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.H wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (halfword data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    TCGv_i64 t2 = tcg_temp_new_i64();
>>> +    const uint64_t mask = 0x0000ffff0000ffffULL;
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
>>> +    tcg_gen_shli_i64(t2, t2, 16);
>>> +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
>>> +
>>> +    tcg_temp_free_i64(t1);
>>> +    tcg_temp_free_i64(t2);
>> Very similiar to gen_ilvev_b(), changing the mask and shift. Maybe worth
>> a refactor.
>>
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.W wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (word data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    TCGv_i64 t1 = tcg_temp_new_i64();
>>> +    const uint64_t mask = 0x00000000ffffffffULL;
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 32);
>>> +
>>> +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
>>> +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 1], 32, 32);
>>> +
>>> +    tcg_temp_free_i64(t1);
>>> +}
>>> +
>>> +/*
>>> + * [MSA] ILVEV.D wd, ws, wt
>>> + *
>>> + *   Vector Interleave Even (Double data elements)
>>> + *
>>> + */
>>> +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
>>> +                               uint32_t ws, uint32_t wt)
>>> +{
>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
>>> +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
>>> +}
>>> +
>>>   static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>   {
>>>   #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
>>> @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
>>>           gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
>>>           break;
>>>       case OPC_ILVEV_df:
>>> -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
>>> +        switch (df) {
>>> +        case DF_BYTE:
>>> +            gen_ilvev_b(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_HALF:
>>> +            gen_ilvev_h(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_WORD:
>>> +            gen_ilvev_w(env, wd, ws, wt);
>>> +            break;
>>> +        case DF_DOUBLE:
>>> +            gen_ilvev_d(env, wd, ws, wt);
>>> +            break;
>>> +        default:
>>> +            assert(0);
>>> +        }
>>>           break;
>>>       case OPC_BINSR_df:
>>>           gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
>>>
>> Thanks,
>>
>> Phil.
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 18:51   ` Aleksandar Markovic
@ 2019-04-03  8:32     ` Mateja Marjanovic
  0 siblings, 0 replies; 19+ messages in thread
From: Mateja Marjanovic @ 2019-04-03  8:32 UTC (permalink / raw)
  To: Aleksandar Markovic, qemu-devel
  Cc: aurelien, richard.henderson, Aleksandar Rikalo


On 2.4.19. 20:51, Aleksandar Markovic wrote:
>> +/*
>> + * [MSA] ILVEV.D wd, ws, wt
>> + *
>> + *   Vector Interleave Even (Double data elements)
>> + *
>> + */
> Double -> Doubleword

I will change it in v6.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 23:25   ` Aleksandar Markovic
@ 2019-04-03  8:50     ` Mateja Marjanovic
  2019-04-03 10:03       ` Aleksandar Markovic
  2019-04-03  9:49     ` Richard Henderson
  1 sibling, 1 reply; 19+ messages in thread
From: Mateja Marjanovic @ 2019-04-03  8:50 UTC (permalink / raw)
  To: Aleksandar Markovic
  Cc: amarkovic, qemu-devel, richard.henderson, aurelien, arikalo


On 3.4.19. 01:25, Aleksandar Markovic wrote:
>
>
> On Apr 2, 2019 5:20 PM, "Mateja Marjanovic" 
> <mateja.marjanovic@rt-rk.com <mailto:mateja.marjanovic@rt-rk.com>> wrote:
> >
> > From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com 
> <mailto:Mateja.Marjanovic@rt-rk.com>>
> >
> > Optimize set of MSA instructions ILVEV, using directly
>
> Use full instruction names, with the only exception of possible 
> Bachus-Naur forms... again.
>
I will, I didn't change it from some of the previous versions.
>
> > tcg registers and performing logic on them instead of
> > using helpers.
> >
> > In the following table, the first column is the performance
> > before this patch. The second represents the performance,
> > after converting from helpers to tcg, but without using
> > tcg_gen_deposit function. The third one is the solution
> > which is implemented in this patch.
> >
> >  instr    ||   before    || no-deposit ||  with-deposit
> > ========================================================
> >  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
> >  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
> >  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
> >  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
> >
> > The solution with deposit is suggested by Richard Henderson.
> >
> > Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com 
> <mailto:mateja.marjanovic@rt-rk.com>>
> > ---
>
> The byte and halfword cases of this patch most likely produce highly 
> unoptimized code for cases:
>
> wd == wt == ws
> wd == wt != ws
> wd != ws == wt
> wd == ws != wt
>
> Please take these cases into account.
>
> The same for patch 1/2.
>
Maybe, but if I put if statements asking are the registers the same,
it would affect the performance significantly in all cases. If some
registers were equal, it would be faster, but if not, just those if 
statements
would slow things down.
>
> Thanks,
> Aleksandar
>
Thanks,
Mateja

> >  target/mips/helper.h     |   1 -
> >  target/mips/msa_helper.c |   9 ----
> >  target/mips/translate.c  | 105 
> ++++++++++++++++++++++++++++++++++++++++++++++-
> >  3 files changed, 104 insertions(+), 11 deletions(-)
> >
> > diff --git a/target/mips/helper.h b/target/mips/helper.h
> > index 02e16c7..82f6a40 100644
> > --- a/target/mips/helper.h
> > +++ b/target/mips/helper.h
> > @@ -864,7 +864,6 @@ DEF_HELPER_5(msa_pckev_df, void, env, i32, i32, 
> i32, i32)
> >  DEF_HELPER_5(msa_pckod_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_ilvl_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_ilvr_df, void, env, i32, i32, i32, i32)
> > -DEF_HELPER_5(msa_ilvev_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_vshf_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_srar_df, void, env, i32, i32, i32, i32)
> >  DEF_HELPER_5(msa_srlr_df, void, env, i32, i32, i32, i32)
> > diff --git a/target/mips/msa_helper.c b/target/mips/msa_helper.c
> > index a7ea6aa..d5c3842 100644
> > --- a/target/mips/msa_helper.c
> > +++ b/target/mips/msa_helper.c
> > @@ -1197,15 +1197,6 @@ MSA_FN_DF(ilvl_df)
> >      } while (0)
> >  MSA_FN_DF(ilvr_df)
> >  #undef MSA_DO
> > -
> > -#define MSA_DO(DF)                      \
> > -    do {                                \
> > -        pwx->DF[2*i]   = pwt->DF[2*i];  \
> > -        pwx->DF[2*i+1] = pws->DF[2*i];  \
> > -    } while (0)
> > -MSA_FN_DF(ilvev_df)
> > -#undef MSA_DO
> > -
> >  #undef MSA_LOOP_COND
> >
> >  #define MSA_LOOP_COND(DF) \
> > diff --git a/target/mips/translate.c b/target/mips/translate.c
> > index 04406d6..e26c6a6 100644
> > --- a/target/mips/translate.c
> > +++ b/target/mips/translate.c
> > @@ -28974,6 +28974,94 @@ static inline void gen_ilvod_d(CPUMIPSState 
> *env, uint32_t wd,
> >      tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2 + 1]);
> >  }
> >
> > +/*
> > + * [MSA] ILVEV.B wd, ws, wt
> > + *
> > + *   Vector Interleave Even (byte data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_b(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    TCGv_i64 t2 = tcg_temp_new_i64();
> > +    const uint64_t mask = 0x00ff00ff00ff00ffULL;
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> > +    tcg_gen_shli_i64(t2, t2, 8);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> > +    tcg_gen_shli_i64(t2, t2, 8);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> > +
> > +    tcg_temp_free_i64(t1);
> > +    tcg_temp_free_i64(t2);
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.H wd, ws, wt
> > + *
> > + *   Vector Interleave Even (halfword data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_h(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    TCGv_i64 t2 = tcg_temp_new_i64();
> > +    const uint64_t mask = 0x0000ffff0000ffffULL;
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2], mask);
> > +    tcg_gen_shli_i64(t2, t2, 16);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2], t1, t2);
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_andi_i64(t2, msa_wr_d[ws * 2 + 1], mask);
> > +    tcg_gen_shli_i64(t2, t2, 16);
> > +    tcg_gen_or_i64(msa_wr_d[wd * 2 + 1], t1, t2);
> > +
> > +    tcg_temp_free_i64(t1);
> > +    tcg_temp_free_i64(t2);
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.W wd, ws, wt
> > + *
> > + *   Vector Interleave Even (word data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_w(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    TCGv_i64 t1 = tcg_temp_new_i64();
> > +    const uint64_t mask = 0x00000000ffffffffULL;
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2], mask);
> > +    tcg_gen_deposit_i64(msa_wr_d[wd * 2], t1, msa_wr_d[ws * 2], 32, 
> 32);
> > +
> > +    tcg_gen_andi_i64(t1, msa_wr_d[wt * 2 + 1], mask);
> > +    tcg_gen_deposit_i64(msa_wr_d[wd * 2 + 1], t1, msa_wr_d[ws * 2 + 
> 1], 32, 32);
> > +
> > +    tcg_temp_free_i64(t1);
> > +}
> > +
> > +/*
> > + * [MSA] ILVEV.D wd, ws, wt
> > + *
> > + *   Vector Interleave Even (Double data elements)
> > + *
> > + */
> > +static inline void gen_ilvev_d(CPUMIPSState *env, uint32_t wd,
> > +                               uint32_t ws, uint32_t wt)
> > +{
> > +    tcg_gen_mov_i64(msa_wr_d[wd * 2 + 1], msa_wr_d[ws * 2]);
> > +    tcg_gen_mov_i64(msa_wr_d[wd * 2], msa_wr_d[wt * 2]);
> > +}
> > +
> >  static void gen_msa_3r(CPUMIPSState *env, DisasContext *ctx)
> >  {
> >  #define MASK_MSA_3R(op)    (MASK_MSA_MINOR(op) | (op & (0x7 << 23)))
> > @@ -29130,7 +29218,22 @@ static void gen_msa_3r(CPUMIPSState *env, 
> DisasContext *ctx)
> >          gen_helper_msa_mod_s_df(cpu_env, tdf, twd, tws, twt);
> >          break;
> >      case OPC_ILVEV_df:
> > -        gen_helper_msa_ilvev_df(cpu_env, tdf, twd, tws, twt);
> > +        switch (df) {
> > +        case DF_BYTE:
> > +            gen_ilvev_b(env, wd, ws, wt);
> > +            break;
> > +        case DF_HALF:
> > +            gen_ilvev_h(env, wd, ws, wt);
> > +            break;
> > +        case DF_WORD:
> > +            gen_ilvev_w(env, wd, ws, wt);
> > +            break;
> > +        case DF_DOUBLE:
> > +            gen_ilvev_d(env, wd, ws, wt);
> > +            break;
> > +        default:
> > +            assert(0);
> > +        }
> >          break;
> >      case OPC_BINSR_df:
> >          gen_helper_msa_binsr_df(cpu_env, tdf, twd, tws, twt);
> > --
> > 2.7.4
> >
> >
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-02 23:25   ` Aleksandar Markovic
  2019-04-03  8:50     ` Mateja Marjanovic
@ 2019-04-03  9:49     ` Richard Henderson
  2019-04-03 10:18       ` Aleksandar Markovic
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2019-04-03  9:49 UTC (permalink / raw)
  To: Aleksandar Markovic, Mateja Marjanovic
  Cc: amarkovic, qemu-devel, aurelien, arikalo

On 4/3/19 6:25 AM, Aleksandar Markovic wrote:
> 
> On Apr 2, 2019 5:20 PM, "Mateja Marjanovic" <mateja.marjanovic@rt-rk.com
> <mailto:mateja.marjanovic@rt-rk.com>> wrote:
>>
>> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com
> <mailto:Mateja.Marjanovic@rt-rk.com>>
>>
>> Optimize set of MSA instructions ILVEV, using directly
> 
> Use full instruction names, with the only exception of possible Bachus-Naur
> forms... again.
> 
>> tcg registers and performing logic on them instead of
>> using helpers.
>>
>> In the following table, the first column is the performance
>> before this patch. The second represents the performance,
>> after converting from helpers to tcg, but without using
>> tcg_gen_deposit function. The third one is the solution
>> which is implemented in this patch.
>>
>>  instr    ||   before    || no-deposit ||  with-deposit
>> ========================================================
>>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
>>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
>>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
>>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
>>
>> The solution with deposit is suggested by Richard Henderson.
>>
>> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com
> <mailto:mateja.marjanovic@rt-rk.com>>
>> ---
> 
> The byte and halfword cases of this patch most likely produce highly
> unoptimized code for cases:
> 
> wd == wt == ws
> wd == wt != ws
> wd != ws == wt
> wd == ws != wt
> 
> Please take these cases into account.

I beg to differ.  We want to minimize the amount of special cases.

If you multiply the different cases like this you also multiply the maintenance
overhead.  You force future maintainers to wonder if the cases are truly
distinct or if they are mere optimization.

The only special cases that I advocate that you add are driven by standard
macros that the assembler generates -- e.g. register move (via or), register
negate (via nor), etc.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-03  8:50     ` Mateja Marjanovic
@ 2019-04-03 10:03       ` Aleksandar Markovic
  0 siblings, 0 replies; 19+ messages in thread
From: Aleksandar Markovic @ 2019-04-03 10:03 UTC (permalink / raw)
  To: Mateja Marjanovic, Aleksandar Markovic
  Cc: qemu-devel, richard.henderson, aurelien, Aleksandar Rikalo

> > From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com>
> > Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA > > instructions
> > 
> > 
> > On 3.4.19. 01:25, Aleksandar Markovic wrote:
> > 
> > On Apr 2, 2019 5:20 PM, "Mateja Marjanovic" <mateja.marjanovic@rt-rk.com<> > mailto:mateja.marjanovic@rt-rk.com>> wrote:
> > >
> > > From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com<> > mailto:Mateja.Marjanovic@rt-rk.com>>
> > >
> > > Optimize set of MSA instructions ILVEV
> > >
> > > ...
> > 
> > The byte and halfword cases of this patch most likely produce highly unoptimized code > > for cases:
> > 
> > wd == wt == ws
> > wd == wt != ws
> > wd != ws == wt
> > wd == ws != wt
> > 
> > Please take these cases into account.
> > 
> > The same for patch 1/2.
> 
> Maybe, but if I put if statements asking are the registers the same,
> it would affect the performance significantly in all cases. If some
> registers were equal, it would be faster, but if not, just those if statements
> would slow things down.

Mateja,

It won't affect the performance significantly at all. Distinguish between
the code executed in translate time (rarely) and execute time (often).
If statement you mention are executed in translate time.

Thanks,
Aleksandar

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA instructions
  2019-04-03  9:49     ` Richard Henderson
@ 2019-04-03 10:18       ` Aleksandar Markovic
  0 siblings, 0 replies; 19+ messages in thread
From: Aleksandar Markovic @ 2019-04-03 10:18 UTC (permalink / raw)
  To: Richard Henderson, Aleksandar Markovic, Mateja Marjanovic
  Cc: qemu-devel, aurelien, Aleksandar Rikalo

> From: Richard Henderson <richard.henderson@linaro.org>
> Subject: Re: [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> MSA > instructions
> 
> On 4/3/19 6:25 AM, Aleksandar Markovic wrote:
> >
> > On Apr 2, 2019 5:20 PM, "Mateja Marjanovic" <mateja.marjanovic@rt-rk.com
> > <mailto:mateja.marjanovic@rt-rk.com>> wrote:
> >>
> >> From: Mateja Marjanovic <Mateja.Marjanovic@rt-rk.com
> > <mailto:Mateja.Marjanovic@rt-rk.com>>
> >>
> >> Optimize set of MSA instructions ILVEV, using directly
> >
> > Use full instruction names, with the only exception of possible Bachus-Naur
> > forms... again.
> >
> >> tcg registers and performing logic on them instead of
> >> using helpers.
> >>
> >> In the following table, the first column is the performance
> >> before this patch. The second represents the performance,
> >> after converting from helpers to tcg, but without using
> >> tcg_gen_deposit function. The third one is the solution
> >> which is implemented in this patch.
> >>
> >>  instr    ||   before    || no-deposit ||  with-deposit
> >> ========================================================
> >>  ilvev.b  ||  126.92 ms  ||  24.52 ms  ||  24.43 ms
> >>  ilvev.h  ||   93.67 ms  ||  23.92 ms  ||  23.86 ms
> >>  ilvev.w  ||  117.86 ms  ||  23.83 ms  ||  22.17 ms
> >>  ilvev.d  ||   45.49 ms  ||  19.74 ms  ||  19.71 ms
> >>
> >> The solution with deposit is suggested by Richard Henderson.
> >>
> >> Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com
> > <mailto:mateja.marjanovic@rt-rk.com>>
> >> ---
> >
> > The byte and halfword cases of this patch most likely produce highly
> > unoptimized code for cases:
> >
> > wd == wt == ws
> > wd == wt != ws
> > wd != ws == wt
> > wd == ws != wt
> >
> > Please take these cases into account.
> 
> I beg to differ.  We want to minimize the amount of special cases.
> 
> If you multiply the different cases like this you also multiply the maintenance
> overhead.  You force future maintainers to wonder if the cases are truly
> distinct or if they are mere optimization.
> 

I find your objection hard to understand.

The subject and the goal of the patch is obviously optimization. If there is
a danger of unclarity in the resulting source code, this is easily alleviated
by, for example, inserting informative comments, as is routinely done in
many other areas of QEMU or any software product.

Sincerely,
Aleksandar

> The only special cases that I advocate that you add are driven by standard
> macros that the assembler generates -- e.g. register move (via or), register
> negate (via nor), etc.
> 
> 
> r~

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-04-03 10:18 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-02 15:15 [Qemu-devel] [PATCH v5 0/2] target/mips: Optimize MSA <ILVEV|ILVOD>.<B|H|W|D> instructions Mateja Marjanovic
2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 1/2] target/mips: Optimize ILVOD.<B|H|W|D> MSA instructions Mateja Marjanovic
2019-04-03  7:46   ` Richard Henderson
2019-04-02 15:15 ` [Qemu-devel] [PATCH v5 2/2] target/mips: Optimize ILVEV.<B|H|W|D> " Mateja Marjanovic
2019-04-02 16:19   ` Philippe Mathieu-Daudé
2019-04-02 17:07     ` Aleksandar Markovic
2019-04-02 18:37       ` Philippe Mathieu-Daudé
2019-04-03  8:32         ` Mateja Marjanovic
2019-04-03  7:49     ` Richard Henderson
2019-04-02 18:51   ` Aleksandar Markovic
2019-04-03  8:32     ` Mateja Marjanovic
2019-04-02 22:28   ` Aleksandar Markovic
2019-04-03  7:53     ` Richard Henderson
2019-04-02 23:25   ` Aleksandar Markovic
2019-04-03  8:50     ` Mateja Marjanovic
2019-04-03 10:03       ` Aleksandar Markovic
2019-04-03  9:49     ` Richard Henderson
2019-04-03 10:18       ` Aleksandar Markovic
2019-04-03  7:52   ` Richard Henderson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.