qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions
@ 2019-08-27  9:37 Stefan Brankovic
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Stefan Brankovic @ 2019-08-27  9:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: richard.henderson, david

Optimize emulation of ten Altivec instructions: lvsl, lvsr, vsl, vsr, vpkpx,
vgbbd, vclzb, vclzh, vclzw and vclzd.

This series builds on and complements the recent work of Thomas Murta, Mark
Cave-Ayland and Richard Henderson in the same area. It is based on devising
TCG translations for the selected instructions rather than using helpers.
The selected instructions are mostly idiosyncratic to the ppc platform, so a
direct mapping to a host instruction is not possible; a relatively complex
TCG translation therefore seems to be the best option, and that is the
approach presented in this series. The performance improvements are
significant in all cases.

V6:

Rebased the series on the latest QEMU code.
Excluded all patches that are already accepted.

V5:

Fixed the vpkpx bug and added the instruction back into the series.
Fixed graphical distortions on OSX 10.3 and 10.4.
Removed the conversion of vmrgh and vmrgl instructions to vector operations
for further investigation.

V4:

Addressed Richard Henderson's suggestions.
Removed the vpkpx optimization for further investigation of the graphical
distortions it caused on OSX 10.2-10.4 guests.
Added opcodes for vector vmrgh(b|h|w) and vmrgl(b|h|w) in TCG.
Implemented vector vmrgh and vmrgl instructions for i386.
Converted vmrgh and vmrgl instructions to vector operations.

V3:

Fixed a build problem.

V2:

Addressed Richard Henderson's suggestions.
Fixed a build problem in patch 2/8.
Rebased the series on the latest QEMU code.

Stefan Brankovic (3):
  target/ppc: Optimize emulation of vpkpx instruction
  target/ppc: Optimize emulation of vclzh and vclzb instructions
  target/ppc: Refactor emulation of vmrgew and vmrgow instructions

 target/ppc/helper.h                 |   3 -
 target/ppc/int_helper.c             |  30 ----
 target/ppc/translate/vmx-impl.inc.c | 301 ++++++++++++++++++++++++++++++++----
 3 files changed, 269 insertions(+), 65 deletions(-)

-- 
2.7.4




* [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-08-27  9:37 [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
@ 2019-08-27  9:37 ` Stefan Brankovic
  2019-08-27 18:52   ` Richard Henderson
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
  2 siblings, 1 reply; 12+ messages in thread
From: Stefan Brankovic @ 2019-08-27  9:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: richard.henderson, david

Optimize Altivec instruction vpkpx (Vector Pack Pixel).
It rearranges 8 pixels coded in a 6-5-5 pattern (4 from each source register)
into a contiguous array of bits in the destination register.

The outer loop has 4 iterations. In each iteration, the 6-5-5 pack is
performed for the 2 pixels of one doubleword element of one source register.
The first step of the outer loop is choosing which doubleword element of
which register is used in the current iteration and placing it in the avr
variable. The next step is to perform the 6-5-5 pack of pixels on the avr
variable in the inner for loop (2 iterations, 1 per pixel) and save the
result in the tmp variable. At the end of the outer for loop, the result is
merged into the variable called result and, once the whole doubleword is
finished (every second iteration), saved in the appropriate doubleword
element of vD.
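
For reference, the per-pixel operation implemented below with TCG ops
corresponds to the following scalar sketch (pack_pixel_655 is an illustrative
name only; the shifts and masks are those of the helper being removed):

    /* Pack one 32-bit source pixel into a 16-bit 6-5-5 result. */
    static inline uint16_t pack_pixel_655(uint32_t e)
    {
        return ((e >> 9) & 0xfc00) |   /* 6 bits */
               ((e >> 6) & 0x03e0) |   /* 5 bits */
               ((e >> 3) & 0x001f);    /* 5 bits */
    }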

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |  1 -
 target/ppc/int_helper.c             | 21 --------
 target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 98 insertions(+), 23 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 54ea9b9..940a115 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -258,7 +258,6 @@ DEF_HELPER_4(vpkudus, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkuhum, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkuwum, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkudum, void, env, avr, avr, avr)
-DEF_HELPER_3(vpkpx, void, avr, avr, avr)
 DEF_HELPER_5(vmhaddshs, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vmhraddshs, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vmsumuhm, void, env, avr, avr, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 46deb57..9ff3b03 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1262,27 +1262,6 @@ void helper_vpmsumd(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 #else
 #define PKBIG 0
 #endif
-void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
-{
-    int i, j;
-    ppc_avr_t result;
-#if defined(HOST_WORDS_BIGENDIAN)
-    const ppc_avr_t *x[2] = { a, b };
-#else
-    const ppc_avr_t *x[2] = { b, a };
-#endif
-
-    VECTOR_FOR_INORDER_I(i, u64) {
-        VECTOR_FOR_INORDER_I(j, u32) {
-            uint32_t e = x[i]->u32[j];
-
-            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
-                                     ((e >> 6) & 0x3e0) |
-                                     ((e >> 3) & 0x1f));
-        }
-    }
-    *r = result;
-}
 
 #define VPK(suffix, from, to, cvt, dosat)                               \
     void helper_vpk##suffix(CPUPPCState *env, ppc_avr_t *r,             \
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 0d71c10..456666a 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -571,6 +571,103 @@ static void trans_lvsr(DisasContext *ctx)
 }
 
 /*
+ * vpkpx VRT,VRA,VRB - Vector Pack Pixel
+ *
+ * Rearranges 8 pixels coded in 6-5-5 pattern (4 from each source register)
+ * into contigous array of bits in the destination register.
+ */
+static void trans_vpkpx(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    int64_t mask1 = 0x1fULL;
+    int64_t mask2 = 0x1fULL << 5;
+    int64_t mask3 = 0x3fULL << 10;
+    int i, j;
+    /*
+     * In each iteration do the 6-5-5 pack for 2 pixels of each doubleword
+     * element of each source register.
+     */
+    for (i = 0; i < 4; i++) {
+        switch (i) {
+        case 0:
+            /*
+             * Get high doubleword of vA to perfrom 6-5-5 pack of pixels
+             * 1 and 2.
+             */
+            get_avr64(avr, VA, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 1:
+            /*
+             * Get low doubleword of vA to perfrom 6-5-5 pack of pixels
+             * 3 and 4.
+             */
+            get_avr64(avr, VA, false);
+            break;
+        case 2:
+            /*
+             * Get high doubleword of vB to perfrom 6-5-5 pack of pixels
+             * 5 and 6.
+             */
+            get_avr64(avr, VB, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 3:
+            /*
+             * Get low doubleword of vB to perfrom 6-5-5 pack of pixels
+             * 7 and 8.
+             */
+            get_avr64(avr, VB, false);
+            break;
+        }
+        /* Perform the packing for 2 pixels(each iteration for 1). */
+        tcg_gen_movi_i64(tmp, 0x0ULL);
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
+            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
+            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
+            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+        }
+        if ((i == 0) || (i == 2)) {
+            tcg_gen_shli_i64(tmp, tmp, 32);
+        }
+        tcg_gen_or_i64(result, result, tmp);
+        if (i == 1) {
+            /* Place packed pixels 1:4 to high doubleword of vD. */
+            tcg_gen_mov_i64(result1, result);
+        }
+        if (i == 3) {
+            /* Place packed pixels 5:8 to low doubleword of vD. */
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+    set_avr64(VT, result1, true);
+    set_avr64(VT, result2, false);
+
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+}
+
+/*
  * vsl VRT,VRA,VRB - Vector Shift Left
  *
  * Shifting left 128 bit value of vA by value specified in bits 125-127 of vB.
@@ -923,7 +1020,7 @@ GEN_VXFORM_ENV(vpksdus, 7, 21);
 GEN_VXFORM_ENV(vpkshss, 7, 6);
 GEN_VXFORM_ENV(vpkswss, 7, 7);
 GEN_VXFORM_ENV(vpksdss, 7, 23);
-GEN_VXFORM(vpkpx, 7, 12);
+GEN_VXFORM_TRANS(vpkpx, 7, 12);
 GEN_VXFORM_ENV(vsum4ubs, 4, 24);
 GEN_VXFORM_ENV(vsum4sbs, 4, 28);
 GEN_VXFORM_ENV(vsum4shs, 4, 25);
-- 
2.7.4




* [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-08-27  9:37 [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
@ 2019-08-27  9:37 ` Stefan Brankovic
  2019-08-27 19:14   ` Richard Henderson
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
  2 siblings, 1 reply; 12+ messages in thread
From: Stefan Brankovic @ 2019-08-27  9:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: richard.henderson, david

Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
This instruction counts the number of leading zeros of each halfword element
in the source register and places the result in the appropriate halfword
element of the destination register.

In each iteration of the outer for loop, the count operation is performed on
one doubleword element of source register vB. In the first iteration, the
higher doubleword element of vB is placed in the variable avr, and then the
count for every halfword element is performed using tcg_gen_clzi_i64.
Since tcg_gen_clzi_i64 counts leading zeros over a 64-bit length, the ith
halfword element has to be moved to the highest 16 bits of tmp and or-ed
with a mask (in order to get all ones in the lowest 48 bits); then
tcg_gen_clzi_i64 is performed and its result is moved into the appropriate
halfword element of the result. This is done in the inner for loop. After
the operation is finished, the result is saved in the appropriate doubleword
element of destination register vD. The same sequence of operations is then
applied to the lower doubleword element of vB.

Optimize Altivec instruction vclzb (Vector Count Leading Zeros Byte).
This instruction counts the number of leading zeros of each byte element
in the source register and places the result in the appropriate byte element
of the destination register.

In each iteration of the outer for loop, the count operation is performed on
one doubleword element of source register vB. In the first iteration, the
higher doubleword element of vB is placed in the variable avr, and then the
count for every byte element is performed using tcg_gen_clzi_i64. Since
tcg_gen_clzi_i64 counts leading zeros over a 64-bit length, the ith byte
element has to be moved to the highest 8 bits of the variable tmp and or-ed
with a mask (in order to get all ones in the lowest 56 bits); then
tcg_gen_clzi_i64 is performed and its result is moved into the appropriate
byte element of the result. This is done in the inner for loop. After the
operation is finished, the result is saved in the appropriate doubleword
element of destination register vD. The same sequence of operations is then
applied to the lower doubleword element of vB.
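
As a scalar illustration of the same trick (clz_byte is a hypothetical
helper, not part of the patch; clz64 is the 64-bit count-leading-zeros from
qemu/host-utils.h):

    /*
     * Shift the byte element into the top 8 bits and force the remaining
     * 56 bits to ones, so that a 64-bit clz yields the per-byte count in
     * the range 0..8 (8 for a zero byte).
     */
    static inline int clz_byte(uint8_t b)
    {
        uint64_t tmp = ((uint64_t)b << 56) | 0x00ffffffffffffffULL;
        return clz64(tmp);
    }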

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |   2 -
 target/ppc/int_helper.c             |   9 ---
 target/ppc/translate/vmx-impl.inc.c | 136 +++++++++++++++++++++++++++++++++++-
 3 files changed, 134 insertions(+), 13 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 940a115..39c202f 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -307,8 +307,6 @@ DEF_HELPER_4(vcfsx, void, env, avr, avr, i32)
 DEF_HELPER_4(vctuxs, void, env, avr, avr, i32)
 DEF_HELPER_4(vctsxs, void, env, avr, avr, i32)
 
-DEF_HELPER_2(vclzb, void, avr, avr)
-DEF_HELPER_2(vclzh, void, avr, avr)
 DEF_HELPER_2(vctzb, void, avr, avr)
 DEF_HELPER_2(vctzh, void, avr, avr)
 DEF_HELPER_2(vctzw, void, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 9ff3b03..65a9387 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1796,15 +1796,6 @@ VUPK(lsw, s64, s32, UPKLO)
         }                                                               \
     }
 
-#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
-#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
-
-VGENERIC_DO(clzb, u8)
-VGENERIC_DO(clzh, u16)
-
-#undef clzb
-#undef clzh
-
 #define ctzb(v) ((v) ? ctz32(v) : 8)
 #define ctzh(v) ((v) ? ctz32(v) : 16)
 #define ctzw(v) ctz32((v))
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 456666a..e8a0fb6 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -840,6 +840,138 @@ static void trans_vgbbd(DisasContext *ctx)
 }
 
 /*
+ * vclzb VRT,VRB - Vector Count Leading Zeros Byte
+ *
+ * Counting the number of leading zero bits of each byte element in source
+ * register and placing result in appropriate byte element of destination
+ * register.
+ */
+static void trans_vclzb(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every byte element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros on 64 bit lenght, we have to move
+         * ith byte element to highest 8 bits of tmp, or it with mask(so we get
+         * all ones in lowest 56 bits), then perform tcg_gen_clzi_i64 and move
+         * it's result in appropriate byte element of result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 56);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 7; j++) {
+            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            tcg_gen_mov_i64(result1, result);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+
+    set_avr64(VT, result1, true);
+    set_avr64(VT, result2, false);
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
+ * vclzh VRT,VRB - Vector Count Leading Zeros Halfword
+ *
+ * Counting the number of leading zero bits of each halfword element in source
+ * register and placing result in appropriate halfword element of destination
+ * register.
+ */
+static void trans_vclzh(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword element of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword element of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every halfword element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros on 64 bit lenght, we have to move
+         * ith byte element to highest 16 bits of tmp, or it with mask(so we get
+         * all ones in lowest 48 bits), then perform tcg_gen_clzi_i64 and move
+         * it's result in appropriate halfword element of result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 48);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 3; j++) {
+            tcg_gen_shli_i64(tmp, avr, (3 - j) * 16);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 16, 16);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 48, 16);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            tcg_gen_mov_i64(result1, result);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+
+    set_avr64(VT, result1, true);
+    set_avr64(VT, result2, false);
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
  * vclzw VRT,VRB - Vector Count Leading Zeros Word
  *
  * Counting the number of leading zero bits of each word element in source
@@ -1404,8 +1536,8 @@ GEN_VAFORM_PAIRED(vmsumshm, vmsumshs, 20)
 GEN_VAFORM_PAIRED(vsel, vperm, 21)
 GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 
-GEN_VXFORM_NOA(vclzb, 1, 28)
-GEN_VXFORM_NOA(vclzh, 1, 29)
+GEN_VXFORM_TRANS(vclzb, 1, 28)
+GEN_VXFORM_TRANS(vclzh, 1, 29)
 GEN_VXFORM_TRANS(vclzw, 1, 30)
 GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
-- 
2.7.4




* [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-08-27  9:37 [Qemu-devel] [PATCH v6 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
@ 2019-08-27  9:37 ` Stefan Brankovic
  2019-08-27 19:19   ` Richard Henderson
  2 siblings, 1 reply; 12+ messages in thread
From: Stefan Brankovic @ 2019-08-27  9:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: richard.henderson, david

Since I found these two instructions already implemented with TCG, I
refactored them so that they are consistent with the other similar
implementations introduced in this series.

Also, a new dual macro, GEN_VXFORM_TRANS_DUAL, is added. This macro is
used when one instruction is realized with direct translation and the
second one with a helper.
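
For example, in the hunk below the macro pairs the TCG-translated vmrgow
with the helper-based vextuwlx:

    GEN_VXFORM_TRANS_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
                    vextuwlx, PPC_NONE, PPC2_ISA300)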

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index e8a0fb6..6af9c73 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -350,6 +350,28 @@ static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
     }                                                                  \
 }
 
+/*
+ * We use this macro if one instruction is realized with direct
+ * translation, and second one with helper.
+ */
+#define GEN_VXFORM_TRANS_DUAL(name0, flg0, flg2_0, name1, flg1, flg2_1)\
+static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
+{                                                                      \
+    if ((Rc(ctx->opcode) == 0) &&                                      \
+        ((ctx->insns_flags & flg0) || (ctx->insns_flags2 & flg2_0))) { \
+        if (unlikely(!ctx->altivec_enabled)) {                         \
+            gen_exception(ctx, POWERPC_EXCP_VPU);                      \
+            return;                                                    \
+        }                                                              \
+        trans_##name0(ctx);                                            \
+    } else if ((Rc(ctx->opcode) == 1) &&                               \
+        ((ctx->insns_flags & flg1) || (ctx->insns_flags2 & flg2_1))) { \
+        gen_##name1(ctx);                                              \
+    } else {                                                           \
+        gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);            \
+    }                                                                  \
+}
+
 /* Adds support to provide invalid mask */
 #define GEN_VXFORM_DUAL_EXT(name0, flg0, flg2_0, inval0,                \
                             name1, flg1, flg2_1, inval1)                \
@@ -431,20 +453,13 @@ GEN_VXFORM(vmrglb, 6, 4);
 GEN_VXFORM(vmrglh, 6, 5);
 GEN_VXFORM(vmrglw, 6, 6);
 
-static void gen_vmrgew(DisasContext *ctx)
+static void trans_vmrgew(DisasContext *ctx)
 {
-    TCGv_i64 tmp;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    tmp = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(avr, VB, true);
     tcg_gen_shri_i64(tmp, avr, 32);
@@ -462,21 +477,14 @@ static void gen_vmrgew(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
-static void gen_vmrgow(DisasContext *ctx)
+static void trans_vmrgow(DisasContext *ctx)
 {
-    TCGv_i64 t0, t1;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    t0 = tcg_temp_new_i64();
-    t1 = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(t0, VB, true);
     get_avr64(t1, VA, true);
@@ -1165,14 +1173,14 @@ GEN_VXFORM_ENV(vminfp, 5, 17);
 GEN_VXFORM_HETRO(vextublx, 6, 24)
 GEN_VXFORM_HETRO(vextuhlx, 6, 25)
 GEN_VXFORM_HETRO(vextuwlx, 6, 26)
-GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
+GEN_VXFORM_TRANS_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwlx, PPC_NONE, PPC2_ISA300)
 GEN_VXFORM_HETRO(vextubrx, 6, 28)
 GEN_VXFORM_HETRO(vextuhrx, 6, 29)
 GEN_VXFORM_HETRO(vextuwrx, 6, 30)
 GEN_VXFORM_TRANS(lvsl, 6, 31)
 GEN_VXFORM_TRANS(lvsr, 6, 32)
-GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \
+GEN_VXFORM_TRANS_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwrx, PPC_NONE, PPC2_ISA300)
 
 #define GEN_VXRFORM1(opname, name, str, opc2, opc3)                     \
-- 
2.7.4




* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
@ 2019-08-27 18:52   ` Richard Henderson
  2019-08-27 19:04     ` BALATON Zoltan
  2019-08-29 13:34     ` Stefan Brankovic
  0 siblings, 2 replies; 12+ messages in thread
From: Richard Henderson @ 2019-08-27 18:52 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 8/27/19 2:37 AM, Stefan Brankovic wrote:
> +    for (i = 0; i < 4; i++) {
> +        switch (i) {
> +        case 0:
> +            /*
> +             * Get high doubleword of vA to perfrom 6-5-5 pack of pixels
> +             * 1 and 2.
> +             */
> +            get_avr64(avr, VA, true);
> +            tcg_gen_movi_i64(result, 0x0ULL);
> +            break;
> +        case 1:
> +            /*
> +             * Get low doubleword of vA to perfrom 6-5-5 pack of pixels
> +             * 3 and 4.
> +             */
> +            get_avr64(avr, VA, false);
> +            break;
> +        case 2:
> +            /*
> +             * Get high doubleword of vB to perfrom 6-5-5 pack of pixels
> +             * 5 and 6.
> +             */
> +            get_avr64(avr, VB, true);
> +            tcg_gen_movi_i64(result, 0x0ULL);
> +            break;
> +        case 3:
> +            /*
> +             * Get low doubleword of vB to perfrom 6-5-5 pack of pixels
> +             * 7 and 8.
> +             */
> +            get_avr64(avr, VB, false);
> +            break;
> +        }
> +        /* Perform the packing for 2 pixels(each iteration for 1). */
> +        tcg_gen_movi_i64(tmp, 0x0ULL);
> +        for (j = 0; j < 2; j++) {
> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
> +            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
> +            tcg_gen_or_i64(tmp, tmp, shifted);
> +
> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
> +            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
> +            tcg_gen_or_i64(tmp, tmp, shifted);
> +
> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
> +            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
> +            tcg_gen_or_i64(tmp, tmp, shifted);
> +        }
> +        if ((i == 0) || (i == 2)) {
> +            tcg_gen_shli_i64(tmp, tmp, 32);
> +        }
> +        tcg_gen_or_i64(result, result, tmp);
> +        if (i == 1) {
> +            /* Place packed pixels 1:4 to high doubleword of vD. */
> +            tcg_gen_mov_i64(result1, result);
> +        }
> +        if (i == 3) {
> +            /* Place packed pixels 5:8 to low doubleword of vD. */
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }
> +    set_avr64(VT, result1, true);
> +    set_avr64(VT, result2, false);

I really have a hard time believing that it is worthwhile to inline all of this
code.  By my count this is 82 non-move opcodes.  That is a *lot* of inline
expansion.

However, I can well imagine that the existing out-of-line helper is less than
optimal.

> -void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
> -{
> -    int i, j;
> -    ppc_avr_t result;
> -#if defined(HOST_WORDS_BIGENDIAN)
> -    const ppc_avr_t *x[2] = { a, b };
> -#else
> -    const ppc_avr_t *x[2] = { b, a };
> -#endif
> -
> -    VECTOR_FOR_INORDER_I(i, u64) {
> -        VECTOR_FOR_INORDER_I(j, u32) {
> -            uint32_t e = x[i]->u32[j];

Double indirect loads?

> -
> -            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
> -                                     ((e >> 6) & 0x3e0) |
> -                                     ((e >> 3) & 0x1f));

Store to temporary ...

> -        }
> -    }
> -    *r = result;

... and then copy?

Try replacing the existing helper with something like the following.


r~



static inline uint64_t pkpx_1(uint64_t a, int shr, int shl)
{
    uint64_t r;

    r  = ((a >> (shr + 9)) & 0x3f) << shl;
    r |= ((a >> (shr + 6)) & 0x1f) << shl;
    r |= ((a >> (shr + 3)) & 0x1f) << shl;

    return r;
}

static inline uint64_t pkpx_2(uint64_t ah, uint64_t al)
{
    return pkpx_1(ah, 32, 48)
         | pkpx_1(ah,  0, 32)
         | pkpx_1(al, 32, 16)
         | pkpx_1(al,  0,  0);
}

void helper_vpkpx(uint64_t *r, uint64_t *a, uint64_t *b)
{
    uint64_t rh = pkpx_2(a->VsrD(0), a->VsrD(1));
    uint64_t rl = pkpx_2(b->VsrD(0), b->VsrD(1));
    r->VsrD(0) = rh;
    r->VsrD(1) = rl;
}



* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-08-27 18:52   ` Richard Henderson
@ 2019-08-27 19:04     ` BALATON Zoltan
  2019-08-29 13:34     ` Stefan Brankovic
  1 sibling, 0 replies; 12+ messages in thread
From: BALATON Zoltan @ 2019-08-27 19:04 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Stefan Brankovic, qemu-devel, david

On Tue, 27 Aug 2019, Richard Henderson wrote:
> On 8/27/19 2:37 AM, Stefan Brankovic wrote:
>> +    for (i = 0; i < 4; i++) {
>> +        switch (i) {
>> +        case 0:
>> +            /*
>> +             * Get high doubleword of vA to perfrom 6-5-5 pack of pixels
>> +             * 1 and 2.
>> +             */
>> +            get_avr64(avr, VA, true);
>> +            tcg_gen_movi_i64(result, 0x0ULL);
>> +            break;
>> +        case 1:
>> +            /*
>> +             * Get low doubleword of vA to perfrom 6-5-5 pack of pixels
>> +             * 3 and 4.
>> +             */
>> +            get_avr64(avr, VA, false);
>> +            break;
>> +        case 2:
>> +            /*
>> +             * Get high doubleword of vB to perfrom 6-5-5 pack of pixels
>> +             * 5 and 6.
>> +             */
>> +            get_avr64(avr, VB, true);
>> +            tcg_gen_movi_i64(result, 0x0ULL);
>> +            break;
>> +        case 3:
>> +            /*
>> +             * Get low doubleword of vB to perfrom 6-5-5 pack of pixels

If this is replaced by Richard's suggested version it does not matter, but
there's a typo in the above comments. You probably meant perfrom -> perform.

Regards,
BALATON Zoltan



* Re: [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 2/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
@ 2019-08-27 19:14   ` Richard Henderson
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Henderson @ 2019-08-27 19:14 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 8/27/19 2:37 AM, Stefan Brankovic wrote:
> +    for (i = 0; i < 2; i++) {
> +        if (i == 0) {
> +            /* Get high doubleword of vB in avr. */
> +            get_avr64(avr, VB, true);
> +        } else {
> +            /* Get low doubleword of vB in avr. */
> +            get_avr64(avr, VB, false);
> +        }
> +        /*
> +         * Perform count for every byte element using tcg_gen_clzi_i64.
> +         * Since it counts leading zeros on 64 bit lenght, we have to move
> +         * ith byte element to highest 8 bits of tmp, or it with mask(so we get
> +         * all ones in lowest 56 bits), then perform tcg_gen_clzi_i64 and move
> +         * it's result in appropriate byte element of result.
> +         */
> +        tcg_gen_shli_i64(tmp, avr, 56);
> +        tcg_gen_or_i64(tmp, tmp, mask);
> +        tcg_gen_clzi_i64(result, tmp, 64);
> +        for (j = 1; j < 7; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
> +            tcg_gen_or_i64(tmp, tmp, mask);
> +            tcg_gen_clzi_i64(tmp, tmp, 64);
> +            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
> +        }
> +        tcg_gen_or_i64(tmp, avr, mask);
> +        tcg_gen_clzi_i64(tmp, tmp, 64);
> +        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
> +        if (i == 0) {
> +            /* Place result in high doubleword element of vD. */
> +            tcg_gen_mov_i64(result1, result);
> +        } else {
> +            /* Place result in low doubleword element of vD. */
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }

By my count, 60 non-move operations.  This is too many to inline.

Moreover, unlike vpkpx, which I can see being used for graphics
format conversion in old operating systems (who else uses 16-bit
graphics formats now?), I would be very surprised to see vclzb
or vclzh being used frequently.

How did you determine that these instructions needed optimization?

I can see wanting to apply

--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1817,8 +1817,8 @@ VUPK(lsw, s64, s32, UPKLO)
         }                                                               \
     }

-#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
-#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
+#define clzb(v) clz32(((uint32_t)(v) << 24) | 0x00ffffffu)
+#define clzh(v) clz32(((uint32_t)(v) << 16) | 0x0000ffffu)

 VGENERIC_DO(clzb, u8)
 VGENERIC_DO(clzh, u16)

as the cmov instruction required by the current implementation is going to be
quite a bit slower than the OR instruction.

And similarly for ctzb() and ctzh().
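
(For illustration only, not part of the diff above: the analogous OR-based
change for the trailing-zero macros could look something like

    #define ctzb(v) ctz32((uint32_t)(v) | 0x100)
    #define ctzh(v) ctz32((uint32_t)(v) | 0x10000)

so that a zero element yields 8 or 16 without a conditional.)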


r~



* Re: [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-08-27  9:37 ` [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
@ 2019-08-27 19:19   ` Richard Henderson
  2019-08-28  0:42     ` David Gibson
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2019-08-27 19:19 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 8/27/19 2:37 AM, Stefan Brankovic wrote:
> Since I found these two instructions already implemented with TCG, I
> refactored them so that they are consistent with the other similar
> implementations introduced in this series.
> 
> Also, a new dual macro, GEN_VXFORM_TRANS_DUAL, is added. This macro is
> used when one instruction is realized with direct translation and the
> second one with a helper.
> 
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++----------------
>  1 file changed, 37 insertions(+), 29 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



* Re: [Qemu-devel] [PATCH v6 3/3] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-08-27 19:19   ` Richard Henderson
@ 2019-08-28  0:42     ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2019-08-28  0:42 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Stefan Brankovic, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1020 bytes --]

On Tue, Aug 27, 2019 at 12:19:27PM -0700, Richard Henderson wrote:
> On 8/27/19 2:37 AM, Stefan Brankovic wrote:
> > Since I found these two instructions already implemented with TCG, I
> > refactored them so that they are consistent with the other similar
> > implementations introduced in this series.
> > 
> > Also, a new dual macro, GEN_VXFORM_TRANS_DUAL, is added. This macro is
> > used when one instruction is realized with direct translation and the
> > second one with a helper.
> > 
> > Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> > ---
> >  target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++----------------
> >  1 file changed, 37 insertions(+), 29 deletions(-)
> 
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

Applied to ppc-for-4.2, thanks.

> 
> 
> r~
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-08-27 18:52   ` Richard Henderson
  2019-08-27 19:04     ` BALATON Zoltan
@ 2019-08-29 13:34     ` Stefan Brankovic
  2019-08-29 15:31       ` Richard Henderson
  1 sibling, 1 reply; 12+ messages in thread
From: Stefan Brankovic @ 2019-08-29 13:34 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david

[-- Attachment #1: Type: text/plain, Size: 5764 bytes --]


On 27.8.19. 20:52, Richard Henderson wrote:
> On 8/27/19 2:37 AM, Stefan Brankovic wrote:
>> +    for (i = 0; i < 4; i++) {
>> +        switch (i) {
>> +        case 0:
>> +            /*
>> +             * Get high doubleword of vA to perfrom 6-5-5 pack of pixels
>> +             * 1 and 2.
>> +             */
>> +            get_avr64(avr, VA, true);
>> +            tcg_gen_movi_i64(result, 0x0ULL);
>> +            break;
>> +        case 1:
>> +            /*
>> +             * Get low doubleword of vA to perfrom 6-5-5 pack of pixels
>> +             * 3 and 4.
>> +             */
>> +            get_avr64(avr, VA, false);
>> +            break;
>> +        case 2:
>> +            /*
>> +             * Get high doubleword of vB to perfrom 6-5-5 pack of pixels
>> +             * 5 and 6.
>> +             */
>> +            get_avr64(avr, VB, true);
>> +            tcg_gen_movi_i64(result, 0x0ULL);
>> +            break;
>> +        case 3:
>> +            /*
>> +             * Get low doubleword of vB to perfrom 6-5-5 pack of pixels
>> +             * 7 and 8.
>> +             */
>> +            get_avr64(avr, VB, false);
>> +            break;
>> +        }
>> +        /* Perform the packing for 2 pixels(each iteration for 1). */
>> +        tcg_gen_movi_i64(tmp, 0x0ULL);
>> +        for (j = 0; j < 2; j++) {
>> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
>> +            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
>> +            tcg_gen_or_i64(tmp, tmp, shifted);
>> +
>> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
>> +            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
>> +            tcg_gen_or_i64(tmp, tmp, shifted);
>> +
>> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
>> +            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
>> +            tcg_gen_or_i64(tmp, tmp, shifted);
>> +        }
>> +        if ((i == 0) || (i == 2)) {
>> +            tcg_gen_shli_i64(tmp, tmp, 32);
>> +        }
>> +        tcg_gen_or_i64(result, result, tmp);
>> +        if (i == 1) {
>> +            /* Place packed pixels 1:4 to high doubleword of vD. */
>> +            tcg_gen_mov_i64(result1, result);
>> +        }
>> +        if (i == 3) {
>> +            /* Place packed pixels 5:8 to low doubleword of vD. */
>> +            tcg_gen_mov_i64(result2, result);
>> +        }
>> +    }
>> +    set_avr64(VT, result1, true);
>> +    set_avr64(VT, result2, false);
> I really have a hard time believing that it is worthwhile to inline all of this
> code.  By my count this is 82 non-move opcodes.  That is a *lot* of inline
> expansion.
>
> However, I can well imagine that the existing out-of-line helper is less than
> optimal.
>
>> -void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
>> -{
>> -    int i, j;
>> -    ppc_avr_t result;
>> -#if defined(HOST_WORDS_BIGENDIAN)
>> -    const ppc_avr_t *x[2] = { a, b };
>> -#else
>> -    const ppc_avr_t *x[2] = { b, a };
>> -#endif
>> -
>> -    VECTOR_FOR_INORDER_I(i, u64) {
>> -        VECTOR_FOR_INORDER_I(j, u32) {
>> -            uint32_t e = x[i]->u32[j];
> Double indirect loads?
>
>> -
>> -            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
>> -                                     ((e >> 6) & 0x3e0) |
>> -                                     ((e >> 3) & 0x1f));
> Store to temporary ...
>
>> -        }
>> -    }
>> -    *r = result;
> ... and then copy?
>
> Try replacing the existing helper with something like the following.
>
>
> r~
>
>
>
> static inline uint64_t pkpx_1(uint64_t a, int shr, int shl)
> {
>      uint64_t r;
>
>      r  = ((a >> (shr + 9)) & 0x3f) << shl;
>      r |= ((a >> (shr + 6)) & 0x1f) << shl;
>      r |= ((a >> (shr + 3)) & 0x1f) << shl;
>
>      return r;
> }
>
> static inline uint64_t pkpx_2(uint64_t ah, uint64_t al)
> {
>      return pkpx_1(ah, 32, 48)
>           | pkpx_1(ah,  0, 32)
>           | pkpx_1(al, 32, 16)
>           | pkpx_1(al,  0,  0);
> }
>
> void helper_vpkpx(uint64_t *r, uint64_t *a, uint64_t *b)
> {
>      uint64_t rh = pkpx_2(a->VsrD(0), a->VsrD(1));
>      uint64_t rl = pkpx_2(b->VsrD(0), b->VsrD(1));
>      r->VsrD(0) = rh;
>      r->VsrD(1) = rl;
> }

I implemented vpkpx as you suggested above, with small modifications (so
it builds and gives the correct result). It looks like this:

static inline uint64_t pkpx_1(uint64_t a, int shr, int shl)
{
     uint64_t r;

     r  = ((a >> (shr + 9)) & 0xfc00) << shl;
     r |= ((a >> (shr + 6)) & 0x3e0) << shl;
     r |= ((a >> (shr + 3)) & 0x1f) << shl;

     return r;
}

static inline uint64_t pkpx_2(uint64_t ah, uint64_t al)
{
     return pkpx_1(ah, 32, 48)
          | pkpx_1(ah,  0, 32)
          | pkpx_1(al, 32, 16)
          | pkpx_1(al,  0,  0);
}

void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
     uint64_t rh = pkpx_2(a->u64[1], a->u64[0]);
     uint64_t rl = pkpx_2(b->u64[1], b->u64[0]);
     r->u64[1] = rh;
     r->u64[0] = rl;
}

I also noticed that this works only on little-endian hosts, so we would need
to modify it in order to support big-endian hosts (this shouldn't affect the
performance results).
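
A host-endianness-independent variant would presumably go back to the VsrD()
accessors from your original suggestion, along these lines (just a sketch,
untested):

void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    uint64_t rh = pkpx_2(a->VsrD(0), a->VsrD(1));
    uint64_t rl = pkpx_2(b->VsrD(0), b->VsrD(1));

    r->VsrD(0) = rh;
    r->VsrD(1) = rl;
}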

Then I ran my performance tests and got the following results (the test calls
vpkpx 100000 times):

1) Current helper implementation: ~157 ms

2) Helper implementation you suggested: ~94 ms

3) TCG implementation: ~75 ms

The attached file contains assembly code for both the current implementation
and the implementation you suggested, so please take a look at that as well.

Kind Regards,

Stefan


[-- Attachment #2: vpkpx_assembly.txt --]
[-- Type: text/plain, Size: 23654 bytes --]

Current vpkpx implementation:

1) Both C and assembly code:

Dump of assembler code for function helper_vpkpx:
1267	{
   0x0000000000195fe0 <+0>:	48 83 ec 38	sub    $0x38,%rsp

1268	    int i, j;
1269	    ppc_avr_t result;
1270	#if defined(HOST_WORDS_BIGENDIAN)
1271	    const ppc_avr_t *x[2] = { a, b };
1272	#else
1273	    const ppc_avr_t *x[2] = { b, a };
   0x0000000000195fe4 <+4>:	b9 07 00 00 00	mov    $0x7,%ecx

1267	{
   0x0000000000195fe9 <+9>:	64 48 8b 04 25 28 00 00 00	mov    %fs:0x28,%rax
   0x0000000000195ff2 <+18>:	48 89 44 24 28	mov    %rax,0x28(%rsp)
   0x0000000000195ff7 <+23>:	31 c0	xor    %eax,%eax
   0x0000000000195ff9 <+25>:	4c 8d 4c 24 10	lea    0x10(%rsp),%r9

1268	    int i, j;
1269	    ppc_avr_t result;
1270	#if defined(HOST_WORDS_BIGENDIAN)
1271	    const ppc_avr_t *x[2] = { a, b };
1272	#else
1273	    const ppc_avr_t *x[2] = { b, a };
   0x0000000000195ffe <+30>:	48 89 54 24 10	mov    %rdx,0x10(%rsp)
   0x0000000000196003 <+35>:	48 89 74 24 18	mov    %rsi,0x18(%rsp)
   0x0000000000196008 <+40>:	44 8d 51 fc	lea    -0x4(%rcx),%r10d
   0x000000000019600c <+44>:	48 83 c6 0c	add    $0xc,%rsi

1278	            uint32_t e = x[i]->u32[j];
   0x0000000000196010 <+48>:	8b 06	mov    (%rsi),%eax

1279	
1280	            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
   0x0000000000196012 <+50>:	4c 63 d9	movslq %ecx,%r11
   0x0000000000196015 <+53>:	83 e9 01	sub    $0x1,%ecx
   0x0000000000196018 <+56>:	48 83 ee 04	sub    $0x4,%rsi
   0x000000000019601c <+60>:	89 c2	mov    %eax,%edx
   0x000000000019601e <+62>:	c1 ea 09	shr    $0x9,%edx
   0x0000000000196021 <+65>:	41 89 d0	mov    %edx,%r8d
   0x0000000000196024 <+68>:	89 c2	mov    %eax,%edx
   0x0000000000196026 <+70>:	c1 e8 03	shr    $0x3,%eax
   0x0000000000196029 <+73>:	c1 ea 06	shr    $0x6,%edx
   0x000000000019602c <+76>:	66 41 81 e0 00 fc	and    $0xfc00,%r8w
   0x0000000000196032 <+82>:	83 e0 1f	and    $0x1f,%eax
   0x0000000000196035 <+85>:	66 81 e2 e0 03	and    $0x3e0,%dx
   0x000000000019603a <+90>:	44 09 c2	or     %r8d,%edx
   0x000000000019603d <+93>:	09 d0	or     %edx,%eax

1277	        VECTOR_FOR_INORDER_I(j, u32) {
   0x000000000019603f <+95>:	41 39 ca	cmp    %ecx,%r10d

1279	
1280	            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
   0x0000000000196042 <+98>:	66 42 89 04 5c	mov    %ax,(%rsp,%r11,2)

1277	        VECTOR_FOR_INORDER_I(j, u32) {
   0x0000000000196047 <+103>:	75 c7	jne    0x196010 <helper_vpkpx+48>

1276	    VECTOR_FOR_INORDER_I(i, u64) {
   0x0000000000196049 <+105>:	41 83 fa ff	cmp    $0xffffffff,%r10d
   0x000000000019604d <+109>:	44 89 d1	mov    %r10d,%ecx
   0x0000000000196050 <+112>:	74 0e	je     0x196060 <helper_vpkpx+128>
   0x0000000000196052 <+114>:	49 8b 31	mov    (%r9),%rsi
   0x0000000000196055 <+117>:	49 83 e9 08	sub    $0x8,%r9
   0x0000000000196059 <+121>:	eb ad	jmp    0x196008 <helper_vpkpx+40>
   0x000000000019605b <+123>:	0f 1f 44 00 00	nopl   0x0(%rax,%rax,1)

1281	                                     ((e >> 6) & 0x3e0) |
1282	                                     ((e >> 3) & 0x1f));
1283	//            printf("%x\n",result.u16[4 * i + j]);
1284	        }
1285	    }
1286	//    printf("%lx\n",result.u64[0]);
1287	//    printf("%lx\n",result.u64[1]);
1288	    *r = result;
   0x0000000000196060 <+128>:	48 8b 04 24	mov    (%rsp),%rax
   0x0000000000196064 <+132>:	48 8b 54 24 08	mov    0x8(%rsp),%rdx
   0x0000000000196069 <+137>:	48 89 07	mov    %rax,(%rdi)
   0x000000000019606c <+140>:	48 89 57 08	mov    %rdx,0x8(%rdi)

1289	}
   0x0000000000196070 <+144>:	48 8b 44 24 28	mov    0x28(%rsp),%rax
   0x0000000000196075 <+149>:	64 48 33 04 25 28 00 00 00	xor    %fs:0x28,%rax
   0x000000000019607e <+158>:	75 05	jne    0x196085 <helper_vpkpx+165>
   0x0000000000196080 <+160>:	48 83 c4 38	add    $0x38,%rsp
   0x0000000000196084 <+164>:	c3	retq   
   0x0000000000196085 <+165>:	e8 2e 66 f0 ff	callq  0x9c6b8
End of assembler dump.

2) Only assembly code:

Dump of assembler code for function helper_vpkpx:
   0x0000000000195fe0 <+0>:	48 83 ec 38	sub    $0x38,%rsp
   0x0000000000195fe4 <+4>:	b9 07 00 00 00	mov    $0x7,%ecx
   0x0000000000195fe9 <+9>:	64 48 8b 04 25 28 00 00 00	mov    %fs:0x28,%rax
   0x0000000000195ff2 <+18>:	48 89 44 24 28	mov    %rax,0x28(%rsp)
   0x0000000000195ff7 <+23>:	31 c0	xor    %eax,%eax
   0x0000000000195ff9 <+25>:	4c 8d 4c 24 10	lea    0x10(%rsp),%r9
   0x0000000000195ffe <+30>:	48 89 54 24 10	mov    %rdx,0x10(%rsp)
   0x0000000000196003 <+35>:	48 89 74 24 18	mov    %rsi,0x18(%rsp)
   0x0000000000196008 <+40>:	44 8d 51 fc	lea    -0x4(%rcx),%r10d
   0x000000000019600c <+44>:	48 83 c6 0c	add    $0xc,%rsi
   0x0000000000196010 <+48>:	8b 06	mov    (%rsi),%eax
   0x0000000000196012 <+50>:	4c 63 d9	movslq %ecx,%r11
   0x0000000000196015 <+53>:	83 e9 01	sub    $0x1,%ecx
   0x0000000000196018 <+56>:	48 83 ee 04	sub    $0x4,%rsi
   0x000000000019601c <+60>:	89 c2	mov    %eax,%edx
   0x000000000019601e <+62>:	c1 ea 09	shr    $0x9,%edx
   0x0000000000196021 <+65>:	41 89 d0	mov    %edx,%r8d
   0x0000000000196024 <+68>:	89 c2	mov    %eax,%edx
   0x0000000000196026 <+70>:	c1 e8 03	shr    $0x3,%eax
   0x0000000000196029 <+73>:	c1 ea 06	shr    $0x6,%edx
   0x000000000019602c <+76>:	66 41 81 e0 00 fc	and    $0xfc00,%r8w
   0x0000000000196032 <+82>:	83 e0 1f	and    $0x1f,%eax
   0x0000000000196035 <+85>:	66 81 e2 e0 03	and    $0x3e0,%dx
   0x000000000019603a <+90>:	44 09 c2	or     %r8d,%edx
   0x000000000019603d <+93>:	09 d0	or     %edx,%eax
   0x000000000019603f <+95>:	41 39 ca	cmp    %ecx,%r10d
   0x0000000000196042 <+98>:	66 42 89 04 5c	mov    %ax,(%rsp,%r11,2)
   0x0000000000196047 <+103>:	75 c7	jne    0x196010 <helper_vpkpx+48>
   0x0000000000196049 <+105>:	41 83 fa ff	cmp    $0xffffffff,%r10d
   0x000000000019604d <+109>:	44 89 d1	mov    %r10d,%ecx
   0x0000000000196050 <+112>:	74 0e	je     0x196060 <helper_vpkpx+128>
   0x0000000000196052 <+114>:	49 8b 31	mov    (%r9),%rsi
   0x0000000000196055 <+117>:	49 83 e9 08	sub    $0x8,%r9
   0x0000000000196059 <+121>:	eb ad	jmp    0x196008 <helper_vpkpx+40>
   0x000000000019605b <+123>:	0f 1f 44 00 00	nopl   0x0(%rax,%rax,1)
   0x0000000000196060 <+128>:	48 8b 04 24	mov    (%rsp),%rax
   0x0000000000196064 <+132>:	48 8b 54 24 08	mov    0x8(%rsp),%rdx
   0x0000000000196069 <+137>:	48 89 07	mov    %rax,(%rdi)
   0x000000000019606c <+140>:	48 89 57 08	mov    %rdx,0x8(%rdi)
   0x0000000000196070 <+144>:	48 8b 44 24 28	mov    0x28(%rsp),%rax
   0x0000000000196075 <+149>:	64 48 33 04 25 28 00 00 00	xor    %fs:0x28,%rax
   0x000000000019607e <+158>:	75 05	jne    0x196085 <helper_vpkpx+165>
   0x0000000000196080 <+160>:	48 83 c4 38	add    $0x38,%rsp
   0x0000000000196084 <+164>:	c3	retq   
   0x0000000000196085 <+165>:	e8 2e 66 f0 ff	callq  0x9c6b8
End of assembler dump.

Implementation you suggested:

1) Both C and assembly code:

Dump of assembler code for function helper_vpkpx:
1313	{
   0x0000000000195fe0 <+0>:	55	push   %rbp
   0x0000000000195fe1 <+1>:	53	push   %rbx

1314	    uint64_t rh = pkpx_2(a->u64[1], a->u64[0]);
   0x0000000000195fe2 <+2>:	48 8b 46 08	mov    0x8(%rsi),%rax
   0x0000000000195fe6 <+6>:	48 8b 0e	mov    (%rsi),%rcx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000195fe9 <+9>:	49 89 c1	mov    %rax,%r9

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000195fec <+12>:	48 89 c6	mov    %rax,%rsi
   0x0000000000195fef <+15>:	49 89 c3	mov    %rax,%r11
   0x0000000000195ff2 <+18>:	48 c1 ee 29	shr    $0x29,%rsi

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000195ff6 <+22>:	49 c1 e9 26	shr    $0x26,%r9

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000195ffa <+26>:	49 c1 eb 09	shr    $0x9,%r11
   0x0000000000195ffe <+30>:	81 e6 00 fc 00 00	and    $0xfc00,%esi

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196004 <+36>:	41 81 e1 e0 03 00 00	and    $0x3e0,%r9d
   0x000000000019600b <+43>:	49 89 ca	mov    %rcx,%r10

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x000000000019600e <+46>:	49 89 f0	mov    %rsi,%r8

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196011 <+49>:	4c 89 ce	mov    %r9,%rsi

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196014 <+52>:	49 89 c1	mov    %rax,%r9
   0x0000000000196017 <+55>:	49 c1 e9 23	shr    $0x23,%r9

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x000000000019601b <+59>:	4c 09 c6	or     %r8,%rsi
   0x000000000019601e <+62>:	49 c1 ea 26	shr    $0x26,%r10

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196022 <+66>:	41 83 e1 1f	and    $0x1f,%r9d

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196026 <+70>:	41 81 e2 e0 03 00 00	and    $0x3e0,%r10d

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x000000000019602d <+77>:	49 09 f1	or     %rsi,%r9

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000196030 <+80>:	4c 89 de	mov    %r11,%rsi

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196033 <+83>:	49 89 c3	mov    %rax,%r11
   0x0000000000196036 <+86>:	49 c1 eb 06	shr    $0x6,%r11

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x000000000019603a <+90>:	81 e6 00 fc 00 00	and    $0xfc00,%esi

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196040 <+96>:	48 c1 e8 03	shr    $0x3,%rax

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196044 <+100>:	41 81 e3 e0 03 00 00	and    $0x3e0,%r11d

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x000000000019604b <+107>:	83 e0 1f	and    $0x1f,%eax

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x000000000019604e <+110>:	49 09 f3	or     %rsi,%r11

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196051 <+113>:	49 09 c3	or     %rax,%r11

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000196054 <+116>:	48 89 c8	mov    %rcx,%rax
   0x0000000000196057 <+119>:	48 c1 e8 29	shr    $0x29,%rax
   0x000000000019605b <+123>:	25 00 fc 00 00	and    $0xfc00,%eax
   0x0000000000196060 <+128>:	48 89 c6	mov    %rax,%rsi

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196063 <+131>:	4c 89 d0	mov    %r10,%rax

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196066 <+134>:	49 89 ca	mov    %rcx,%r10
   0x0000000000196069 <+137>:	49 c1 ea 23	shr    $0x23,%r10

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x000000000019606d <+141>:	48 09 f0	or     %rsi,%rax

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196070 <+144>:	41 83 e2 1f	and    $0x1f,%r10d
   0x0000000000196074 <+148>:	49 09 c2	or     %rax,%r10

1315	    uint64_t rl = pkpx_2(b->u64[1], b->u64[0]);
   0x0000000000196077 <+151>:	48 8b 02	mov    (%rdx),%rax
   0x000000000019607a <+154>:	48 8b 52 08	mov    0x8(%rdx),%rdx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x000000000019607e <+158>:	49 89 d0	mov    %rdx,%r8

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000196081 <+161>:	48 89 d6	mov    %rdx,%rsi

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196084 <+164>:	49 c1 e8 26	shr    $0x26,%r8

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000196088 <+168>:	48 c1 ee 29	shr    $0x29,%rsi

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x000000000019608c <+172>:	41 81 e0 e0 03 00 00	and    $0x3e0,%r8d

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000196093 <+179>:	48 89 f3	mov    %rsi,%rbx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196096 <+182>:	4c 89 c6	mov    %r8,%rsi

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196099 <+185>:	49 89 d0	mov    %rdx,%r8

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x000000000019609c <+188>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x00000000001960a2 <+194>:	49 c1 e8 23	shr    $0x23,%r8

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960a6 <+198>:	48 09 de	or     %rbx,%rsi

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x00000000001960a9 <+201>:	41 83 e0 1f	and    $0x1f,%r8d
   0x00000000001960ad <+205>:	49 09 f0	or     %rsi,%r8

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x00000000001960b0 <+208>:	48 89 d6	mov    %rdx,%rsi
   0x00000000001960b3 <+211>:	48 c1 ee 09	shr    $0x9,%rsi

1316	    r->u64[1] = rh;
   0x00000000001960b7 <+215>:	49 c1 e1 30	shl    $0x30,%r9
   0x00000000001960bb <+219>:	49 c1 e3 20	shl    $0x20,%r11

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x00000000001960bf <+223>:	48 89 f3	mov    %rsi,%rbx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960c2 <+226>:	48 89 d6	mov    %rdx,%rsi

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x00000000001960c5 <+229>:	48 c1 ea 03	shr    $0x3,%rdx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960c9 <+233>:	48 c1 ee 06	shr    $0x6,%rsi

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x00000000001960cd <+237>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x00000000001960d3 <+243>:	83 e2 1f	and    $0x1f,%edx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960d6 <+246>:	81 e6 e0 03 00 00	and    $0x3e0,%esi

1316	    r->u64[1] = rh;
   0x00000000001960dc <+252>:	49 c1 e2 10	shl    $0x10,%r10

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960e0 <+256>:	48 09 de	or     %rbx,%rsi

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x00000000001960e3 <+259>:	48 89 c3	mov    %rax,%rbx

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x00000000001960e6 <+262>:	48 09 f2	or     %rsi,%rdx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960e9 <+265>:	48 89 c6	mov    %rax,%rsi

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x00000000001960ec <+268>:	48 c1 eb 29	shr    $0x29,%rbx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960f0 <+272>:	48 c1 ee 26	shr    $0x26,%rsi

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x00000000001960f4 <+276>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x00000000001960fa <+282>:	81 e6 e0 03 00 00	and    $0x3e0,%esi

1296	    r  = ((a >> (shr + 9)) & 0xfc00);
   0x0000000000196100 <+288>:	48 89 dd	mov    %rbx,%rbp

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x0000000000196103 <+291>:	48 89 f3	mov    %rsi,%rbx

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196106 <+294>:	48 89 c6	mov    %rax,%rsi
   0x0000000000196109 <+297>:	48 c1 ee 23	shr    $0x23,%rsi

1297	    r |= ((a >> (shr + 6)) & 0x3e0);
   0x000000000019610d <+301>:	48 09 eb	or     %rbp,%rbx

1298	    r |= ((a >> (shr + 3)) & 0x1f);
   0x0000000000196110 <+304>:	83 e6 1f	and    $0x1f,%esi
   0x0000000000196113 <+307>:	48 09 de	or     %rbx,%rsi

1316	    r->u64[1] = rh;
   0x0000000000196116 <+310>:	48 89 cb	mov    %rcx,%rbx
   0x0000000000196119 <+313>:	48 c1 eb 09	shr    $0x9,%rbx
   0x000000000019611d <+317>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx
   0x0000000000196123 <+323>:	48 89 dd	mov    %rbx,%rbp
   0x0000000000196126 <+326>:	48 89 cb	mov    %rcx,%rbx
   0x0000000000196129 <+329>:	48 c1 e9 03	shr    $0x3,%rcx
   0x000000000019612d <+333>:	48 c1 eb 06	shr    $0x6,%rbx
   0x0000000000196131 <+337>:	83 e1 1f	and    $0x1f,%ecx
   0x0000000000196134 <+340>:	81 e3 e0 03 00 00	and    $0x3e0,%ebx
   0x000000000019613a <+346>:	48 09 eb	or     %rbp,%rbx
   0x000000000019613d <+349>:	48 09 d9	or     %rbx,%rcx
   0x0000000000196140 <+352>:	4c 09 c9	or     %r9,%rcx
   0x0000000000196143 <+355>:	4c 09 d9	or     %r11,%rcx
   0x0000000000196146 <+358>:	4c 09 d1	or     %r10,%rcx
   0x0000000000196149 <+361>:	48 89 4f 08	mov    %rcx,0x8(%rdi)

1317	    r->u64[0] = rl;
   0x000000000019614d <+365>:	48 89 c1	mov    %rax,%rcx
   0x0000000000196150 <+368>:	48 c1 e9 09	shr    $0x9,%rcx
   0x0000000000196154 <+372>:	81 e1 00 fc 00 00	and    $0xfc00,%ecx
   0x000000000019615a <+378>:	49 89 c9	mov    %rcx,%r9
   0x000000000019615d <+381>:	48 89 c1	mov    %rax,%rcx
   0x0000000000196160 <+384>:	48 c1 e9 06	shr    $0x6,%rcx
   0x0000000000196164 <+388>:	48 c1 e8 03	shr    $0x3,%rax
   0x0000000000196168 <+392>:	49 c1 e0 30	shl    $0x30,%r8
   0x000000000019616c <+396>:	81 e1 e0 03 00 00	and    $0x3e0,%ecx
   0x0000000000196172 <+402>:	83 e0 1f	and    $0x1f,%eax
   0x0000000000196175 <+405>:	48 c1 e2 20	shl    $0x20,%rdx
   0x0000000000196179 <+409>:	4c 09 c9	or     %r9,%rcx
   0x000000000019617c <+412>:	48 09 c8	or     %rcx,%rax
   0x000000000019617f <+415>:	48 89 f1	mov    %rsi,%rcx
   0x0000000000196182 <+418>:	4c 09 c0	or     %r8,%rax
   0x0000000000196185 <+421>:	48 c1 e1 10	shl    $0x10,%rcx
   0x0000000000196189 <+425>:	48 09 d0	or     %rdx,%rax
   0x000000000019618c <+428>:	48 09 c8	or     %rcx,%rax

1318	}
   0x000000000019618f <+431>:	5b	pop    %rbx

1317	    r->u64[0] = rl;
   0x0000000000196190 <+432>:	48 89 07	mov    %rax,(%rdi)

1318	}
   0x0000000000196193 <+435>:	5d	pop    %rbp
   0x0000000000196194 <+436>:	c3	retq   
End of assembler dump.

2) Only assembly code:

Dump of assembler code for function helper_vpkpx:
   0x0000000000195fe0 <+0>:	55	push   %rbp
   0x0000000000195fe1 <+1>:	53	push   %rbx
   0x0000000000195fe2 <+2>:	48 8b 46 08	mov    0x8(%rsi),%rax
   0x0000000000195fe6 <+6>:	48 8b 0e	mov    (%rsi),%rcx
   0x0000000000195fe9 <+9>:	49 89 c1	mov    %rax,%r9
   0x0000000000195fec <+12>:	48 89 c6	mov    %rax,%rsi
   0x0000000000195fef <+15>:	49 89 c3	mov    %rax,%r11
   0x0000000000195ff2 <+18>:	48 c1 ee 29	shr    $0x29,%rsi
   0x0000000000195ff6 <+22>:	49 c1 e9 26	shr    $0x26,%r9
   0x0000000000195ffa <+26>:	49 c1 eb 09	shr    $0x9,%r11
   0x0000000000195ffe <+30>:	81 e6 00 fc 00 00	and    $0xfc00,%esi
   0x0000000000196004 <+36>:	41 81 e1 e0 03 00 00	and    $0x3e0,%r9d
   0x000000000019600b <+43>:	49 89 ca	mov    %rcx,%r10
   0x000000000019600e <+46>:	49 89 f0	mov    %rsi,%r8
   0x0000000000196011 <+49>:	4c 89 ce	mov    %r9,%rsi
   0x0000000000196014 <+52>:	49 89 c1	mov    %rax,%r9
   0x0000000000196017 <+55>:	49 c1 e9 23	shr    $0x23,%r9
   0x000000000019601b <+59>:	4c 09 c6	or     %r8,%rsi
   0x000000000019601e <+62>:	49 c1 ea 26	shr    $0x26,%r10
   0x0000000000196022 <+66>:	41 83 e1 1f	and    $0x1f,%r9d
   0x0000000000196026 <+70>:	41 81 e2 e0 03 00 00	and    $0x3e0,%r10d
   0x000000000019602d <+77>:	49 09 f1	or     %rsi,%r9
   0x0000000000196030 <+80>:	4c 89 de	mov    %r11,%rsi
   0x0000000000196033 <+83>:	49 89 c3	mov    %rax,%r11
   0x0000000000196036 <+86>:	49 c1 eb 06	shr    $0x6,%r11
   0x000000000019603a <+90>:	81 e6 00 fc 00 00	and    $0xfc00,%esi
   0x0000000000196040 <+96>:	48 c1 e8 03	shr    $0x3,%rax
   0x0000000000196044 <+100>:	41 81 e3 e0 03 00 00	and    $0x3e0,%r11d
   0x000000000019604b <+107>:	83 e0 1f	and    $0x1f,%eax
   0x000000000019604e <+110>:	49 09 f3	or     %rsi,%r11
   0x0000000000196051 <+113>:	49 09 c3	or     %rax,%r11
   0x0000000000196054 <+116>:	48 89 c8	mov    %rcx,%rax
   0x0000000000196057 <+119>:	48 c1 e8 29	shr    $0x29,%rax
   0x000000000019605b <+123>:	25 00 fc 00 00	and    $0xfc00,%eax
   0x0000000000196060 <+128>:	48 89 c6	mov    %rax,%rsi
   0x0000000000196063 <+131>:	4c 89 d0	mov    %r10,%rax
   0x0000000000196066 <+134>:	49 89 ca	mov    %rcx,%r10
   0x0000000000196069 <+137>:	49 c1 ea 23	shr    $0x23,%r10
   0x000000000019606d <+141>:	48 09 f0	or     %rsi,%rax
   0x0000000000196070 <+144>:	41 83 e2 1f	and    $0x1f,%r10d
   0x0000000000196074 <+148>:	49 09 c2	or     %rax,%r10
   0x0000000000196077 <+151>:	48 8b 02	mov    (%rdx),%rax
   0x000000000019607a <+154>:	48 8b 52 08	mov    0x8(%rdx),%rdx
   0x000000000019607e <+158>:	49 89 d0	mov    %rdx,%r8
   0x0000000000196081 <+161>:	48 89 d6	mov    %rdx,%rsi
   0x0000000000196084 <+164>:	49 c1 e8 26	shr    $0x26,%r8
   0x0000000000196088 <+168>:	48 c1 ee 29	shr    $0x29,%rsi
   0x000000000019608c <+172>:	41 81 e0 e0 03 00 00	and    $0x3e0,%r8d
   0x0000000000196093 <+179>:	48 89 f3	mov    %rsi,%rbx
   0x0000000000196096 <+182>:	4c 89 c6	mov    %r8,%rsi
   0x0000000000196099 <+185>:	49 89 d0	mov    %rdx,%r8
   0x000000000019609c <+188>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx
   0x00000000001960a2 <+194>:	49 c1 e8 23	shr    $0x23,%r8
   0x00000000001960a6 <+198>:	48 09 de	or     %rbx,%rsi
   0x00000000001960a9 <+201>:	41 83 e0 1f	and    $0x1f,%r8d
   0x00000000001960ad <+205>:	49 09 f0	or     %rsi,%r8
   0x00000000001960b0 <+208>:	48 89 d6	mov    %rdx,%rsi
   0x00000000001960b3 <+211>:	48 c1 ee 09	shr    $0x9,%rsi
   0x00000000001960b7 <+215>:	49 c1 e1 30	shl    $0x30,%r9
   0x00000000001960bb <+219>:	49 c1 e3 20	shl    $0x20,%r11
   0x00000000001960bf <+223>:	48 89 f3	mov    %rsi,%rbx
   0x00000000001960c2 <+226>:	48 89 d6	mov    %rdx,%rsi
   0x00000000001960c5 <+229>:	48 c1 ea 03	shr    $0x3,%rdx
   0x00000000001960c9 <+233>:	48 c1 ee 06	shr    $0x6,%rsi
   0x00000000001960cd <+237>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx
   0x00000000001960d3 <+243>:	83 e2 1f	and    $0x1f,%edx
   0x00000000001960d6 <+246>:	81 e6 e0 03 00 00	and    $0x3e0,%esi
   0x00000000001960dc <+252>:	49 c1 e2 10	shl    $0x10,%r10
   0x00000000001960e0 <+256>:	48 09 de	or     %rbx,%rsi
   0x00000000001960e3 <+259>:	48 89 c3	mov    %rax,%rbx
   0x00000000001960e6 <+262>:	48 09 f2	or     %rsi,%rdx
   0x00000000001960e9 <+265>:	48 89 c6	mov    %rax,%rsi
   0x00000000001960ec <+268>:	48 c1 eb 29	shr    $0x29,%rbx
   0x00000000001960f0 <+272>:	48 c1 ee 26	shr    $0x26,%rsi
   0x00000000001960f4 <+276>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx
   0x00000000001960fa <+282>:	81 e6 e0 03 00 00	and    $0x3e0,%esi
   0x0000000000196100 <+288>:	48 89 dd	mov    %rbx,%rbp
   0x0000000000196103 <+291>:	48 89 f3	mov    %rsi,%rbx
   0x0000000000196106 <+294>:	48 89 c6	mov    %rax,%rsi
   0x0000000000196109 <+297>:	48 c1 ee 23	shr    $0x23,%rsi
   0x000000000019610d <+301>:	48 09 eb	or     %rbp,%rbx
   0x0000000000196110 <+304>:	83 e6 1f	and    $0x1f,%esi
   0x0000000000196113 <+307>:	48 09 de	or     %rbx,%rsi
   0x0000000000196116 <+310>:	48 89 cb	mov    %rcx,%rbx
   0x0000000000196119 <+313>:	48 c1 eb 09	shr    $0x9,%rbx
   0x000000000019611d <+317>:	81 e3 00 fc 00 00	and    $0xfc00,%ebx
   0x0000000000196123 <+323>:	48 89 dd	mov    %rbx,%rbp
   0x0000000000196126 <+326>:	48 89 cb	mov    %rcx,%rbx
   0x0000000000196129 <+329>:	48 c1 e9 03	shr    $0x3,%rcx
   0x000000000019612d <+333>:	48 c1 eb 06	shr    $0x6,%rbx
   0x0000000000196131 <+337>:	83 e1 1f	and    $0x1f,%ecx
   0x0000000000196134 <+340>:	81 e3 e0 03 00 00	and    $0x3e0,%ebx
   0x000000000019613a <+346>:	48 09 eb	or     %rbp,%rbx
   0x000000000019613d <+349>:	48 09 d9	or     %rbx,%rcx
   0x0000000000196140 <+352>:	4c 09 c9	or     %r9,%rcx
   0x0000000000196143 <+355>:	4c 09 d9	or     %r11,%rcx
   0x0000000000196146 <+358>:	4c 09 d1	or     %r10,%rcx
   0x0000000000196149 <+361>:	48 89 4f 08	mov    %rcx,0x8(%rdi)
   0x000000000019614d <+365>:	48 89 c1	mov    %rax,%rcx
   0x0000000000196150 <+368>:	48 c1 e9 09	shr    $0x9,%rcx
   0x0000000000196154 <+372>:	81 e1 00 fc 00 00	and    $0xfc00,%ecx
   0x000000000019615a <+378>:	49 89 c9	mov    %rcx,%r9
   0x000000000019615d <+381>:	48 89 c1	mov    %rax,%rcx
   0x0000000000196160 <+384>:	48 c1 e9 06	shr    $0x6,%rcx
   0x0000000000196164 <+388>:	48 c1 e8 03	shr    $0x3,%rax
   0x0000000000196168 <+392>:	49 c1 e0 30	shl    $0x30,%r8
   0x000000000019616c <+396>:	81 e1 e0 03 00 00	and    $0x3e0,%ecx
   0x0000000000196172 <+402>:	83 e0 1f	and    $0x1f,%eax
   0x0000000000196175 <+405>:	48 c1 e2 20	shl    $0x20,%rdx
   0x0000000000196179 <+409>:	4c 09 c9	or     %r9,%rcx
   0x000000000019617c <+412>:	48 09 c8	or     %rcx,%rax
   0x000000000019617f <+415>:	48 89 f1	mov    %rsi,%rcx
   0x0000000000196182 <+418>:	4c 09 c0	or     %r8,%rax
   0x0000000000196185 <+421>:	48 c1 e1 10	shl    $0x10,%rcx
   0x0000000000196189 <+425>:	48 09 d0	or     %rdx,%rax
   0x000000000019618c <+428>:	48 09 c8	or     %rcx,%rax
   0x000000000019618f <+431>:	5b	pop    %rbx
   0x0000000000196190 <+432>:	48 89 07	mov    %rax,(%rdi)
   0x0000000000196193 <+435>:	5d	pop    %rbp
   0x0000000000196194 <+436>:	c3	retq   
End of assembler dump.
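
For readability, the scalar packing logic the interleaved listing above
refers to (source lines 1296-1298 and 1315-1317 in the listing) can be
sketched in C roughly as follows. The name pkpx_2 appears in the listing
itself, but the exact shape of the helper in target/ppc/int_helper.c may
differ, so treat this only as a reading aid for the assembly:

#include <stdint.h>

/*
 * Pack one 32-bit pixel into a 6-5-5 halfword, mirroring the
 * "r = ((a >> (shr + 9)) & 0xfc00); ..." lines in the listing.
 * shr selects the pixel within the doubleword (32 = high, 0 = low).
 */
static inline uint64_t pack_pixel(uint64_t a, int shr)
{
    uint64_t r;

    r  = (a >> (shr + 9)) & 0xfc00;
    r |= (a >> (shr + 6)) & 0x3e0;
    r |= (a >> (shr + 3)) & 0x1f;
    return r;
}

/* Pack the four pixels held in two source doublewords into one
   64-bit result of four halfwords. */
static uint64_t pkpx_2(uint64_t d1, uint64_t d0)
{
    return (pack_pixel(d1, 32) << 48) | (pack_pixel(d1, 0) << 32) |
           (pack_pixel(d0, 32) << 16) |  pack_pixel(d0, 0);
}

This corresponds to the repeated shr/and/or pattern in the dump and to
the shl $0x30/$0x20/$0x10 steps that combine four halfwords before each
64-bit store to (%rdi).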


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-08-29 13:34     ` Stefan Brankovic
@ 2019-08-29 15:31       ` Richard Henderson
  2019-10-16 13:53         ` Stefan Brankovic
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2019-08-29 15:31 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 8/29/19 6:34 AM, Stefan Brankovic wrote:
> Then I ran my performance tests and I got the following results (the test
> calls vpkpx 100000 times):
> 
> 1) Current helper implementation: ~ 157 ms
> 
> 2) helper implementation you suggested: ~94 ms
> 
> 3) tcg implementation: ~75 ms

I assume you tested in a loop.  If you have just the one expansion, you'll not
see the penalty for the icache expansion.  To show the other extreme, you'd
want to test as separate sequential invocations.

That said, I'd be more interested in a real test case that isn't just calling
one instruction over and over.  Is there a real test case that shows vpkpx in
the top 25 of the profile?  With more than 0.5% of runtime?


r~


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-08-29 15:31       ` Richard Henderson
@ 2019-10-16 13:53         ` Stefan Brankovic
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Brankovic @ 2019-10-16 13:53 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 29.8.19. 17:31, Richard Henderson wrote:
> On 8/29/19 6:34 AM, Stefan Brankovic wrote:
>> Then I ran my performance tests and I got the following results (the test
>> calls vpkpx 100000 times):
>>
>> 1) Current helper implementation: ~ 157 ms
>>
>> 2) helper implementation you suggested: ~94 ms
>>
>> 3) tcg implementation: ~75 ms
> I assume you tested in a loop.  If you have just the one expansion, you'll not
> see the penalty for the icache expansion.  To show the other extreme, you'd
> want to test as separate sequential invocations.
Yes, testing is done in a loop.
>
> That said, I'd be more interested in a real test case that isn't just calling
> one instruction over and over.  Is there a real test case that shows vpkpx in
> the top 25 of the profile?  With more than 0.5% of runtime?
>
>
> r~

I made an experiment where I booted Mac OS X 10.4 in QEMU system mode 
and found that the vpkpx instruction is used extensively to display 
various graphical elements. With that in mind, this performance 
improvement is of considerable importance.

Also, the vpkpx instruction is typically used in a loop, to process a 
large number of pixels at once. That's why measuring the performance of 
this instruction in a loop should give a good indication of how it 
performs overall; a sketch of such a loop test is included below.
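
As an illustration only (the actual benchmark program was not posted,
and the function name below is made up for the example), a guest-side
loop test of this kind might look roughly like the following, built for
a ppc target with Altivec enabled:

#include <altivec.h>

/* Execute vpkpx n times on the same inputs; illustrative sketch only. */
vector unsigned short bench_vpkpx(vector unsigned int a,
                                  vector unsigned int b, int n)
{
    vector unsigned short r = {0};

    for (int i = 0; i < n; i++) {
        /* a single expansion of vpkpx, executed n times in a row */
        asm volatile("vpkpx %0, %1, %2" : "=v"(r) : "v"(a), "v"(b));
    }
    return r;
}

Timing a call like bench_vpkpx(a, b, 100000) exercises one expansion of
the instruction over and over, so the icache cost of the expanded code
is largely amortized away; testing separate sequential invocations, as
Richard suggests, would expose it.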

Kind Regards,

Stefan



^ permalink raw reply	[flat|nested] 12+ messages in thread

