* [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions
@ 2019-10-17 11:27 Stefan Brankovic
  2019-10-17 11:27 ` [PATCH v7 1/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Stefan Brankovic @ 2019-10-17 11:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: stefan.brankovic, richard.henderson, david

Optimize emulation of twelve Altivec instructions: lvsl, lvsr, vsl, vsr, vpkpx,
vgbbd, vclzb, vclzh, vclzw, vclzd, vupkhpx and vupklpx.

This series builds on and complements recent work by Thomas Murta, Mark
Cave-Ayland and Richard Henderson in the same area. It is based on devising TCG
translation implementations for the selected instructions rather than using
helpers. The selected instructions are mostly idiosyncratic to the ppc
platform, so a relatively complex TCG translation (a direct mapping to a host
instruction is not possible in these cases) seems to be the best option, and
that approach is presented in this series. The performance improvements are
significant in all cases.

V7:

Added optimization for vupkhpx and vupklpx instructions.

V6:

Rebased the series to the latest QEMU code.
Excluded all patches that were already accepted.

V5:

Fixed the vpkpx bug and added the instruction back into the series.
Fixed graphical distortions on OSX 10.3 and 10.4.
Removed the conversion of vmrgh and vmrgl instructions to vector operations,
pending further investigation.

V4:

Addressed Richard Henderson's suggestions.
Removed the vpkpx optimization pending further investigation of graphical
distortions it caused on OSX 10.2-10.4 guests.
Added opcodes for vector vmrgh(b|h|w) and vmrgl(b|h|w) in TCG.
Implemented vector vmrgh and vmrgl instructions for i386.
Converted vmrgh and vmrgl instructions to vector operations.

V3:

Fixed a build problem.

V2:

Addressed Richard Henderson's suggestions.
Fixed a build problem in patch 2/8.
Rebased the series to the latest QEMU code.

Stefan Brankovic (3):
  target/ppc: Optimize emulation of vclzh and vclzb instructions
  target/ppc: Optimize emulation of vpkpx instruction
  target/ppc: Optimize emulation of vupkhpx and vupklpx instructions

 target/ppc/helper.h                 |   5 -
 target/ppc/int_helper.c             |  50 ------
 target/ppc/translate/vmx-impl.inc.c | 326 +++++++++++++++++++++++++++++++++++-
 3 files changed, 321 insertions(+), 60 deletions(-)

-- 
2.7.4



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v7 1/3] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-10-17 11:27 [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
@ 2019-10-17 11:27 ` Stefan Brankovic
  2019-10-19 19:51   ` Aleksandar Markovic
  2019-10-17 11:27 ` [PATCH v7 2/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Stefan Brankovic @ 2019-10-17 11:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: stefan.brankovic, richard.henderson, david

Optimize the Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
This instruction counts the number of leading zeros of each halfword element
in the source register and places the result in the corresponding halfword
element of the destination register.

In each iteration of the outer for loop, the count operation is performed on
one doubleword element of source register vB. In the first iteration, the
high doubleword element of vB is placed in the variable avr, and the count
for every halfword element is performed using tcg_gen_clzi_i64. Since that
function counts leading zeros over a 64-bit length, the ith halfword element
has to be moved to the highest 16 bits of tmp and or-ed with a mask (so that
the lowest 48 bits are all ones); tcg_gen_clzi_i64 is then performed and its
result moved into the corresponding halfword element of the result. This is
done in the inner for loop. After the operation is finished, the result is
saved in the corresponding doubleword element of destination register vD. The
same sequence of operations is then applied to the low doubleword element
of vB.

Optimize the Altivec instruction vclzb (Vector Count Leading Zeros Byte).
This instruction counts the number of leading zeros of each byte element
in the source register and places the result in the corresponding byte
element of the destination register.

In each iteration of the outer for loop, the count operation is performed on
one doubleword element of source register vB. In the first iteration, the
high doubleword element of vB is placed in the variable avr, and the count
for every byte element is performed using tcg_gen_clzi_i64. Since that
function counts leading zeros over a 64-bit length, the ith byte element has
to be moved to the highest 8 bits of the variable tmp and or-ed with a mask
(so that the lowest 56 bits are all ones); tcg_gen_clzi_i64 is then performed
and its result moved into the corresponding byte element of the result. This
is done in the inner for loop. After the operation is finished, the result is
saved in the corresponding doubleword element of destination register vD. The
same sequence of operations is then applied to the low doubleword element
of vB.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |   2 -
 target/ppc/int_helper.c             |   9 ---
 target/ppc/translate/vmx-impl.inc.c | 136 +++++++++++++++++++++++++++++++++++-
 3 files changed, 134 insertions(+), 13 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index f843814..281e54f 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -308,8 +308,6 @@ DEF_HELPER_4(vcfsx, void, env, avr, avr, i32)
 DEF_HELPER_4(vctuxs, void, env, avr, avr, i32)
 DEF_HELPER_4(vctsxs, void, env, avr, avr, i32)
 
-DEF_HELPER_2(vclzb, void, avr, avr)
-DEF_HELPER_2(vclzh, void, avr, avr)
 DEF_HELPER_2(vctzb, void, avr, avr)
 DEF_HELPER_2(vctzh, void, avr, avr)
 DEF_HELPER_2(vctzw, void, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 6d238b9..cd00f5e 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1817,15 +1817,6 @@ VUPK(lsw, s64, s32, UPKLO)
         }                                                               \
     }
 
-#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
-#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
-
-VGENERIC_DO(clzb, u8)
-VGENERIC_DO(clzh, u16)
-
-#undef clzb
-#undef clzh
-
 #define ctzb(v) ((v) ? ctz32(v) : 8)
 #define ctzh(v) ((v) ? ctz32(v) : 16)
 #define ctzw(v) ctz32((v))
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 2472a52..a428ef3 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -751,6 +751,138 @@ static void trans_vgbbd(DisasContext *ctx)
 }
 
 /*
+ * vclzb VRT,VRB - Vector Count Leading Zeros Byte
+ *
+ * Count the number of leading zero bits of each byte element in the source
+ * register and place the result in the corresponding byte element of the
+ * destination register.
+ */
+static void trans_vclzb(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword of vB in 'avr'. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword of vB in 'avr'. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform the count for every byte element using 'tcg_gen_clzi_i64'.
+         * Since it counts leading zeros over a 64-bit length, move the ith
+         * byte element to the highest 8 bits of 'tmp' and or it with the mask
+         * (all ones in the lowest 56 bits), then perform 'tcg_gen_clzi_i64'
+         * and move its result into the appropriate byte element of the result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 56);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 7; j++) {
+            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            tcg_gen_mov_i64(result1, result);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+
+    set_avr64(VT, result1, true);
+    set_avr64(VT, result2, false);
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
+ * vclzh VRT,VRB - Vector Count Leading Zeros Halfword
+ *
+ * Count the number of leading zero bits of each halfword element in the
+ * source register and place the result in the corresponding halfword element
+ * of the destination register.
+ */
+static void trans_vclzh(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword element of vB in 'avr'. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword element of vB in 'avr'. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform the count for every halfword element using
+         * 'tcg_gen_clzi_i64'. Since it counts leading zeros over a 64-bit
+         * length, move the ith halfword element to the highest 16 bits of
+         * 'tmp' and or it with the mask (all ones in the lowest 48 bits),
+         * then perform 'tcg_gen_clzi_i64' and store its result there.
+         */
+        tcg_gen_shli_i64(tmp, avr, 48);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 3; j++) {
+            tcg_gen_shli_i64(tmp, avr, (3 - j) * 16);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 16, 16);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 48, 16);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            tcg_gen_mov_i64(result1, result);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+
+    set_avr64(VT, result1, true);
+    set_avr64(VT, result2, false);
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
  * vclzw VRT,VRB - Vector Count Leading Zeros Word
  *
  * Counting the number of leading zero bits of each word element in source
@@ -1315,8 +1447,8 @@ GEN_VAFORM_PAIRED(vmsumshm, vmsumshs, 20)
 GEN_VAFORM_PAIRED(vsel, vperm, 21)
 GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 
-GEN_VXFORM_NOA(vclzb, 1, 28)
-GEN_VXFORM_NOA(vclzh, 1, 29)
+GEN_VXFORM_TRANS(vclzb, 1, 28)
+GEN_VXFORM_TRANS(vclzh, 1, 29)
 GEN_VXFORM_TRANS(vclzw, 1, 30)
 GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
-- 
2.7.4




* [PATCH v7 2/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-10-17 11:27 [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
  2019-10-17 11:27 ` [PATCH v7 1/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
@ 2019-10-17 11:27 ` Stefan Brankovic
  2019-10-19 19:35   ` Aleksandar Markovic
  2019-10-17 11:27 ` [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions Stefan Brankovic
  2019-10-19 19:54 ` [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Aleksandar Markovic
  3 siblings, 1 reply; 11+ messages in thread
From: Stefan Brankovic @ 2019-10-17 11:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: stefan.brankovic, richard.henderson, david

Optimize the Altivec instruction vpkpx (Vector Pack Pixel). It rearranges
8 pixels coded in a 6-5-5 pattern (4 from each source register) into a
contiguous array of bits in the destination register.

In each iteration of the outer loop, the 6-5-5 pack is performed on 2 pixels
of one doubleword element of one of the source registers. The first step in
the outer loop is choosing which doubleword element of which register is used
in the current iteration; it is placed in the variable avr. The next step is
to perform the 6-5-5 pack of the pixels held in avr in the inner for loop
(2 iterations, 1 per pixel) and save the result in the variable tmp. At the
end of the outer for loop, the result is merged into the variable result and,
whenever a whole doubleword is finished (every second iteration), saved in
the corresponding doubleword element of vD. The outer loop has 4 iterations.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |  1 -
 target/ppc/int_helper.c             | 21 --------
 target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 98 insertions(+), 23 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 281e54f..b489b38 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -258,7 +258,6 @@ DEF_HELPER_4(vpkudus, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkuhum, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkuwum, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkudum, void, env, avr, avr, avr)
-DEF_HELPER_3(vpkpx, void, avr, avr, avr)
 DEF_HELPER_5(vmhaddshs, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vmhraddshs, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vmsumuhm, void, env, avr, avr, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index cd00f5e..f910c11 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1262,27 +1262,6 @@ void helper_vpmsumd(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 #else
 #define PKBIG 0
 #endif
-void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
-{
-    int i, j;
-    ppc_avr_t result;
-#if defined(HOST_WORDS_BIGENDIAN)
-    const ppc_avr_t *x[2] = { a, b };
-#else
-    const ppc_avr_t *x[2] = { b, a };
-#endif
-
-    VECTOR_FOR_INORDER_I(i, u64) {
-        VECTOR_FOR_INORDER_I(j, u32) {
-            uint32_t e = x[i]->u32[j];
-
-            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
-                                     ((e >> 6) & 0x3e0) |
-                                     ((e >> 3) & 0x1f));
-        }
-    }
-    *r = result;
-}
 
 #define VPK(suffix, from, to, cvt, dosat)                               \
     void helper_vpk##suffix(CPUPPCState *env, ppc_avr_t *r,             \
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index a428ef3..3550ffa 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -579,6 +579,103 @@ static void trans_lvsr(DisasContext *ctx)
 }
 
 /*
+ * vpkpx VRT,VRA,VRB - Vector Pack Pixel
+ *
+ * Rearranges 8 pixels coded in a 6-5-5 pattern (4 from each source register)
+ * into a contiguous array of bits in the destination register.
+ */
+static void trans_vpkpx(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    int64_t mask1 = 0x1fULL;
+    int64_t mask2 = 0x1fULL << 5;
+    int64_t mask3 = 0x3fULL << 10;
+    int i, j;
+    /*
+     * In each iteration do the 6-5-5 pack for 2 pixels of each doubleword
+     * element of each source register.
+     */
+    for (i = 0; i < 4; i++) {
+        switch (i) {
+        case 0:
+            /*
+             * Get high doubleword of vA to perform 6-5-5 pack of pixels
+             * 1 and 2.
+             */
+            get_avr64(avr, VA, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 1:
+            /*
+             * Get low doubleword of vA to perform 6-5-5 pack of pixels
+             * 3 and 4.
+             */
+            get_avr64(avr, VA, false);
+            break;
+        case 2:
+            /*
+             * Get high doubleword of vB to perform 6-5-5 pack of pixels
+             * 5 and 6.
+             */
+            get_avr64(avr, VB, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 3:
+            /*
+             * Get low doubleword of vB to perform 6-5-5 pack of pixels
+             * 7 and 8.
+             */
+            get_avr64(avr, VB, false);
+            break;
+        }
+        /* Perform the packing for 2 pixels (1 per iteration). */
+        tcg_gen_movi_i64(tmp, 0x0ULL);
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
+            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
+            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
+            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+        }
+        if ((i == 0) || (i == 2)) {
+            tcg_gen_shli_i64(tmp, tmp, 32);
+        }
+        tcg_gen_or_i64(result, result, tmp);
+        if (i == 1) {
+            /* Place packed pixels 1:4 to high doubleword of vD. */
+            tcg_gen_mov_i64(result1, result);
+        }
+        if (i == 3) {
+            /* Place packed pixels 5:8 to low doubleword of vD. */
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+    set_avr64(VT, result1, true);
+    set_avr64(VT, result2, false);
+
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+}
+
+/*
  * vsl VRT,VRA,VRB - Vector Shift Left
  *
  * Shifting left 128 bit value of vA by value specified in bits 125-127 of vB.
@@ -1063,7 +1160,7 @@ GEN_VXFORM_ENV(vpksdus, 7, 21);
 GEN_VXFORM_ENV(vpkshss, 7, 6);
 GEN_VXFORM_ENV(vpkswss, 7, 7);
 GEN_VXFORM_ENV(vpksdss, 7, 23);
-GEN_VXFORM(vpkpx, 7, 12);
+GEN_VXFORM_TRANS(vpkpx, 7, 12);
 GEN_VXFORM_ENV(vsum4ubs, 4, 24);
 GEN_VXFORM_ENV(vsum4sbs, 4, 28);
 GEN_VXFORM_ENV(vsum4shs, 4, 25);
-- 
2.7.4




* [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions
  2019-10-17 11:27 [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
  2019-10-17 11:27 ` [PATCH v7 1/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
  2019-10-17 11:27 ` [PATCH v7 2/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
@ 2019-10-17 11:27 ` Stefan Brankovic
  2019-10-19 19:24   ` Aleksandar Markovic
                     ` (2 more replies)
  2019-10-19 19:54 ` [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Aleksandar Markovic
  3 siblings, 3 replies; 11+ messages in thread
From: Stefan Brankovic @ 2019-10-17 11:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: stefan.brankovic, richard.henderson, david

The 'trans_vupkpx' function implements emulation of both the vupkhpx and
vupklpx instructions, while its argument 'high' determines which instruction
is processed. The instructions are implemented with two 'for' loops. The
outer 'for' loop repeats the unpacking two times, since both doubleword
elements of the destination register are formed the same way. It also stores
the result of every iteration in a temporary variable, which is later
transferred to the destination register. The inner 'for' loop does the
unpacking of the pixels and forms the resulting doubleword 32 bits at a time.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |  2 -
 target/ppc/int_helper.c             | 20 --------
 target/ppc/translate/vmx-impl.inc.c | 91 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 89 insertions(+), 24 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index b489b38..fd06b56 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -233,8 +233,6 @@ DEF_HELPER_2(vextsh2d, void, avr, avr)
 DEF_HELPER_2(vextsw2d, void, avr, avr)
 DEF_HELPER_2(vnegw, void, avr, avr)
 DEF_HELPER_2(vnegd, void, avr, avr)
-DEF_HELPER_2(vupkhpx, void, avr, avr)
-DEF_HELPER_2(vupklpx, void, avr, avr)
 DEF_HELPER_2(vupkhsb, void, avr, avr)
 DEF_HELPER_2(vupkhsh, void, avr, avr)
 DEF_HELPER_2(vupkhsw, void, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index f910c11..9ee667d 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1737,26 +1737,6 @@ void helper_vsum4ubs(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 #define UPKHI 0
 #define UPKLO 1
 #endif
-#define VUPKPX(suffix, hi)                                              \
-    void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
-    {                                                                   \
-        int i;                                                          \
-        ppc_avr_t result;                                               \
-                                                                        \
-        for (i = 0; i < ARRAY_SIZE(r->u32); i++) {                      \
-            uint16_t e = b->u16[hi ? i : i + 4];                        \
-            uint8_t a = (e >> 15) ? 0xff : 0;                           \
-            uint8_t r = (e >> 10) & 0x1f;                               \
-            uint8_t g = (e >> 5) & 0x1f;                                \
-            uint8_t b = e & 0x1f;                                       \
-                                                                        \
-            result.u32[i] = (a << 24) | (r << 16) | (g << 8) | b;       \
-        }                                                               \
-        *r = result;                                                    \
-    }
-VUPKPX(lpx, UPKLO)
-VUPKPX(hpx, UPKHI)
-#undef VUPKPX
 
 #define VUPK(suffix, unpacked, packee, hi)                              \
     void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 3550ffa..09d80d6 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -1031,6 +1031,95 @@ static void trans_vclzd(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
+/*
+ * vupkhpx VRT,VRB - Vector Unpack High Pixel
+ * vupklpx VRT,VRB - Vector Unpack Low Pixel
+ *
+ * Unpacks 4 pixels coded in a 1-5-5-5 pattern from the high/low doubleword
+ * element of the source register into a contiguous array of bits in the
+ * destination register. The argument 'high' determines whether the high or
+ * low doubleword element of the source register is processed.
+ */
+static void trans_vupkpx(DisasContext *ctx, int high)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    int64_t mask1 = 0x1fULL;
+    int64_t mask2 = 0x1fULL << 8;
+    int64_t mask3 = 0x1fULL << 16;
+    int64_t mask4 = 0xffULL << 56;
+    int i, j;
+
+    if (high == 1) {
+        get_avr64(avr, VB, true);
+    } else {
+        get_avr64(avr, VB, false);
+    }
+
+    tcg_gen_movi_i64(result, 0x0ULL);
+    for (i = 0; i < 2; i++) {
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shli_i64(tmp, avr, (j * 16));
+            tcg_gen_andi_i64(tmp, tmp, mask1 << (j * 32));
+            tcg_gen_or_i64(result, result, tmp);
+
+            tcg_gen_shli_i64(tmp, avr, 3 + (j * 16));
+            tcg_gen_andi_i64(tmp, tmp, mask2 << (j * 32));
+            tcg_gen_or_i64(result, result, tmp);
+
+            tcg_gen_shli_i64(tmp, avr, 6 + (j * 16));
+            tcg_gen_andi_i64(tmp, tmp, mask3 << (j * 32));
+            tcg_gen_or_i64(result, result, tmp);
+
+            tcg_gen_shri_i64(tmp, avr, (j * 16));
+            tcg_gen_ext16s_i64(tmp, tmp);
+            tcg_gen_andi_i64(tmp, tmp, mask4);
+            tcg_gen_shri_i64(tmp, tmp, (32 * (1 - j)));
+            tcg_gen_or_i64(result, result, tmp);
+        }
+        if (i == 0) {
+            tcg_gen_mov_i64(result1, result);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            tcg_gen_shri_i64(avr, avr, 32);
+        }
+        if (i == 1) {
+            tcg_gen_mov_i64(result2, result);
+        }
+    }
+
+    set_avr64(VT, result1, false);
+    set_avr64(VT, result2, true);
+
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+}
+
+static void gen_vupkhpx(DisasContext *ctx)
+{
+    if (unlikely(!ctx->altivec_enabled)) {
+        gen_exception(ctx, POWERPC_EXCP_VPU);
+        return;
+    }
+    trans_vupkpx(ctx, 1);
+}
+
+static void gen_vupklpx(DisasContext *ctx)
+{
+    if (unlikely(!ctx->altivec_enabled)) {
+        gen_exception(ctx, POWERPC_EXCP_VPU);
+        return;
+    }
+    trans_vupkpx(ctx, 0);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -1348,8 +1437,6 @@ GEN_VXFORM_NOA(vupkhsw, 7, 25);
 GEN_VXFORM_NOA(vupklsb, 7, 10);
 GEN_VXFORM_NOA(vupklsh, 7, 11);
 GEN_VXFORM_NOA(vupklsw, 7, 27);
-GEN_VXFORM_NOA(vupkhpx, 7, 13);
-GEN_VXFORM_NOA(vupklpx, 7, 15);
 GEN_VXFORM_NOA_ENV(vrefp, 5, 4);
 GEN_VXFORM_NOA_ENV(vrsqrtefp, 5, 5);
 GEN_VXFORM_NOA_ENV(vexptefp, 5, 6);
-- 
2.7.4




* Re: [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions
  2019-10-17 11:27 ` [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions Stefan Brankovic
@ 2019-10-19 19:24   ` Aleksandar Markovic
  2019-10-19 20:06   ` Aleksandar Markovic
  2019-10-19 20:40   ` Aleksandar Markovic
  2 siblings, 0 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2019-10-19 19:24 UTC (permalink / raw)
  To: Stefan Brankovic; +Cc: richard.henderson, qemu-devel, david


On Thursday, October 17, 2019, Stefan Brankovic <stefan.brankovic@rt-rk.com>
wrote:

> 'trans_vupkpx' function implements both vupkhpx and vupklpx instructions
> with


implements both -> implements emulation of both

with -> , while its

argument 'high' determine which


determine -> determines


> instruction is processed. Instructions are
> implemented in two 'for' loops. Outer 'for' loop repeats unpacking two
> times,
> since both doubleword elements of destination register are formed the same
> way.
> It also stores result of every iteration in temporary register, that is
> later
> transferred to destination register. Inner 'for' loop does unpacking of
> pixels
> and forms resulting doubleword 32 by 32 bits.


temporary register -> a temporary variable

destination register -> the destination register

'forms resulting doubleword 32 by 32 bits' is unclear, reword.


> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |  2 -
>  target/ppc/int_helper.c             | 20 --------
>  target/ppc/translate/vmx-impl.inc.c | 91 ++++++++++++++++++++++++++++++
> ++++++-
>  3 files changed, 89 insertions(+), 24 deletions(-)
>
> diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> index b489b38..fd06b56 100644
> --- a/target/ppc/helper.h
> +++ b/target/ppc/helper.h
> @@ -233,8 +233,6 @@ DEF_HELPER_2(vextsh2d, void, avr, avr)
>  DEF_HELPER_2(vextsw2d, void, avr, avr)
>  DEF_HELPER_2(vnegw, void, avr, avr)
>  DEF_HELPER_2(vnegd, void, avr, avr)
> -DEF_HELPER_2(vupkhpx, void, avr, avr)
> -DEF_HELPER_2(vupklpx, void, avr, avr)
>  DEF_HELPER_2(vupkhsb, void, avr, avr)
>  DEF_HELPER_2(vupkhsh, void, avr, avr)
>  DEF_HELPER_2(vupkhsw, void, avr, avr)
> diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
> index f910c11..9ee667d 100644
> --- a/target/ppc/int_helper.c
> +++ b/target/ppc/int_helper.c
> @@ -1737,26 +1737,6 @@ void helper_vsum4ubs(CPUPPCState *env, ppc_avr_t
> *r, ppc_avr_t *a, ppc_avr_t *b)
>  #define UPKHI 0
>  #define UPKLO 1
>  #endif
> -#define VUPKPX(suffix, hi)                                              \
> -    void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
> -    {                                                                   \
> -        int i;                                                          \
> -        ppc_avr_t result;                                               \
> -                                                                        \
> -        for (i = 0; i < ARRAY_SIZE(r->u32); i++) {                      \
> -            uint16_t e = b->u16[hi ? i : i + 4];                        \
> -            uint8_t a = (e >> 15) ? 0xff : 0;                           \
> -            uint8_t r = (e >> 10) & 0x1f;                               \
> -            uint8_t g = (e >> 5) & 0x1f;                                \
> -            uint8_t b = e & 0x1f;                                       \
> -                                                                        \
> -            result.u32[i] = (a << 24) | (r << 16) | (g << 8) | b;       \
> -        }                                                               \
> -        *r = result;                                                    \
> -    }
> -VUPKPX(lpx, UPKLO)
> -VUPKPX(hpx, UPKHI)
> -#undef VUPKPX
>
>  #define VUPK(suffix, unpacked, packee, hi)                              \
>      void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
> diff --git a/target/ppc/translate/vmx-impl.inc.c
> b/target/ppc/translate/vmx-impl.inc.c
> index 3550ffa..09d80d6 100644
> --- a/target/ppc/translate/vmx-impl.inc.c
> +++ b/target/ppc/translate/vmx-impl.inc.c
> @@ -1031,6 +1031,95 @@ static void trans_vclzd(DisasContext *ctx)
>      tcg_temp_free_i64(avr);
>  }
>
> +/*
> + * vupkhpx VRT,VRB - Vector Unpack High Pixel
> + * vupklpx VRT,VRB - Vector Unpack Low Pixel
> + *
> + * Unpacks 4 pixels coded in 1-5-5-5 pattern from high/low doubleword
> element
> + * of source register into contigous array of bits in the destination
> register.
> + * Argument 'high' determines if high or low doubleword element of source
> + * register is processed.
> + */
> +static void trans_vupkpx(DisasContext *ctx, int high)
> +{
> +    int VT = rD(ctx->opcode);
> +    int VB = rB(ctx->opcode);
> +    TCGv_i64 tmp = tcg_temp_new_i64();
> +    TCGv_i64 avr = tcg_temp_new_i64();
> +    TCGv_i64 result = tcg_temp_new_i64();
> +    TCGv_i64 result1 = tcg_temp_new_i64();
> +    TCGv_i64 result2 = tcg_temp_new_i64();
> +    int64_t mask1 = 0x1fULL;
> +    int64_t mask2 = 0x1fULL << 8;
> +    int64_t mask3 = 0x1fULL << 16;
> +    int64_t mask4 = 0xffULL << 56;
> +    int i, j;
> +
> +    if (high == 1) {
> +        get_avr64(avr, VB, true);


before this line, insert comment: /* vupkhpx */


> +    } else {
> +        get_avr64(avr, VB, false);


before this line, insert comment: /* vupklpx */


> +    }
> +
> +    tcg_gen_movi_i64(result, 0x0ULL);
> +    for (i = 0; i < 2; i++) {
> +        for (j = 0; j < 2; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask1 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shli_i64(tmp, avr, 3 + (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask2 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shli_i64(tmp, avr, 6 + (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask3 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shri_i64(tmp, avr, (j * 16));
> +            tcg_gen_ext16s_i64(tmp, tmp);
> +            tcg_gen_andi_i64(tmp, tmp, mask4);
> +            tcg_gen_shri_i64(tmp, tmp, (32 * (1 - j)));
> +            tcg_gen_or_i64(result, result, tmp);
> +        }
> +        if (i == 0) {
> +            tcg_gen_mov_i64(result1, result);
> +            tcg_gen_movi_i64(result, 0x0ULL);
> +            tcg_gen_shri_i64(avr, avr, 32);
> +        }
> +        if (i == 1) {
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }
> +
> +    set_avr64(VT, result1, false);
> +    set_avr64(VT, result2, true);
> +
> +    tcg_temp_free_i64(tmp);
> +    tcg_temp_free_i64(avr);
> +    tcg_temp_free_i64(result);
> +    tcg_temp_free_i64(result1);
> +    tcg_temp_free_i64(result2);
> +}
> +
> +static void gen_vupkhpx(DisasContext *ctx)
> +{
> +    if (unlikely(!ctx->altivec_enabled)) {
> +        gen_exception(ctx, POWERPC_EXCP_VPU);
> +        return;
> +    }
> +    trans_vupkpx(ctx, 1);
> +}
> +
> +static void gen_vupklpx(DisasContext *ctx)
> +{
> +    if (unlikely(!ctx->altivec_enabled)) {
> +        gen_exception(ctx, POWERPC_EXCP_VPU);
> +        return;
> +    }
> +    trans_vupkpx(ctx, 0);
> +}
> +
>  GEN_VXFORM(vmuloub, 4, 0);
>  GEN_VXFORM(vmulouh, 4, 1);
>  GEN_VXFORM(vmulouw, 4, 2);
> @@ -1348,8 +1437,6 @@ GEN_VXFORM_NOA(vupkhsw, 7, 25);
>  GEN_VXFORM_NOA(vupklsb, 7, 10);
>  GEN_VXFORM_NOA(vupklsh, 7, 11);
>  GEN_VXFORM_NOA(vupklsw, 7, 27);
> -GEN_VXFORM_NOA(vupkhpx, 7, 13);
> -GEN_VXFORM_NOA(vupklpx, 7, 15);
>  GEN_VXFORM_NOA_ENV(vrefp, 5, 4);
>  GEN_VXFORM_NOA_ENV(vrsqrtefp, 5, 5);
>  GEN_VXFORM_NOA_ENV(vexptefp, 5, 6);
> --
> 2.7.4
>
>
>

[-- Attachment #2: Type: text/html, Size: 10203 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 2/3] target/ppc: Optimize emulation of vpkpx instruction
  2019-10-17 11:27 ` [PATCH v7 2/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
@ 2019-10-19 19:35   ` Aleksandar Markovic
  0 siblings, 0 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2019-10-19 19:35 UTC (permalink / raw)
  To: Stefan Brankovic; +Cc: richard.henderson, qemu-devel, david

[-- Attachment #1: Type: text/plain, Size: 7551 bytes --]

On Thursday, October 17, 2019, Stefan Brankovic <stefan.brankovic@rt-rk.com>
wrote:

> Optimize altivec instruction vpkpx (Vector Pack Pixel).
> Rearranges 8 pixels coded in 6-5-5 pattern (4 from each source register)
> into contigous array of bits in the destination register.
>
> In each iteration of outer loop, the instruction is to be done with
> the 6-5-5 pack for 2 pixels of each doubleword element of each
> source register. The first thing to be done in outer loop is
> choosing which doubleword element of which register is to be used
> in current iteration and it is to be placed in avr variable. The
> next step is to perform 6-5-5 pack of pixels on avr variable in inner
> for loop(2 iterations, 1 for each pixel) and save result in tmp variable.
> In the end of outer for loop, the result is merged in variable called
> result and saved in appropriate doubleword element of vD if the whole
> doubleword is finished(every second iteration). The outer loop has 4
> iterations.
>
>
Check spelling.

Use single quotation marks around variable names and other code elements.

avr variable-> variable 'avr' (and several similar instances)
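
Since the commit message is being discussed: the 6-5-5 pack itself is easiest to check against the deleted helper's per-pixel transform, restated here as a standalone plain-C sketch (the function name is illustrative):

```c
#include <stdint.h>

/* Pack one 32-bit 8-8-8-8 pixel into the 16-bit 6-5-5 form, exactly
 * as the deleted helper_vpkpx computed it per 32-bit element. */
static uint16_t pack_pixel_655(uint32_t e)
{
    return (uint16_t)(((e >> 9) & 0xfc00) |
                      ((e >> 6) & 0x03e0) |
                      ((e >> 3) & 0x001f));
}
```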


> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |  1 -
>  target/ppc/int_helper.c             | 21 --------
>  target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++
> ++++++-
>  3 files changed, 98 insertions(+), 23 deletions(-)
>
> diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> index 281e54f..b489b38 100644
> --- a/target/ppc/helper.h
> +++ b/target/ppc/helper.h
> @@ -258,7 +258,6 @@ DEF_HELPER_4(vpkudus, void, env, avr, avr, avr)
>  DEF_HELPER_4(vpkuhum, void, env, avr, avr, avr)
>  DEF_HELPER_4(vpkuwum, void, env, avr, avr, avr)
>  DEF_HELPER_4(vpkudum, void, env, avr, avr, avr)
> -DEF_HELPER_3(vpkpx, void, avr, avr, avr)
>  DEF_HELPER_5(vmhaddshs, void, env, avr, avr, avr, avr)
>  DEF_HELPER_5(vmhraddshs, void, env, avr, avr, avr, avr)
>  DEF_HELPER_5(vmsumuhm, void, env, avr, avr, avr, avr)
> diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
> index cd00f5e..f910c11 100644
> --- a/target/ppc/int_helper.c
> +++ b/target/ppc/int_helper.c
> @@ -1262,27 +1262,6 @@ void helper_vpmsumd(ppc_avr_t *r, ppc_avr_t *a,
> ppc_avr_t *b)
>  #else
>  #define PKBIG 0
>  #endif
> -void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
> -{
> -    int i, j;
> -    ppc_avr_t result;
> -#if defined(HOST_WORDS_BIGENDIAN)
> -    const ppc_avr_t *x[2] = { a, b };
> -#else
> -    const ppc_avr_t *x[2] = { b, a };
> -#endif
> -
> -    VECTOR_FOR_INORDER_I(i, u64) {
> -        VECTOR_FOR_INORDER_I(j, u32) {
> -            uint32_t e = x[i]->u32[j];
> -
> -            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
> -                                     ((e >> 6) & 0x3e0) |
> -                                     ((e >> 3) & 0x1f));
> -        }
> -    }
> -    *r = result;
> -}
>
>  #define VPK(suffix, from, to, cvt, dosat)                               \
>      void helper_vpk##suffix(CPUPPCState *env, ppc_avr_t *r,             \
> diff --git a/target/ppc/translate/vmx-impl.inc.c
> b/target/ppc/translate/vmx-impl.inc.c
> index a428ef3..3550ffa 100644
> --- a/target/ppc/translate/vmx-impl.inc.c
> +++ b/target/ppc/translate/vmx-impl.inc.c
> @@ -579,6 +579,103 @@ static void trans_lvsr(DisasContext *ctx)
>  }
>
>  /*
> + * vpkpx VRT,VRA,VRB - Vector Pack Pixel
> + *
> + * Rearranges 8 pixels coded in 6-5-5 pattern (4 from each source
> register)
> + * into contigous array of bits in the destination register.
> + */
> +static void trans_vpkpx(DisasContext *ctx)
> +{
> +    int VT = rD(ctx->opcode);
> +    int VA = rA(ctx->opcode);
> +    int VB = rB(ctx->opcode);
> +    TCGv_i64 tmp = tcg_temp_new_i64();
> +    TCGv_i64 shifted = tcg_temp_new_i64();
> +    TCGv_i64 avr = tcg_temp_new_i64();
> +    TCGv_i64 result = tcg_temp_new_i64();
> +    TCGv_i64 result1 = tcg_temp_new_i64();
> +    TCGv_i64 result2 = tcg_temp_new_i64();


'result2' is not needed; 'result' can be used in the final half instead, all
the way up to the copying into the destination.


> +    int64_t mask1 = 0x1fULL;
> +    int64_t mask2 = 0x1fULL << 5;
> +    int64_t mask3 = 0x3fULL << 10;
> +    int i, j;
> +    /*
> +     * In each iteration do the 6-5-5 pack for 2 pixels of each doubleword
> +     * element of each source register.
> +     */
> +    for (i = 0; i < 4; i++) {
> +        switch (i) {
> +        case 0:
> +            /*
> +             * Get high doubleword of vA to perform 6-5-5 pack of pixels
> +             * 1 and 2.
> +             */
> +            get_avr64(avr, VA, true);
> +            tcg_gen_movi_i64(result, 0x0ULL);
> +            break;
> +        case 1:
> +            /*
> +             * Get low doubleword of vA to perform 6-5-5 pack of pixels
> +             * 3 and 4.
> +             */
> +            get_avr64(avr, VA, false);
> +            break;
> +        case 2:
> +            /*
> +             * Get high doubleword of vB to perform 6-5-5 pack of pixels
> +             * 5 and 6.
> +             */
> +            get_avr64(avr, VB, true);
> +            tcg_gen_movi_i64(result, 0x0ULL);
> +            break;
> +        case 3:
> +            /*
> +             * Get low doubleword of vB to perform 6-5-5 pack of pixels
> +             * 7 and 8.
> +             */
> +            get_avr64(avr, VB, false);
> +            break;
> +        }
> +        /* Perform the packing for 2 pixels(each iteration for 1). */
> +        tcg_gen_movi_i64(tmp, 0x0ULL);
> +        for (j = 0; j < 2; j++) {
> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
> +            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
> +            tcg_gen_or_i64(tmp, tmp, shifted);
> +
> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
> +            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
> +            tcg_gen_or_i64(tmp, tmp, shifted);
> +
> +            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
> +            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
> +            tcg_gen_or_i64(tmp, tmp, shifted);
> +        }
> +        if ((i == 0) || (i == 2)) {
> +            tcg_gen_shli_i64(tmp, tmp, 32);
> +        }
> +        tcg_gen_or_i64(result, result, tmp);
> +        if (i == 1) {
> +            /* Place packed pixels 1:4 to high doubleword of vD. */
> +            tcg_gen_mov_i64(result1, result);
> +        }
> +        if (i == 3) {
> +            /* Place packed pixels 5:8 to low doubleword of vD. */
> +            tcg_gen_mov_i64(result2, result);
> +        }


If 'result2' is removed, the last tcg movement is not needed...


> +    }
> +    set_avr64(VT, result1, true);
> +    set_avr64(VT, result2, false);


... and here 'result' should be instead of  'result2'.

A.


> +
> +    tcg_temp_free_i64(tmp);
> +    tcg_temp_free_i64(shifted);
> +    tcg_temp_free_i64(avr);
> +    tcg_temp_free_i64(result);
> +    tcg_temp_free_i64(result1);
> +    tcg_temp_free_i64(result2);
> +}
> +
> +/*
>   * vsl VRT,VRA,VRB - Vector Shift Left
>   *
>   * Shifting left 128 bit value of vA by value specified in bits 125-127
> of vB.
> @@ -1063,7 +1160,7 @@ GEN_VXFORM_ENV(vpksdus, 7, 21);
>  GEN_VXFORM_ENV(vpkshss, 7, 6);
>  GEN_VXFORM_ENV(vpkswss, 7, 7);
>  GEN_VXFORM_ENV(vpksdss, 7, 23);
> -GEN_VXFORM(vpkpx, 7, 12);
> +GEN_VXFORM_TRANS(vpkpx, 7, 12);
>  GEN_VXFORM_ENV(vsum4ubs, 4, 24);
>  GEN_VXFORM_ENV(vsum4sbs, 4, 28);
>  GEN_VXFORM_ENV(vsum4shs, 4, 25);
> --
> 2.7.4
>
>
>

[-- Attachment #2: Type: text/html, Size: 9704 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 1/3] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-10-17 11:27 ` [PATCH v7 1/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
@ 2019-10-19 19:51   ` Aleksandar Markovic
  0 siblings, 0 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2019-10-19 19:51 UTC (permalink / raw)
  To: Stefan Brankovic; +Cc: richard.henderson, qemu-devel, david

[-- Attachment #1: Type: text/plain, Size: 10089 bytes --]

On Thursday, October 17, 2019, Stefan Brankovic <stefan.brankovic@rt-rk.com>
wrote:

> Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
> This instruction counts the number of leading zeros of each halfword
> element
> in source register and places result in the appropriate halfword element of
> destination register.
>
> In each iteration of outer for loop count operation is performed on one
> doubleword element of source register vB. In the first iteration, higher
> doubleword element of vB is placed in variable avr, and then counting
> for every halfword element is performed by  using tcg_gen_clzi_i64.
> Since it counts leading zeros on 64 bit lenght, ith byte element has to
> be moved to the highest 16 bits of tmp, or-ed with mask(in order to get all
> ones in lowest 48 bits), then perform tcg_gen_clzi_i64 and move it's result
> in appropriate halfword element of result. This is done in inner for loop.
> After the operation is finished, the result is saved in the appropriate
> doubleword element of destination register vD. The same sequence of orders
> is to be applied again for the  lower doubleword element of vB.
>
> Optimize Altivec instruction vclzb (Vector Count Leading Zeros Byte).
> This instruction counts the number of leading zeros of each byte element
> in source register and places result in the appropriate byte element of
> destination register.
>
> In each iteration of the outer for loop, counting operation is done on one
> doubleword element of source register vB. In the first iteration, the
> higher doubleword element of vB is placed in variable avr, and then
> counting
> for every byte element is performed using tcg_gen_clzi_i64. Since it counts
> leading zeros on 64 bit lenght, ith byte element has to be moved to the
> highest
> 8 bits of variable  tmp, ored with mask(in order to get all ones in the
> lowest
> 56 bits), then perform tcg_gen_clzi_i64 and move it's result in the
> appropriate
> byte element of result. This is done in inner for loop. After the
> operation is
> finished, the result is saved in the  appropriate doubleword element of
> destination
> register vD. The same sequence of orders is to be applied again for the
> lower
> doubleword element of vB.
>
>
The same hints as for the commit message of patch 2/3.

Additionally, the first and the third paragraph should be merged into a
single one at the beginning of the commit message.


Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |   2 -
>  target/ppc/int_helper.c             |   9 ---
>  target/ppc/translate/vmx-impl.inc.c | 136 ++++++++++++++++++++++++++++++
> +++++-
>  3 files changed, 134 insertions(+), 13 deletions(-)
>
> diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> index f843814..281e54f 100644
> --- a/target/ppc/helper.h
> +++ b/target/ppc/helper.h
> @@ -308,8 +308,6 @@ DEF_HELPER_4(vcfsx, void, env, avr, avr, i32)
>  DEF_HELPER_4(vctuxs, void, env, avr, avr, i32)
>  DEF_HELPER_4(vctsxs, void, env, avr, avr, i32)
>
> -DEF_HELPER_2(vclzb, void, avr, avr)
> -DEF_HELPER_2(vclzh, void, avr, avr)
>  DEF_HELPER_2(vctzb, void, avr, avr)
>  DEF_HELPER_2(vctzh, void, avr, avr)
>  DEF_HELPER_2(vctzw, void, avr, avr)
> diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
> index 6d238b9..cd00f5e 100644
> --- a/target/ppc/int_helper.c
> +++ b/target/ppc/int_helper.c
> @@ -1817,15 +1817,6 @@ VUPK(lsw, s64, s32, UPKLO)
>          }                                                               \
>      }
>
> -#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
> -#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
> -
> -VGENERIC_DO(clzb, u8)
> -VGENERIC_DO(clzh, u16)
> -
> -#undef clzb
> -#undef clzh
> -
>  #define ctzb(v) ((v) ? ctz32(v) : 8)
>  #define ctzh(v) ((v) ? ctz32(v) : 16)
>  #define ctzw(v) ctz32((v))
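
For context, the removed clzb/clzh semantics — what the TCG translation must now reproduce per lane — can be restated as a standalone plain-C sketch (names are illustrative; a zero lane yields the lane width):

```c
#include <stdint.h>

/* Count leading zero bits within an 8-bit lane; 8 for a zero lane. */
static int clzb_model(uint8_t v)
{
    int n = 0;
    for (uint8_t bit = 0x80; bit != 0 && (v & bit) == 0; bit >>= 1) {
        n++;
    }
    return n;
}

/* Count leading zero bits within a 16-bit lane; 16 for a zero lane. */
static int clzh_model(uint16_t v)
{
    int n = 0;
    for (uint16_t bit = 0x8000; bit != 0 && (v & bit) == 0; bit >>= 1) {
        n++;
    }
    return n;
}
```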
> diff --git a/target/ppc/translate/vmx-impl.inc.c
> b/target/ppc/translate/vmx-impl.inc.c
> index 2472a52..a428ef3 100644
> --- a/target/ppc/translate/vmx-impl.inc.c
> +++ b/target/ppc/translate/vmx-impl.inc.c
> @@ -751,6 +751,138 @@ static void trans_vgbbd(DisasContext *ctx)
>  }
>
>  /*
> + * vclzb VRT,VRB - Vector Count Leading Zeros Byte
> + *
> + * Counting the number of leading zero bits of each byte element in source
> + * register and placing result in appropriate byte element of destination
> + * register.
> + */
> +static void trans_vclzb(DisasContext *ctx)
> +{
> +    int VT = rD(ctx->opcode);
> +    int VB = rB(ctx->opcode);
> +    TCGv_i64 avr = tcg_temp_new_i64();
> +    TCGv_i64 result = tcg_temp_new_i64();
> +    TCGv_i64 result1 = tcg_temp_new_i64();
> +    TCGv_i64 result2 = tcg_temp_new_i64();
> +    TCGv_i64 tmp = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0xffffffffffffffULL);
> +    int i, j;
> +
> +    for (i = 0; i < 2; i++) {
> +        if (i == 0) {
> +            /* Get high doubleword of vB in 'avr'. */
> +            get_avr64(avr, VB, true);
> +        } else {
> +            /* Get low doubleword of vB in 'avr'. */
> +            get_avr64(avr, VB, false);
> +        }
> +        /*
> +         * Perform count for every byte element using 'tcg_gen_clzi_i64'.
> +         * Since it counts leading zeros on 64 bit lenght, we have to move
> +         * ith byte element to highest 8 bits of 'tmp', or it with
> mask(so we
> +         * get all ones in lowest 56 bits), then perform
> 'tcg_gen_clzi_i64' and
> +         * move it's result in appropriate byte element of result.
> +         */
> +        tcg_gen_shli_i64(tmp, avr, 56);


before this line, insert a blank line and the comment: /* count leading
zeroes for bits 0..7 */


> +        tcg_gen_or_i64(tmp, tmp, mask);
> +        tcg_gen_clzi_i64(result, tmp, 64);
> +        for (j = 1; j < 7; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);


before this line, insert comment: /* count leading zeroes for bits
8*j..8*j+7  */
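
The masking idea these comments describe generalizes to every byte lane; outside of TCG it can be modeled like this (a sketch with illustrative names): shift the lane to the top byte and force the low 56 bits to ones, so a 64-bit count can never run past the 8-bit lane.

```c
#include <stdint.h>

/* Plain count-leading-zeros over 64 bits (returns 64 for v == 0). */
static int clz64_model(uint64_t v)
{
    int n = 0;
    for (uint64_t bit = 1ULL << 63; bit != 0 && (v & bit) == 0; bit >>= 1) {
        n++;
    }
    return n;
}

/* clz of byte lane 'lane' (0 = most significant byte) of 'dw': the OR
 * with ones in the low 56 bits caps the count at 8, mirroring the
 * tcg_gen_shli_i64 / tcg_gen_or_i64 / tcg_gen_clzi_i64 sequence. */
static int clz_byte_lane(uint64_t dw, int lane)
{
    uint64_t t = (dw << (lane * 8)) | 0x00ffffffffffffffULL;
    return clz64_model(t);
}
```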


> +            tcg_gen_or_i64(tmp, tmp, mask);
> +            tcg_gen_clzi_i64(tmp, tmp, 64);
> +            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
> +        }
> +        tcg_gen_or_i64(tmp, avr, mask);


before this line, insert comment: /* count leading zeroes for bits 56..63
 */


> +        tcg_gen_clzi_i64(tmp, tmp, 64);
> +        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
> +        if (i == 0) {
> +            /* Place result in high doubleword element of vD. */
> +            tcg_gen_mov_i64(result1, result);
> +        } else {
> +            /* Place result in low doubleword element of vD. */
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }
> +
> +    set_avr64(VT, result1, true);
> +    set_avr64(VT, result2, false);
> +
> +    tcg_temp_free_i64(avr);
> +    tcg_temp_free_i64(result);
> +    tcg_temp_free_i64(result1);
> +    tcg_temp_free_i64(result2);
> +    tcg_temp_free_i64(tmp);
> +    tcg_temp_free_i64(mask);
> +}
> +
> +/*
> + * vclzh VRT,VRB - Vector Count Leading Zeros Halfword


Similar hints as in the case of the previous function.


> + *
> + * Counting the number of leading zero bits of each halfword element in
> source
> + * register and placing result in appropriate halfword element of
> destination
> + * register.
> + */
> +static void trans_vclzh(DisasContext *ctx)
> +{
> +    int VT = rD(ctx->opcode);
> +    int VB = rB(ctx->opcode);
> +    TCGv_i64 avr = tcg_temp_new_i64();
> +    TCGv_i64 result = tcg_temp_new_i64();
> +    TCGv_i64 result1 = tcg_temp_new_i64();
> +    TCGv_i64 result2 = tcg_temp_new_i64();
> +    TCGv_i64 tmp = tcg_temp_new_i64();
> +    TCGv_i64 mask = tcg_const_i64(0xffffffffffffULL);
> +    int i, j;
> +
> +    for (i = 0; i < 2; i++) {
> +        if (i == 0) {
> +            /* Get high doubleword element of vB in 'avr'. */
> +            get_avr64(avr, VB, true);
> +        } else {
> +            /* Get low doubleword element of vB in 'avr'. */
> +            get_avr64(avr, VB, false);
> +        }
> +        /*
> +         * Perform count for every halfword element using
> 'tcg_gen_clzi_i64'.
> +         * Since it counts leading zeros on 64 bit lenght, we have to move
> +         * ith byte element to highest 16 bits of 'tmp', or it with
> mask(so we
> +         * get all ones in lowest 48 bits), then perform
> 'tcg_gen_clzi_i64' and
> +         * move it's result in appropriate halfword element of result.
> +         */
> +        tcg_gen_shli_i64(tmp, avr, 48);
> +        tcg_gen_or_i64(tmp, tmp, mask);
> +        tcg_gen_clzi_i64(result, tmp, 64);
> +        for (j = 1; j < 3; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (3 - j) * 16);
> +            tcg_gen_or_i64(tmp, tmp, mask);
> +            tcg_gen_clzi_i64(tmp, tmp, 64);
> +            tcg_gen_deposit_i64(result, result, tmp, j * 16, 16);
> +        }
> +        tcg_gen_or_i64(tmp, avr, mask);
> +        tcg_gen_clzi_i64(tmp, tmp, 64);
> +        tcg_gen_deposit_i64(result, result, tmp, 48, 16);
> +        if (i == 0) {
> +            /* Place result in high doubleword element of vD. */
> +            tcg_gen_mov_i64(result1, result);
> +        } else {
> +            /* Place result in low doubleword element of vD. */
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }
> +
> +    set_avr64(VT, result1, true);
> +    set_avr64(VT, result2, false);
> +
> +    tcg_temp_free_i64(avr);
> +    tcg_temp_free_i64(result);
> +    tcg_temp_free_i64(result1);
> +    tcg_temp_free_i64(result2);
> +    tcg_temp_free_i64(tmp);
> +    tcg_temp_free_i64(mask);
> +}
> +
> +/*
>   * vclzw VRT,VRB - Vector Count Leading Zeros Word
>   *
>   * Counting the number of leading zero bits of each word element in source
> @@ -1315,8 +1447,8 @@ GEN_VAFORM_PAIRED(vmsumshm, vmsumshs, 20)
>  GEN_VAFORM_PAIRED(vsel, vperm, 21)
>  GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
>
> -GEN_VXFORM_NOA(vclzb, 1, 28)
> -GEN_VXFORM_NOA(vclzh, 1, 29)
> +GEN_VXFORM_TRANS(vclzb, 1, 28)
> +GEN_VXFORM_TRANS(vclzh, 1, 29)
>  GEN_VXFORM_TRANS(vclzw, 1, 30)
>  GEN_VXFORM_TRANS(vclzd, 1, 31)
>  GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
> --
> 2.7.4
>
>
>

[-- Attachment #2: Type: text/html, Size: 12818 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions
  2019-10-17 11:27 [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
                   ` (2 preceding siblings ...)
  2019-10-17 11:27 ` [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions Stefan Brankovic
@ 2019-10-19 19:54 ` Aleksandar Markovic
  3 siblings, 0 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2019-10-19 19:54 UTC (permalink / raw)
  To: Stefan Brankovic; +Cc: richard.henderson, qemu-devel, david

[-- Attachment #1: Type: text/plain, Size: 2269 bytes --]

On Thursday, October 17, 2019, Stefan Brankovic <stefan.brankovic@rt-rk.com>
wrote:

> Optimize emulation of ten Altivec instructions: lvsl, lvsr, vsl, vsr,
> vpkpx,
> vgbbd, vclzb, vclzh, vclzw, vclzd, vupkhpx and vupklpx.
>
>
ten -> twelve



> This series buils up on and complements


>
buils -> builds


>
recent work of Thomas Murta, Mark
> Cave-Ayland and Richard Henderson in the same area. It is based on
> devising TCG
> translation implementation for selected instructions rather than using
> helpers.
> The selected instructions are most of the time idiosyncratic to ppc
> platform,
> so relatively complex TCG translation (without direct mapping to host
> instruction that is not possible in these cases) seems to be the best
> option,
> and that approach is presented in this series. The performance improvements
> are significant in all cases.
>
> V7:
>
> Added optimization for vupkhpx and vupklpx instructions.
>
> V6:
>
> Rebased series to the latest qemu code.
> Excluded all patches that are already accepted.
>
> V5:
>
> Fixed vpkpx bug and added it back in patch.
> Fixed graphical distortions on OSX 10.3 and 10.4.
> Removed conversion of vmrgh and vmrgl instructions to vector operations for
> further investigation.
>
> V4:
>
> Addressed Richard's Henderson's suggestions.
> Removed vpkpx's optimization for further investigation on graphical
> distortions
> it caused on OSX 10.2-4 guests.
> Added opcodes for vector vmrgh(b|h|w) and vmrgl(b|h|w) in tcg.
> Implemented vector vmrgh and vmrgl instructions for i386.
> Converted vmrgh and vmrgl instructions to vector operations.
>
> V3:
>
> Fixed problem during build.
>
> V2:
>
> Addressed Richard's Henderson's suggestions.
> Fixed problem during build on patch 2/8.
> Rebased series to the latest qemu code.
>
> Stefan Brankovic (3):
>   target/ppc: Optimize emulation of vclzh and vclzb instructions
>   target/ppc: Optimize emulation of vpkpx instruction
>   target/ppc: Optimize emulation of vupkhpx and vupklpx instructions
>
>  target/ppc/helper.h                 |   5 -
>  target/ppc/int_helper.c             |  50 ------
>  target/ppc/translate/vmx-impl.inc.c | 326 ++++++++++++++++++++++++++++++
> +++++-
>  3 files changed, 321 insertions(+), 60 deletions(-)
>
> --
> 2.7.4
>
>
>

[-- Attachment #2: Type: text/html, Size: 3361 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions
  2019-10-17 11:27 ` [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions Stefan Brankovic
  2019-10-19 19:24   ` Aleksandar Markovic
@ 2019-10-19 20:06   ` Aleksandar Markovic
  2019-10-19 20:40   ` Aleksandar Markovic
  2 siblings, 0 replies; 11+ messages in thread
From: Aleksandar Markovic @ 2019-10-19 20:06 UTC (permalink / raw)
  To: Stefan Brankovic; +Cc: richard.henderson, qemu-devel, david

[-- Attachment #1: Type: text/plain, Size: 7243 bytes --]

On Thursday, October 17, 2019, Stefan Brankovic <stefan.brankovic@rt-rk.com>
wrote:

> 'trans_vupkpx' function implements both vupkhpx and vupklpx instructions
> with
> argument 'high' determine which instruction is processed. Instructions are
> implemented in two 'for' loops. Outer 'for' loop repeats unpacking two
> times,
> since both doubleword elements of destination register are formed the same
> way.
> It also stores result of every iteration in temporary register, that is
> later
> transferred to destination register. Inner 'for' loop does unpacking of
> pixels
> and forms resulting doubleword 32 by 32 bits.
>
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |  2 -
>  target/ppc/int_helper.c             | 20 --------
>  target/ppc/translate/vmx-impl.inc.c | 91 ++++++++++++++++++++++++++++++
> ++++++-
>  3 files changed, 89 insertions(+), 24 deletions(-)
>
> diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> index b489b38..fd06b56 100644
> --- a/target/ppc/helper.h
> +++ b/target/ppc/helper.h
> @@ -233,8 +233,6 @@ DEF_HELPER_2(vextsh2d, void, avr, avr)
>  DEF_HELPER_2(vextsw2d, void, avr, avr)
>  DEF_HELPER_2(vnegw, void, avr, avr)
>  DEF_HELPER_2(vnegd, void, avr, avr)
> -DEF_HELPER_2(vupkhpx, void, avr, avr)
> -DEF_HELPER_2(vupklpx, void, avr, avr)
>  DEF_HELPER_2(vupkhsb, void, avr, avr)
>  DEF_HELPER_2(vupkhsh, void, avr, avr)
>  DEF_HELPER_2(vupkhsw, void, avr, avr)
> diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
> index f910c11..9ee667d 100644
> --- a/target/ppc/int_helper.c
> +++ b/target/ppc/int_helper.c
> @@ -1737,26 +1737,6 @@ void helper_vsum4ubs(CPUPPCState *env, ppc_avr_t
> *r, ppc_avr_t *a, ppc_avr_t *b)
>  #define UPKHI 0
>  #define UPKLO 1
>  #endif
> -#define VUPKPX(suffix, hi)                                              \
> -    void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
> -    {                                                                   \
> -        int i;                                                          \
> -        ppc_avr_t result;                                               \
> -                                                                        \
> -        for (i = 0; i < ARRAY_SIZE(r->u32); i++) {                      \
> -            uint16_t e = b->u16[hi ? i : i + 4];                        \
> -            uint8_t a = (e >> 15) ? 0xff : 0;                           \
> -            uint8_t r = (e >> 10) & 0x1f;                               \
> -            uint8_t g = (e >> 5) & 0x1f;                                \
> -            uint8_t b = e & 0x1f;                                       \
> -                                                                        \
> -            result.u32[i] = (a << 24) | (r << 16) | (g << 8) | b;       \
> -        }                                                               \
> -        *r = result;                                                    \
> -    }
> -VUPKPX(lpx, UPKLO)
> -VUPKPX(hpx, UPKHI)
> -#undef VUPKPX
>
>  #define VUPK(suffix, unpacked, packee, hi)                              \
>      void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
> diff --git a/target/ppc/translate/vmx-impl.inc.c
> b/target/ppc/translate/vmx-impl.inc.c
> index 3550ffa..09d80d6 100644
> --- a/target/ppc/translate/vmx-impl.inc.c
> +++ b/target/ppc/translate/vmx-impl.inc.c
> @@ -1031,6 +1031,95 @@ static void trans_vclzd(DisasContext *ctx)
>      tcg_temp_free_i64(avr);
>  }
>
> +/*
> + * vupkhpx VRT,VRB - Vector Unpack High Pixel
> + * vupklpx VRT,VRB - Vector Unpack Low Pixel
> + *
> + * Unpacks 4 pixels coded in 1-5-5-5 pattern from high/low doubleword
> element
> + * of source register into contigous array of bits in the destination
> register.
> + * Argument 'high' determines if high or low doubleword element of source
> + * register is processed.
> + */
> +static void trans_vupkpx(DisasContext *ctx, int high)
> +{
> +    int VT = rD(ctx->opcode);
> +    int VB = rB(ctx->opcode);
> +    TCGv_i64 tmp = tcg_temp_new_i64();
> +    TCGv_i64 avr = tcg_temp_new_i64();
> +    TCGv_i64 result = tcg_temp_new_i64();
> +    TCGv_i64 result1 = tcg_temp_new_i64();
> +    TCGv_i64 result2 = tcg_temp_new_i64();
> +    int64_t mask1 = 0x1fULL;
> +    int64_t mask2 = 0x1fULL << 8;
> +    int64_t mask3 = 0x1fULL << 16;
> +    int64_t mask4 = 0xffULL << 56;
> +    int i, j;
> +
> +    if (high == 1) {
> +        get_avr64(avr, VB, true);
> +    } else {
> +        get_avr64(avr, VB, false);
> +    }
> +
> +    tcg_gen_movi_i64(result, 0x0ULL);
> +    for (i = 0; i < 2; i++) {
> +        for (j = 0; j < 2; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask1 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shli_i64(tmp, avr, 3 + (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask2 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shli_i64(tmp, avr, 6 + (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask3 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shri_i64(tmp, avr, (j * 16));
> +            tcg_gen_ext16s_i64(tmp, tmp);
> +            tcg_gen_andi_i64(tmp, tmp, mask4);
> +            tcg_gen_shri_i64(tmp, tmp, (32 * (1 - j)));
> +            tcg_gen_or_i64(result, result, tmp);
> +        }
> +        if (i == 0) {
> +            tcg_gen_mov_i64(result1, result);
> +            tcg_gen_movi_i64(result, 0x0ULL);
> +            tcg_gen_shri_i64(avr, avr, 32);
> +        }
> +        if (i == 1) {
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }
> +
> +    set_avr64(VT, result1, false);
> +    set_avr64(VT, result2, true);
> +
> +    tcg_temp_free_i64(tmp);
> +    tcg_temp_free_i64(avr);
> +    tcg_temp_free_i64(result);
> +    tcg_temp_free_i64(result1);
> +    tcg_temp_free_i64(result2);
> +}
> +
> +static void gen_vupkhpx(DisasContext *ctx)
> +{
> +    if (unlikely(!ctx->altivec_enabled)) {
> +        gen_exception(ctx, POWERPC_EXCP_VPU);
> +        return;
> +    }
> +    trans_vupkpx(ctx, 1);
> +}
> +
> +static void gen_vupklpx(DisasContext *ctx)
> +{
> +    if (unlikely(!ctx->altivec_enabled)) {
> +        gen_exception(ctx, POWERPC_EXCP_VPU);
> +        return;
> +    }
> +    trans_vupkpx(ctx, 0);
> +}
> +
>  GEN_VXFORM(vmuloub, 4, 0);
>  GEN_VXFORM(vmulouh, 4, 1);
>  GEN_VXFORM(vmulouw, 4, 2);
> @@ -1348,8 +1437,6 @@ GEN_VXFORM_NOA(vupkhsw, 7, 25);
>  GEN_VXFORM_NOA(vupklsb, 7, 10);
>  GEN_VXFORM_NOA(vupklsh, 7, 11);
>  GEN_VXFORM_NOA(vupklsw, 7, 27);
> -GEN_VXFORM_NOA(vupkhpx, 7, 13);
> -GEN_VXFORM_NOA(vupklpx, 7, 15);


There is an inconsistency here compared to your previous patches. There
should be the lines:

GEN_VXFORM_TRANS(vupkhpx, 7, 13);
GEN_VXFORM_TRANS(vupklpx, 7, 15);

and there should be two new functions, trans_vupkhpx() and trans_vupklpx(),
defined as thin wrappers around trans_vupkpx(); gen_vupkhpx() and
gen_vupklpx() should be deleted.
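Assuming GEN_VXFORM_TRANS expands, as elsewhere in vmx-impl.inc.c, to a gen_* stub that performs the altivec-enabled check and then dispatches to the matching trans_* function, the suggested shape would be roughly the sketch below (not a drop-in patch; it relies on QEMU's DisasContext and macros):

```c
/* Sketch only -- depends on QEMU's GEN_VXFORM_TRANS macro, which is
 * assumed to emit a gen_* stub doing the altivec_enabled check before
 * calling the trans_* function. Not compilable on its own. */
static void trans_vupkhpx(DisasContext *ctx)
{
    trans_vupkpx(ctx, true);    /* unpack the high doubleword */
}

static void trans_vupklpx(DisasContext *ctx)
{
    trans_vupkpx(ctx, false);   /* unpack the low doubleword */
}

/* ...and in the opcode table section: */
GEN_VXFORM_TRANS(vupkhpx, 7, 13);
GEN_VXFORM_TRANS(vupklpx, 7, 15);
```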


>  GEN_VXFORM_NOA_ENV(vrefp, 5, 4);
>  GEN_VXFORM_NOA_ENV(vrsqrtefp, 5, 5);
>  GEN_VXFORM_NOA_ENV(vexptefp, 5, 6);
> --
> 2.7.4
>
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions
  2019-10-17 11:27 ` [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions Stefan Brankovic
  2019-10-19 19:24   ` Aleksandar Markovic
  2019-10-19 20:06   ` Aleksandar Markovic
@ 2019-10-19 20:40   ` Aleksandar Markovic
  2019-10-21 12:42     ` Stefan Brankovic
  2 siblings, 1 reply; 11+ messages in thread
From: Aleksandar Markovic @ 2019-10-19 20:40 UTC (permalink / raw)
  To: Stefan Brankovic; +Cc: richard.henderson, qemu-devel, david


On Thursday, October 17, 2019, Stefan Brankovic <stefan.brankovic@rt-rk.com> wrote:

> 'trans_vupkpx' function implements both vupkhpx and vupklpx instructions,
> with argument 'high' determining which instruction is processed. The
> instructions are implemented with two 'for' loops. The outer 'for' loop
> repeats unpacking two times, since both doubleword elements of the
> destination register are formed the same way. It also stores the result of
> every iteration in a temporary register that is later transferred to the
> destination register. The inner 'for' loop unpacks the pixels and forms
> the resulting doubleword 32 bits at a time.
>
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |  2 -
>  target/ppc/int_helper.c             | 20 --------
>  target/ppc/translate/vmx-impl.inc.c | 91 ++++++++++++++++++++++++++++++++++++-
>  3 files changed, 89 insertions(+), 24 deletions(-)
>
> diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> index b489b38..fd06b56 100644
> --- a/target/ppc/helper.h
> +++ b/target/ppc/helper.h
> @@ -233,8 +233,6 @@ DEF_HELPER_2(vextsh2d, void, avr, avr)
>  DEF_HELPER_2(vextsw2d, void, avr, avr)
>  DEF_HELPER_2(vnegw, void, avr, avr)
>  DEF_HELPER_2(vnegd, void, avr, avr)
> -DEF_HELPER_2(vupkhpx, void, avr, avr)
> -DEF_HELPER_2(vupklpx, void, avr, avr)
>  DEF_HELPER_2(vupkhsb, void, avr, avr)
>  DEF_HELPER_2(vupkhsh, void, avr, avr)
>  DEF_HELPER_2(vupkhsw, void, avr, avr)
> diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
> index f910c11..9ee667d 100644
> --- a/target/ppc/int_helper.c
> +++ b/target/ppc/int_helper.c
> @@ -1737,26 +1737,6 @@ void helper_vsum4ubs(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
>  #define UPKHI 0
>  #define UPKLO 1
>  #endif
> -#define VUPKPX(suffix, hi)                                              \
> -    void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
> -    {                                                                   \
> -        int i;                                                          \
> -        ppc_avr_t result;                                               \
> -                                                                        \
> -        for (i = 0; i < ARRAY_SIZE(r->u32); i++) {                      \
> -            uint16_t e = b->u16[hi ? i : i + 4];                        \
> -            uint8_t a = (e >> 15) ? 0xff : 0;                           \
> -            uint8_t r = (e >> 10) & 0x1f;                               \
> -            uint8_t g = (e >> 5) & 0x1f;                                \
> -            uint8_t b = e & 0x1f;                                       \
> -                                                                        \
> -            result.u32[i] = (a << 24) | (r << 16) | (g << 8) | b;       \
> -        }                                                               \
> -        *r = result;                                                    \
> -    }
> -VUPKPX(lpx, UPKLO)
> -VUPKPX(hpx, UPKHI)
> -#undef VUPKPX
>
>  #define VUPK(suffix, unpacked, packee, hi)                              \
>      void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
> diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
> index 3550ffa..09d80d6 100644
> --- a/target/ppc/translate/vmx-impl.inc.c
> +++ b/target/ppc/translate/vmx-impl.inc.c
> @@ -1031,6 +1031,95 @@ static void trans_vclzd(DisasContext *ctx)
>      tcg_temp_free_i64(avr);
>  }
>
> +/*
> + * vupkhpx VRT,VRB - Vector Unpack High Pixel
> + * vupklpx VRT,VRB - Vector Unpack Low Pixel
> + *
> + * Unpacks 4 pixels coded in 1-5-5-5 pattern from high/low doubleword element
> + * of source register into contiguous array of bits in the destination register.
> + * Argument 'high' determines if high or low doubleword element of source
> + * register is processed.
> + */
> +static void trans_vupkpx(DisasContext *ctx, int high)


The last argument should be boolean.


> +{
> +    int VT = rD(ctx->opcode);
> +    int VB = rB(ctx->opcode);
> +    TCGv_i64 tmp = tcg_temp_new_i64();
> +    TCGv_i64 avr = tcg_temp_new_i64();
> +    TCGv_i64 result = tcg_temp_new_i64();
> +    TCGv_i64 result1 = tcg_temp_new_i64();
> +    TCGv_i64 result2 = tcg_temp_new_i64();
> +    int64_t mask1 = 0x1fULL;
> +    int64_t mask2 = 0x1fULL << 8;
> +    int64_t mask3 = 0x1fULL << 16;
> +    int64_t mask4 = 0xffULL << 56;
> +    int i, j;
> +
> +    if (high == 1) {
> +        get_avr64(avr, VB, true);
> +    } else {
> +        get_avr64(avr, VB, false);
> +    }
> +
> +    tcg_gen_movi_i64(result, 0x0ULL);
> +    for (i = 0; i < 2; i++) {
> +        for (j = 0; j < 2; j++) {
> +            tcg_gen_shli_i64(tmp, avr, (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask1 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shli_i64(tmp, avr, 3 + (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask2 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shli_i64(tmp, avr, 6 + (j * 16));
> +            tcg_gen_andi_i64(tmp, tmp, mask3 << (j * 32));
> +            tcg_gen_or_i64(result, result, tmp);
> +
> +            tcg_gen_shri_i64(tmp, avr, (j * 16));
> +            tcg_gen_ext16s_i64(tmp, tmp);
> +            tcg_gen_andi_i64(tmp, tmp, mask4);
> +            tcg_gen_shri_i64(tmp, tmp, (32 * (1 - j)));
> +            tcg_gen_or_i64(result, result, tmp);
> +        }
> +        if (i == 0) {
> +            tcg_gen_mov_i64(result1, result);
> +            tcg_gen_movi_i64(result, 0x0ULL);
> +            tcg_gen_shri_i64(avr, avr, 32);
> +        }
> +        if (i == 1) {
> +            tcg_gen_mov_i64(result2, result);
> +        }
> +    }
> +
> +    set_avr64(VT, result1, false);
> +    set_avr64(VT, result2, true);
> +
> +    tcg_temp_free_i64(tmp);
> +    tcg_temp_free_i64(avr);
> +    tcg_temp_free_i64(result);
> +    tcg_temp_free_i64(result1);
> +    tcg_temp_free_i64(result2);
> +}
> +
> +static void gen_vupkhpx(DisasContext *ctx)
> +{
> +    if (unlikely(!ctx->altivec_enabled)) {
> +        gen_exception(ctx, POWERPC_EXCP_VPU);
> +        return;
> +    }
> +    trans_vupkpx(ctx, 1);
> +}
> +
> +static void gen_vupklpx(DisasContext *ctx)
> +{
> +    if (unlikely(!ctx->altivec_enabled)) {
> +        gen_exception(ctx, POWERPC_EXCP_VPU);
> +        return;
> +    }
> +    trans_vupkpx(ctx, 0);
> +}
> +
>  GEN_VXFORM(vmuloub, 4, 0);
>  GEN_VXFORM(vmulouh, 4, 1);
>  GEN_VXFORM(vmulouw, 4, 2);
> @@ -1348,8 +1437,6 @@ GEN_VXFORM_NOA(vupkhsw, 7, 25);
>  GEN_VXFORM_NOA(vupklsb, 7, 10);
>  GEN_VXFORM_NOA(vupklsh, 7, 11);
>  GEN_VXFORM_NOA(vupklsw, 7, 27);
> -GEN_VXFORM_NOA(vupkhpx, 7, 13);
> -GEN_VXFORM_NOA(vupklpx, 7, 15);
>  GEN_VXFORM_NOA_ENV(vrefp, 5, 4);
>  GEN_VXFORM_NOA_ENV(vrsqrtefp, 5, 5);
>  GEN_VXFORM_NOA_ENV(vexptefp, 5, 6);
> --
> 2.7.4
>
>
>



* Re: [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions
  2019-10-19 20:40   ` Aleksandar Markovic
@ 2019-10-21 12:42     ` Stefan Brankovic
  0 siblings, 0 replies; 11+ messages in thread
From: Stefan Brankovic @ 2019-10-21 12:42 UTC (permalink / raw)
  To: Aleksandar Markovic; +Cc: richard.henderson, qemu-devel, david


Hello Aleksandar,

Thank you for taking a look at this patch. I will start working on
version 8 of the patch, where I will address all your suggestions.

Kind Regards,

Stefan

On 19.10.19. 22:40, Aleksandar Markovic wrote:
>
>
> On Thursday, October 17, 2019, Stefan Brankovic <stefan.brankovic@rt-rk.com> wrote:
>
>     'trans_vupkpx' function implements both vupkhpx and vupklpx
>     instructions, with argument 'high' determining which instruction is
>     processed. The instructions are implemented with two 'for' loops.
>     The outer 'for' loop repeats unpacking two times, since both
>     doubleword elements of the destination register are formed the same
>     way. It also stores the result of every iteration in a temporary
>     register that is later transferred to the destination register. The
>     inner 'for' loop unpacks the pixels and forms the resulting
>     doubleword 32 bits at a time.
>
>     Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
>     ---
>      target/ppc/helper.h                 |  2 -
>      target/ppc/int_helper.c             | 20 --------
>      target/ppc/translate/vmx-impl.inc.c | 91 ++++++++++++++++++++++++++++++++++++-
>      3 files changed, 89 insertions(+), 24 deletions(-)
>
>     diff --git a/target/ppc/helper.h b/target/ppc/helper.h
>     index b489b38..fd06b56 100644
>     --- a/target/ppc/helper.h
>     +++ b/target/ppc/helper.h
>     @@ -233,8 +233,6 @@ DEF_HELPER_2(vextsh2d, void, avr, avr)
>      DEF_HELPER_2(vextsw2d, void, avr, avr)
>      DEF_HELPER_2(vnegw, void, avr, avr)
>      DEF_HELPER_2(vnegd, void, avr, avr)
>     -DEF_HELPER_2(vupkhpx, void, avr, avr)
>     -DEF_HELPER_2(vupklpx, void, avr, avr)
>      DEF_HELPER_2(vupkhsb, void, avr, avr)
>      DEF_HELPER_2(vupkhsh, void, avr, avr)
>      DEF_HELPER_2(vupkhsw, void, avr, avr)
>     diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
>     index f910c11..9ee667d 100644
>     --- a/target/ppc/int_helper.c
>     +++ b/target/ppc/int_helper.c
>     @@ -1737,26 +1737,6 @@ void helper_vsum4ubs(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
>      #define UPKHI 0
>      #define UPKLO 1
>      #endif
>     -#define VUPKPX(suffix, hi)                                              \
>     -    void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
>     -    {                                                                   \
>     -        int i;                                                          \
>     -        ppc_avr_t result;                                               \
>     -                                                                        \
>     -        for (i = 0; i < ARRAY_SIZE(r->u32); i++) {                      \
>     -            uint16_t e = b->u16[hi ? i : i + 4];                        \
>     -            uint8_t a = (e >> 15) ? 0xff : 0;                           \
>     -            uint8_t r = (e >> 10) & 0x1f;                               \
>     -            uint8_t g = (e >> 5) & 0x1f;                                \
>     -            uint8_t b = e & 0x1f;                                       \
>     -                                                                        \
>     -            result.u32[i] = (a << 24) | (r << 16) | (g << 8) | b;       \
>     -        }                                                               \
>     -        *r = result;                                                    \
>     -    }
>     -VUPKPX(lpx, UPKLO)
>     -VUPKPX(hpx, UPKHI)
>     -#undef VUPKPX
>
>      #define VUPK(suffix, unpacked, packee, hi)                              \
>          void helper_vupk##suffix(ppc_avr_t *r, ppc_avr_t *b)                \
>     diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
>     index 3550ffa..09d80d6 100644
>     --- a/target/ppc/translate/vmx-impl.inc.c
>     +++ b/target/ppc/translate/vmx-impl.inc.c
>     @@ -1031,6 +1031,95 @@ static void trans_vclzd(DisasContext *ctx)
>          tcg_temp_free_i64(avr);
>      }
>
>     +/*
>     + * vupkhpx VRT,VRB - Vector Unpack High Pixel
>     + * vupklpx VRT,VRB - Vector Unpack Low Pixel
>     + *
>     + * Unpacks 4 pixels coded in 1-5-5-5 pattern from high/low doubleword element
>     + * of source register into contiguous array of bits in the destination register.
>     + * Argument 'high' determines if high or low doubleword element of source
>     + * register is processed.
>     + */
>     +static void trans_vupkpx(DisasContext *ctx, int high)
>
>
> The last argument should be boolean.
>
>     +{
>     +    int VT = rD(ctx->opcode);
>     +    int VB = rB(ctx->opcode);
>     +    TCGv_i64 tmp = tcg_temp_new_i64();
>     +    TCGv_i64 avr = tcg_temp_new_i64();
>     +    TCGv_i64 result = tcg_temp_new_i64();
>     +    TCGv_i64 result1 = tcg_temp_new_i64();
>     +    TCGv_i64 result2 = tcg_temp_new_i64();
>     +    int64_t mask1 = 0x1fULL;
>     +    int64_t mask2 = 0x1fULL << 8;
>     +    int64_t mask3 = 0x1fULL << 16;
>     +    int64_t mask4 = 0xffULL << 56;
>     +    int i, j;
>     +
>     +    if (high == 1) {
>     +        get_avr64(avr, VB, true);
>     +    } else {
>     +        get_avr64(avr, VB, false);
>     +    }
>     +
>     +    tcg_gen_movi_i64(result, 0x0ULL);
>     +    for (i = 0; i < 2; i++) {
>     +        for (j = 0; j < 2; j++) {
>     +            tcg_gen_shli_i64(tmp, avr, (j * 16));
>     +            tcg_gen_andi_i64(tmp, tmp, mask1 << (j * 32));
>     +            tcg_gen_or_i64(result, result, tmp);
>     +
>     +            tcg_gen_shli_i64(tmp, avr, 3 + (j * 16));
>     +            tcg_gen_andi_i64(tmp, tmp, mask2 << (j * 32));
>     +            tcg_gen_or_i64(result, result, tmp);
>     +
>     +            tcg_gen_shli_i64(tmp, avr, 6 + (j * 16));
>     +            tcg_gen_andi_i64(tmp, tmp, mask3 << (j * 32));
>     +            tcg_gen_or_i64(result, result, tmp);
>     +
>     +            tcg_gen_shri_i64(tmp, avr, (j * 16));
>     +            tcg_gen_ext16s_i64(tmp, tmp);
>     +            tcg_gen_andi_i64(tmp, tmp, mask4);
>     +            tcg_gen_shri_i64(tmp, tmp, (32 * (1 - j)));
>     +            tcg_gen_or_i64(result, result, tmp);
>     +        }
>     +        if (i == 0) {
>     +            tcg_gen_mov_i64(result1, result);
>     +            tcg_gen_movi_i64(result, 0x0ULL);
>     +            tcg_gen_shri_i64(avr, avr, 32);
>     +        }
>     +        if (i == 1) {
>     +            tcg_gen_mov_i64(result2, result);
>     +        }
>     +    }
>     +
>     +    set_avr64(VT, result1, false);
>     +    set_avr64(VT, result2, true);
>     +
>     +    tcg_temp_free_i64(tmp);
>     +    tcg_temp_free_i64(avr);
>     +    tcg_temp_free_i64(result);
>     +    tcg_temp_free_i64(result1);
>     +    tcg_temp_free_i64(result2);
>     +}
>     +
>     +static void gen_vupkhpx(DisasContext *ctx)
>     +{
>     +    if (unlikely(!ctx->altivec_enabled)) {
>     +        gen_exception(ctx, POWERPC_EXCP_VPU);
>     +        return;
>     +    }
>     +    trans_vupkpx(ctx, 1);
>     +}
>     +
>     +static void gen_vupklpx(DisasContext *ctx)
>     +{
>     +    if (unlikely(!ctx->altivec_enabled)) {
>     +        gen_exception(ctx, POWERPC_EXCP_VPU);
>     +        return;
>     +    }
>     +    trans_vupkpx(ctx, 0);
>     +}
>     +
>      GEN_VXFORM(vmuloub, 4, 0);
>      GEN_VXFORM(vmulouh, 4, 1);
>      GEN_VXFORM(vmulouw, 4, 2);
>     @@ -1348,8 +1437,6 @@ GEN_VXFORM_NOA(vupkhsw, 7, 25);
>      GEN_VXFORM_NOA(vupklsb, 7, 10);
>      GEN_VXFORM_NOA(vupklsh, 7, 11);
>      GEN_VXFORM_NOA(vupklsw, 7, 27);
>     -GEN_VXFORM_NOA(vupkhpx, 7, 13);
>     -GEN_VXFORM_NOA(vupklpx, 7, 15);
>      GEN_VXFORM_NOA_ENV(vrefp, 5, 4);
>      GEN_VXFORM_NOA_ENV(vrsqrtefp, 5, 5);
>      GEN_VXFORM_NOA_ENV(vexptefp, 5, 6);
>     -- 
>     2.7.4
>
>



end of thread, other threads:[~2019-10-21 12:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-17 11:27 [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
2019-10-17 11:27 ` [PATCH v7 1/3] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
2019-10-19 19:51   ` Aleksandar Markovic
2019-10-17 11:27 ` [PATCH v7 2/3] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
2019-10-19 19:35   ` Aleksandar Markovic
2019-10-17 11:27 ` [PATCH v7 3/3] target/ppc: Optimize emulation of vupkhpx and vupklpx instructions Stefan Brankovic
2019-10-19 19:24   ` Aleksandar Markovic
2019-10-19 20:06   ` Aleksandar Markovic
2019-10-19 20:40   ` Aleksandar Markovic
2019-10-21 12:42     ` Stefan Brankovic
2019-10-19 19:54 ` [PATCH v7 0/3] target/ppc: Optimize emulation of some Altivec instructions Aleksandar Markovic
