qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions
@ 2019-06-19 11:03 Stefan Brankovic
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize emulation of ten Altivec instructions: lvsl, lvsr, vsl, vsr, vpkpx,
vgbbd, vclzb, vclzh, vclzw and vclzd.

This series builds on and complements recent work by Thomas Murta, Mark
Cave-Ayland and Richard Henderson in the same area. It implements the selected
instructions directly as TCG translations rather than through helpers.
Most of the selected instructions are idiosyncratic to the ppc platform and
cannot be mapped directly to a host instruction, so a relatively complex TCG
translation seems to be the best option, and that is the approach presented
in this series. The performance improvements are significant in all cases.

V2:

Addressed Richard Henderson's suggestions.
Fixed problem during build on patch 2/8.
Rebased series to the latest qemu code.

Stefan Brankovic (8):
  target/ppc: Optimize emulation of lvsl and lvsr instructions
  target/ppc: Optimize emulation of vsl and vsr instructions
  target/ppc: Optimize emulation of vpkpx instruction
  target/ppc: Optimize emulation of vgbbd instruction
  target/ppc: Optimize emulation of vclzd instruction
  target/ppc: Optimize emulation of vclzw instruction
  target/ppc: Optimize emulation of vclzh and vclzb instructions
  target/ppc: Refactor emulation of vmrgew and vmrgow instructions

 target/ppc/helper.h                 |  10 -
 target/ppc/int_helper.c             | 365 ---------------------
 target/ppc/translate/vmx-impl.inc.c | 633 ++++++++++++++++++++++++++++++++----
 3 files changed, 564 insertions(+), 444 deletions(-)

-- 
2.7.4




* [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-26 15:28   ` Richard Henderson
  2019-06-26 15:48   ` Richard Henderson
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions Stefan Brankovic
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Add a simple macro that calls the TCG implementation of the appropriate
instruction if Altivec support is active.

Optimize the Altivec instruction lvsl (Load Vector for Shift Left).
It places bytes sh:sh+15 of the value 0x00 || 0x01 || 0x02 || ... || 0x1E ||
0x1F in the destination register. sh is calculated by adding the 2 source
registers and taking bits 60-63 of the result.

First, bits [28-31] of EA are placed in variable sh. After that,
the bytes are created in the following way:
sh:(sh+7) of X (from the description) by multiplying sh by
0x0101010101010101 and then adding 0x0001020304050607 to the result. The
value obtained is placed in the higher doubleword element of vD.
(sh+8):(sh+15) by adding 0x08090a0b0c0d0e0f to the result of the previous
multiplication. The value obtained is placed in the lower doubleword element
of vD.
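
The arithmetic above can be sketched in plain C, outside of TCG. This is
only a model of the byte pattern under the assumption that the vector is held
as two 64-bit halves, most significant byte first; the names vec128 and
lvsl_pattern are illustrative and do not exist in QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 128-bit value: hi is the higher doubleword, lo the lower. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} vec128;

/* Model of the lvsl result for a given effective address (hypothetical). */
static vec128 lvsl_pattern(uint64_t ea)
{
    uint64_t sh = ea & 0xf;
    /* Replicate sh into each of the 8 bytes of a doubleword. */
    uint64_t rep = sh * 0x0101010101010101ULL;
    vec128 r;

    assert(sh < 16);
    /*
     * Every byte sum is at most 0x0f + 0x0f = 0x1e, so no carry can
     * propagate from one byte into the next.
     */
    r.hi = rep + 0x0001020304050607ULL; /* bytes sh..sh+7 of X */
    r.lo = rep + 0x08090a0b0c0d0e0fULL; /* bytes sh+8..sh+15 of X */
    return r;
}
```

For example, an address whose low nibble is 3 gives
hi = 0x030405060708090a and lo = 0x0b0c0d0e0f101112. A single multiply and
two adds work only because no byte can overflow.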

Optimize the Altivec instruction lvsr (Load Vector for Shift Right).
It places bytes 16-sh:31-sh of the value 0x00 || 0x01 || 0x02 || ... ||
0x1E || 0x1F in the destination register. sh is calculated by adding the 2
source registers and taking bits 60-63 of the result.

First, bits [28-31] of EA are placed in variable sh. After that,
the bytes are created in the following way:
(16-sh):(23-sh) of X (from the description) by multiplying sh by
0x0101010101010101 and then subtracting the result from 0x1011121314151617.
The value obtained is placed in the higher doubleword element of vD.
(24-sh):(31-sh) by subtracting the result of the previous multiplication
from 0x18191a1b1c1d1e1f. The value obtained is placed in the lower doubleword
element of vD.
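
The lvsr arithmetic can be modelled the same way in plain C. As before, this
is a sketch that assumes the vector is held as two 64-bit halves, most
significant byte first; vec128 and lvsr_pattern are illustrative names, not
QEMU identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 128-bit value: hi is the higher doubleword, lo the lower. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} vec128;

/* Model of the lvsr result for a given effective address (hypothetical). */
static vec128 lvsr_pattern(uint64_t ea)
{
    uint64_t sh = ea & 0xf;
    /* Replicate sh into each of the 8 bytes of a doubleword. */
    uint64_t rep = sh * 0x0101010101010101ULL;
    vec128 r;

    assert(sh < 16);
    /*
     * Every byte of the minuend is at least 0x10 and sh is at most 0x0f,
     * so no borrow can propagate from one byte into the next.
     */
    r.hi = 0x1011121314151617ULL - rep; /* bytes 16-sh..23-sh of X */
    r.lo = 0x18191a1b1c1d1e1fULL - rep; /* bytes 24-sh..31-sh of X */
    return r;
}
```

For an address whose low nibble is 3 this gives hi = 0x0d0e0f1011121314 and
lo = 0x15161718191a1b1c, that is, bytes 13 to 28 of X.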

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |   2 -
 target/ppc/int_helper.c             |  18 ------
 target/ppc/translate/vmx-impl.inc.c | 120 ++++++++++++++++++++++++++----------
 3 files changed, 88 insertions(+), 52 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 02b67a3..c82105e 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -189,8 +189,6 @@ DEF_HELPER_2(vprtybw, void, avr, avr)
 DEF_HELPER_2(vprtybd, void, avr, avr)
 DEF_HELPER_2(vprtybq, void, avr, avr)
 DEF_HELPER_3(vsubcuw, void, avr, avr, avr)
-DEF_HELPER_2(lvsl, void, avr, tl)
-DEF_HELPER_2(lvsr, void, avr, tl)
 DEF_HELPER_FLAGS_5(vaddsbs, TCG_CALL_NO_RWG, void, avr, avr, avr, avr, i32)
 DEF_HELPER_FLAGS_5(vaddshs, TCG_CALL_NO_RWG, void, avr, avr, avr, avr, i32)
 DEF_HELPER_FLAGS_5(vaddsws, TCG_CALL_NO_RWG, void, avr, avr, avr, avr, i32)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 8ce89f2..9505f4c 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -457,24 +457,6 @@ SATCVT(sd, uw, int64_t, uint32_t, 0, UINT32_MAX)
 #undef SATCVT
 #undef SATCVTU
 
-void helper_lvsl(ppc_avr_t *r, target_ulong sh)
-{
-    int i, j = (sh & 0xf);
-
-    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
-        r->VsrB(i) = j++;
-    }
-}
-
-void helper_lvsr(ppc_avr_t *r, target_ulong sh)
-{
-    int i, j = 0x10 - (sh & 0xf);
-
-    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
-        r->VsrB(i) = j++;
-    }
-}
-
 void helper_mtvscr(CPUPPCState *env, uint32_t vscr)
 {
     env->vscr = vscr & ~(1u << VSCR_SAT);
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 663275b..482e2ee 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -142,38 +142,6 @@ GEN_VR_STVE(bx, 0x07, 0x04, 1);
 GEN_VR_STVE(hx, 0x07, 0x05, 2);
 GEN_VR_STVE(wx, 0x07, 0x06, 4);
 
-static void gen_lvsl(DisasContext *ctx)
-{
-    TCGv_ptr rd;
-    TCGv EA;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    EA = tcg_temp_new();
-    gen_addr_reg_index(ctx, EA);
-    rd = gen_avr_ptr(rD(ctx->opcode));
-    gen_helper_lvsl(rd, EA);
-    tcg_temp_free(EA);
-    tcg_temp_free_ptr(rd);
-}
-
-static void gen_lvsr(DisasContext *ctx)
-{
-    TCGv_ptr rd;
-    TCGv EA;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    EA = tcg_temp_new();
-    gen_addr_reg_index(ctx, EA);
-    rd = gen_avr_ptr(rD(ctx->opcode));
-    gen_helper_lvsr(rd, EA);
-    tcg_temp_free(EA);
-    tcg_temp_free_ptr(rd);
-}
-
 static void gen_mfvscr(DisasContext *ctx)
 {
     TCGv_i32 t;
@@ -316,6 +284,16 @@ static void glue(gen_, name)(DisasContext *ctx)                         \
     tcg_temp_free_ptr(rd);                                              \
 }
 
+#define GEN_VXFORM_TRANS(name, opc2, opc3)                              \
+static void glue(gen_, name)(DisasContext *ctx)                         \
+{                                                                       \
+    if (unlikely(!ctx->altivec_enabled)) {                              \
+        gen_exception(ctx, POWERPC_EXCP_VPU);                           \
+        return;                                                         \
+    }                                                                   \
+    trans_##name(ctx);                                                  \
+}
+
 #define GEN_VXFORM_ENV(name, opc2, opc3)                                \
 static void glue(gen_, name)(DisasContext *ctx)                         \
 {                                                                       \
@@ -515,6 +493,82 @@ static void gen_vmrgow(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
+/*
+ * lvsl VRT,RA,RB - Load Vector for Shift Left
+ *
+ * Let the EA be the sum (rA|0)+(rB). Let sh=EA[28–31].
+ * Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F.
+ * Bytes sh:sh+15 of X are placed into vD.
+ */
+static void trans_lvsl(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv EA = tcg_temp_new();
+
+    /* Get sh(from description) by anding EA with 0xf. */
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_ext_i32_i64(sh, EA);
+    tcg_gen_andi_i64(sh, sh, 0xfULL);
+
+    /*
+     * Create bytes sh:sh+7 of X(from description) and place them in
+     * higher doubleword of vD.
+     */
+    tcg_gen_muli_i64(sh, sh, 0x0101010101010101ULL);
+    tcg_gen_addi_i64(result, sh, 0x0001020304050607ull);
+    set_avr64(VT, result, true);
+    /*
+     * Create bytes sh+8:sh+15 of X(from description) and place them in
+     * lower doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 0x08090a0b0c0d0e0fULL);
+    set_avr64(VT, result, false);
+
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free(EA);
+}
+
+/*
+ * lvsr VRT,RA,RB - Load Vector for Shift Right
+ *
+ * Let the EA be the sum (rA|0)+(rB). Let sh=EA[28–31].
+ * Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F.
+ * Bytes (16-sh):(31-sh) of X are placed into vD.
+ */
+static void trans_lvsr(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv EA = tcg_temp_new();
+
+
+    /* Get sh(from description) by anding EA with 0xf. */
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_ext_i32_i64(sh, EA);
+    tcg_gen_andi_i64(sh, sh, 0xfULL);
+    /*
+     * Create bytes (16-sh):(23-sh) of X(from description) and place them in
+     * higher doubleword of vD.
+     */
+    tcg_gen_muli_i64(sh, sh, 0x0101010101010101ULL);
+    tcg_gen_subfi_i64(result, 0x1011121314151617ULL, sh);
+    set_avr64(VT, result, true);
+    /*
+     * Create bytes (24-sh):(31-sh) of X(from description) and place them in
+     * lower doubleword of vD.
+     */
+    tcg_gen_subfi_i64(result, 0x18191a1b1c1d1e1fULL, sh);
+    set_avr64(VT, result, false);
+
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free(EA);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -662,6 +716,8 @@ GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
 GEN_VXFORM_HETRO(vextubrx, 6, 28)
 GEN_VXFORM_HETRO(vextuhrx, 6, 29)
 GEN_VXFORM_HETRO(vextuwrx, 6, 30)
+GEN_VXFORM_TRANS(lvsl, 6, 31)
+GEN_VXFORM_TRANS(lvsr, 6, 32)
 GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \
                 vextuwrx, PPC_NONE, PPC2_ISA300)
 
-- 
2.7.4




* [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-26 15:34   ` Richard Henderson
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 3/8] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instructions vsl and vsr (Vector Shift Left/Right).
They shift the 128-bit value of register vA (left and right respectively)
by the amount specified in bits 125-127 of register vB. The lowest 3 bits
of each byte element of register vB must be identical or the result is
undefined.

For the vsl instruction, bits 125-127 of register vB are first saved in
variable sh. Then the highest sh bits of the lower doubleword element of
register vA are saved in variable shifted, so that those bits are not lost
when the lower doubleword element of register vA is shifted, which is the
next step. After the lower doubleword element has been shifted, the higher
doubleword element of vA is shifted and its lowest sh bits (which are
now 0) are replaced with the bits saved in shifted.

For the vsr instruction, bits 125-127 of register vB are first saved in
variable sh. Then the lowest sh bits of the higher doubleword element of
register vA are saved in variable shifted, so that those bits are not lost
when the higher doubleword element of register vA is shifted, which is the
next step. After the higher doubleword element has been shifted, the lower
doubleword element of vA is shifted and its highest sh bits (which are
now 0) are replaced with the bits saved in shifted.
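
The carry handling described above can be sketched in plain C on two 64-bit
halves. This is a model only: vec128, shl128 and shr128 are illustrative
names, and the sketch assumes all byte elements of vB agree on the shift
count. It guards sh == 0 explicitly, because a 64-bit shift by 64 is
undefined in C; a translation that computes 64 - sh at run time probably
needs similar care.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 128-bit value: hi is the higher doubleword, lo the lower. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} vec128;

/* Shift a 128-bit value left by sh bits, 0 <= sh <= 7 (hypothetical). */
static vec128 shl128(vec128 a, unsigned sh)
{
    vec128 r;
    uint64_t carry;

    assert(sh < 8);
    /* Highest sh bits of the lower half; nothing is carried when sh is 0. */
    carry = sh ? a.lo >> (64 - sh) : 0;
    r.lo = a.lo << sh;
    r.hi = (a.hi << sh) | carry;
    return r;
}

/* Shift a 128-bit value right by sh bits, 0 <= sh <= 7 (hypothetical). */
static vec128 shr128(vec128 a, unsigned sh)
{
    vec128 r;
    uint64_t carry;

    assert(sh < 8);
    /* Lowest sh bits of the higher half, moved to the top of the lower. */
    carry = sh ? a.hi << (64 - sh) : 0;
    r.hi = a.hi >> sh;
    r.lo = (a.lo >> sh) | carry;
    return r;
}
```

Shifting { hi = 0, lo = 0x8000000000000001 } left by 1 moves the top bit of
the lower half into bit 0 of the higher half, giving { hi = 1, lo = 2 }.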

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |  2 -
 target/ppc/int_helper.c             | 35 -------------
 target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 97 insertions(+), 39 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index c82105e..33dad6a 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -213,8 +213,6 @@ DEF_HELPER_3(vrlb, void, avr, avr, avr)
 DEF_HELPER_3(vrlh, void, avr, avr, avr)
 DEF_HELPER_3(vrlw, void, avr, avr, avr)
 DEF_HELPER_3(vrld, void, avr, avr, avr)
-DEF_HELPER_3(vsl, void, avr, avr, avr)
-DEF_HELPER_3(vsr, void, avr, avr, avr)
 DEF_HELPER_4(vsldoi, void, avr, avr, avr, i32)
 DEF_HELPER_3(vextractub, void, avr, avr, i32)
 DEF_HELPER_3(vextractuh, void, avr, avr, i32)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 9505f4c..a23853e 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1738,41 +1738,6 @@ VEXTU_X_DO(vextuhrx, 16, 0)
 VEXTU_X_DO(vextuwrx, 32, 0)
 #undef VEXTU_X_DO
 
-/*
- * The specification says that the results are undefined if all of the
- * shift counts are not identical.  We check to make sure that they
- * are to conform to what real hardware appears to do.
- */
-#define VSHIFT(suffix, leftp)                                           \
-    void helper_vs##suffix(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)    \
-    {                                                                   \
-        int shift = b->VsrB(15) & 0x7;                                  \
-        int doit = 1;                                                   \
-        int i;                                                          \
-                                                                        \
-        for (i = 0; i < ARRAY_SIZE(r->u8); i++) {                       \
-            doit = doit && ((b->u8[i] & 0x7) == shift);                 \
-        }                                                               \
-        if (doit) {                                                     \
-            if (shift == 0) {                                           \
-                *r = *a;                                                \
-            } else if (leftp) {                                         \
-                uint64_t carry = a->VsrD(1) >> (64 - shift);            \
-                                                                        \
-                r->VsrD(0) = (a->VsrD(0) << shift) | carry;             \
-                r->VsrD(1) = a->VsrD(1) << shift;                       \
-            } else {                                                    \
-                uint64_t carry = a->VsrD(0) << (64 - shift);            \
-                                                                        \
-                r->VsrD(1) = (a->VsrD(1) >> shift) | carry;             \
-                r->VsrD(0) = a->VsrD(0) >> shift;                       \
-            }                                                           \
-        }                                                               \
-    }
-VSHIFT(l, 1)
-VSHIFT(r, 0)
-#undef VSHIFT
-
 void helper_vslv(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 {
     int i;
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 482e2ee..e7ae979 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -569,6 +569,101 @@ static void trans_lvsr(DisasContext *ctx)
     tcg_temp_free(EA);
 }
 
+/*
+ * vsl VRT,VRA,VRB - Vector Shift Left
+ *
+ * Shifting left 128 bit value of vA by value specified in bits 125-127 of vB.
+ * Lowest 3 bits in each byte element of register vB must be identical or
+ * result is undefined.
+ */
+static void trans_vsl(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avrA = tcg_temp_new_i64();
+    TCGv_i64 avrB = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+
+    /* Place bits 125-127 of vB in sh. */
+    get_avr64(avrB, VB, false);
+    tcg_gen_andi_i64(sh, avrB, 0x07ULL);
+
+    /*
+     * Save highest sh bits of lower doubleword element of vA in variable
+     * shifted and perform shift on lower doubleword.
+     */
+    get_avr64(avrA, VA, false);
+    tcg_gen_subfi_i64(tmp, 64, sh);
+    tcg_gen_shr_i64(shifted, avrA, tmp);
+    tcg_gen_shl_i64(avrA, avrA, sh);
+    set_avr64(VT, avrA, false);
+
+    /*
+     * Perform shift on higher doubleword element of vA and replace lowest
+     * sh bits with shifted.
+     */
+    get_avr64(avrA, VA, true);
+    tcg_gen_shl_i64(avrA, avrA, sh);
+    tcg_gen_or_i64(avrA, avrA, shifted);
+    set_avr64(VT, avrA, true);
+
+    tcg_temp_free_i64(avrA);
+    tcg_temp_free_i64(avrB);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(tmp);
+}
+
+/*
+ * vsr VRT,VRA,VRB - Vector Shift Right
+ *
+ * Shifting right 128 bit value of vA by value specified in bits 125-127 of vB.
+ * Lowest 3 bits in each byte element of register vB must be identical or
+ * result is undefined.
+ */
+static void trans_vsr(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avrA = tcg_temp_new_i64();
+    TCGv_i64 avrB = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+
+    /* Place bits 125-127 of vB in sh. */
+    get_avr64(avrB, VB, false);
+    tcg_gen_andi_i64(sh, avrB, 0x07ULL);
+
+    /*
+     * Save lowest sh bits of higher doubleword element of vA in variable
+     * shifted and perform shift on higher doubleword.
+     */
+    get_avr64(avrA, VA, true);
+    tcg_gen_subfi_i64(tmp, 64, sh);
+    tcg_gen_shl_i64(shifted, avrA, tmp);
+    tcg_gen_shr_i64(avrA, avrA, sh);
+    set_avr64(VT, avrA, true);
+    /*
+     * Perform shift on lower doubleword element of vA and replace highest
+     * sh bits with shifted.
+     */
+    get_avr64(avrA, VA, false);
+    tcg_gen_shr_i64(avrA, avrA, sh);
+    tcg_gen_or_i64(avrA, avrA, shifted);
+    set_avr64(VT, avrA, false);
+
+    tcg_temp_free_i64(avrA);
+    tcg_temp_free_i64(avrB);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(tmp);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -681,11 +776,11 @@ GEN_VXFORM(vrld, 2, 3);
 GEN_VXFORM(vrldmi, 2, 3);
 GEN_VXFORM_DUAL(vrld, PPC_NONE, PPC2_ALTIVEC_207, \
                 vrldmi, PPC_NONE, PPC2_ISA300)
-GEN_VXFORM(vsl, 2, 7);
+GEN_VXFORM_TRANS(vsl, 2, 7);
 GEN_VXFORM(vrldnm, 2, 7);
 GEN_VXFORM_DUAL(vsl, PPC_ALTIVEC, PPC_NONE, \
                 vrldnm, PPC_NONE, PPC2_ISA300)
-GEN_VXFORM(vsr, 2, 11);
+GEN_VXFORM_TRANS(vsr, 2, 11);
 GEN_VXFORM_ENV(vpkuhum, 7, 0);
 GEN_VXFORM_ENV(vpkuwum, 7, 1);
 GEN_VXFORM_ENV(vpkudum, 7, 17);
-- 
2.7.4




* [Qemu-devel] [PATCH 3/8] target/ppc: Optimize emulation of vpkpx instruction
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction Stefan Brankovic
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instruction vpkpx (Vector Pack Pixel).
It rearranges 8 pixels coded in 6-5-5 pattern (4 from each source register)
into a contiguous array of bits in the destination register.

Each iteration of the outer loop performs the 6-5-5 pack for the 2 pixels
of one doubleword element of one source register. The outer loop first
chooses which doubleword element of which register is used in the
current iteration and places it in the avr variable. The next step is
to perform the 6-5-5 pack of the pixels in avr in the inner for loop
(2 iterations, 1 for each pixel) and save the result in the tmp variable.
At the end of each outer iteration, tmp is merged into the variable called
result, which is saved in the appropriate doubleword element of vD once
the whole doubleword is finished (every second iteration). The outer loop
has 4 iterations.
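
The per-pixel and per-doubleword steps can be sketched in plain C. The shift
and mask values below are taken from the helper that this patch removes;
the function names pack_pixel, pack_doubleword and pack_pair are
illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* 6-5-5 pack of one 32-bit pixel into 16 bits. */
static uint16_t pack_pixel(uint32_t e)
{
    return (uint16_t)(((e >> 9) & 0xfc00) |
                      ((e >> 6) & 0x03e0) |
                      ((e >> 3) & 0x001f));
}

/* Pack the 2 pixels of one source doubleword into 32 bits. */
static uint32_t pack_doubleword(uint64_t dw)
{
    uint32_t hi = pack_pixel((uint32_t)(dw >> 32));
    uint32_t lo = pack_pixel((uint32_t)dw);

    return (hi << 16) | lo;
}

/*
 * Merge two packed source doublewords into one destination doubleword,
 * which is what happens every second iteration of the outer loop.
 */
static uint64_t pack_pair(uint64_t first, uint64_t second)
{
    return ((uint64_t)pack_doubleword(first) << 32) |
           pack_doubleword(second);
}
```

Under this model, the higher doubleword of vD would be
pack_pair(high doubleword of vA, low doubleword of vA), and the lower
doubleword of vD the same expression applied to vB.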

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |  1 -
 target/ppc/int_helper.c             | 21 ---------
 target/ppc/translate/vmx-impl.inc.c | 93 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 92 insertions(+), 23 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 33dad6a..7a3d68d 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -254,7 +254,6 @@ DEF_HELPER_4(vpkudus, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkuhum, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkuwum, void, env, avr, avr, avr)
 DEF_HELPER_4(vpkudum, void, env, avr, avr, avr)
-DEF_HELPER_3(vpkpx, void, avr, avr, avr)
 DEF_HELPER_5(vmhaddshs, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vmhraddshs, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vmsumuhm, void, env, avr, avr, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index a23853e..f397380 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1536,27 +1536,6 @@ void helper_vpmsumd(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 #else
 #define PKBIG 0
 #endif
-void helper_vpkpx(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
-{
-    int i, j;
-    ppc_avr_t result;
-#if defined(HOST_WORDS_BIGENDIAN)
-    const ppc_avr_t *x[2] = { a, b };
-#else
-    const ppc_avr_t *x[2] = { b, a };
-#endif
-
-    VECTOR_FOR_INORDER_I(i, u64) {
-        VECTOR_FOR_INORDER_I(j, u32) {
-            uint32_t e = x[i]->u32[j];
-
-            result.u16[4 * i + j] = (((e >> 9) & 0xfc00) |
-                                     ((e >> 6) & 0x3e0) |
-                                     ((e >> 3) & 0x1f));
-        }
-    }
-    *r = result;
-}
 
 #define VPK(suffix, from, to, cvt, dosat)                               \
     void helper_vpk##suffix(CPUPPCState *env, ppc_avr_t *r,             \
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index e7ae979..7d0c824 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -570,6 +570,97 @@ static void trans_lvsr(DisasContext *ctx)
 }
 
 /*
+ * vpkpx VRT,VRA,VRB - Vector Pack Pixel
+ *
+ * Rearranges 8 pixels coded in 6-5-5 pattern (4 from each source register)
+ * into contiguous array of bits in the destination register.
+ */
+static void trans_vpkpx(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    int64_t mask1 = 0x1fULL;
+    int64_t mask2 = 0x1fULL << 5;
+    int64_t mask3 = 0x3fULL << 10;
+    int i, j;
+    /*
+     * In each iteration do the 6-5-5 pack for 2 pixels of each doubleword
+     * element of each source register.
+     */
+    for (i = 0; i < 4; i++) {
+        switch (i) {
+        case 0:
+            /*
+             * Get high doubleword of vA to perform 6-5-5 pack of pixels
+             * 1 and 2.
+             */
+            get_avr64(avr, VA, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 1:
+            /*
+             * Get low doubleword of vA to perform 6-5-5 pack of pixels
+             * 3 and 4.
+             */
+            get_avr64(avr, VA, false);
+            break;
+        case 2:
+            /*
+             * Get high doubleword of vB to perform 6-5-5 pack of pixels
+             * 5 and 6.
+             */
+            get_avr64(avr, VB, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 3:
+            /*
+             * Get low doubleword of vB to perform 6-5-5 pack of pixels
+             * 7 and 8.
+             */
+            get_avr64(avr, VB, false);
+            break;
+        }
+        /* Perform the packing for 2 pixels(each iteration for 1). */
+        tcg_gen_movi_i64(tmp, 0x0ULL);
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
+            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
+            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
+            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+        }
+        if ((i == 0) || (i == 2)) {
+            tcg_gen_shli_i64(tmp, tmp, 32);
+        }
+        tcg_gen_or_i64(result, result, tmp);
+        if (i == 1) {
+            /* Place packed pixels 1:4 to high doubleword of vD. */
+            set_avr64(VT, result, true);
+        }
+        if (i == 3) {
+            /* Place packed pixels 5:8 to low doubleword of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+}
+
+/*
  * vsl VRT,VRA,VRB - Vector Shift Left
  *
  * Shifting left 128 bit value of vA by value specified in bits 125-127 of vB.
@@ -793,7 +884,7 @@ GEN_VXFORM_ENV(vpksdus, 7, 21);
 GEN_VXFORM_ENV(vpkshss, 7, 6);
 GEN_VXFORM_ENV(vpkswss, 7, 7);
 GEN_VXFORM_ENV(vpksdss, 7, 23);
-GEN_VXFORM(vpkpx, 7, 12);
+GEN_VXFORM_TRANS(vpkpx, 7, 12);
 GEN_VXFORM_ENV(vsum4ubs, 4, 24);
 GEN_VXFORM_ENV(vsum4sbs, 4, 28);
 GEN_VXFORM_ENV(vsum4shs, 4, 25);
-- 
2.7.4




* [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
                   ` (2 preceding siblings ...)
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 3/8] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-26 15:37   ` Richard Henderson
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction Stefan Brankovic
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instruction vgbbd (Vector Gather Bits by Bytes by
Doubleword). All ith bits (i in range 1 to 8) of each byte of a doubleword
element in the source register are concatenated and placed into the ith byte
of the appropriate doubleword element in the destination register.

The following solution handles both doubleword elements of the source
register in parallel, in order to reduce the number of instructions needed
(that's why arrays are used):
First, both doubleword elements of source register vB are placed in the
appropriate element of array avr. Bits are gathered in 2x8 iterations (2 for
loops). In the first iteration, bit 1 of byte 1, bit 2 of byte 2, ... bit 8
of byte 8 are already in their final spots, so avr[i], i={0,1} can be and-ed
with tcg_mask. For every following iteration, avr[i] and tcg_mask have to be
shifted right by 7 and 8 places, respectively, in order to get bit 1 of
byte 2, bit 2 of byte 3, ... bit 7 of byte 8 into their final spots, so that
the shifted avr values (saved in tmp) can be and-ed with the new value of
tcg_mask. After the first 8 iterations (first loop), all the first bits are
in their final places, all second bits except the second bit of the eighth
byte are in their places, and so on (only 1 eighth bit, the one from the
eighth byte, is in its place). In the second loop we do all operations
symmetrically, in order to get the other half of the bits into their final
spots. Results for the first and second doubleword elements are saved in
result[0] and result[1] respectively. In the end those results are saved in
the appropriate doubleword element of destination register vD.
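
The diagonal-by-diagonal gathering can be sketched in plain C for a single
doubleword. This is a model of the idea rather than the patch's code: the
function names are illustrative, and the diagonal constant
0x8040201008040201 is chosen for this sketch (the actual tcg_mask value is
not quoted in this message). Gathering bit i of every byte into byte i is
an 8x8 bit-matrix transpose, so a bit-by-bit reference is included for
comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Transpose the 8x8 bit matrix held in one doubleword, diagonal by diagonal. */
static uint64_t gather_bits_by_bytes(uint64_t in)
{
    /* Bit k of byte k for k = 0..7: these bits do not move. */
    const uint64_t diag = 0x8040201008040201ULL;
    uint64_t out = in & diag;
    int d;

    for (d = 1; d < 8; d++) {
        /*
         * Bits lying d steps off the main diagonal move by 7 * d positions,
         * while the mask selecting their final spots moves by 8 * d.
         */
        out |= (in << (7 * d)) & (diag << (8 * d));
        out |= (in >> (7 * d)) & (diag >> (8 * d));
    }
    return out;
}

/* Straightforward reference: bit b of byte r goes to bit r of byte b. */
static uint64_t gather_reference(uint64_t in)
{
    uint64_t out = 0;
    int r, b;

    for (r = 0; r < 8; r++) {
        for (b = 0; b < 8; b++) {
            if ((in >> (8 * r + b)) & 1) {
                out |= 1ULL << (8 * b + r);
            }
        }
    }
    return out;
}
```

The two shifts in the loop body correspond to the two symmetric loops in the
description: one handles the bits on one side of the main diagonal, the
other handles the bits on the opposite side.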

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/helper.h                 |   1 -
 target/ppc/int_helper.c             | 276 ------------------------------------
 target/ppc/translate/vmx-impl.inc.c |  77 +++++++++-
 3 files changed, 76 insertions(+), 278 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 7a3d68d..0aa1e05 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -319,7 +319,6 @@ DEF_HELPER_1(vclzlsbb, tl, avr)
 DEF_HELPER_1(vctzlsbb, tl, avr)
 DEF_HELPER_3(vbpermd, void, avr, avr, avr)
 DEF_HELPER_3(vbpermq, void, avr, avr, avr)
-DEF_HELPER_2(vgbbd, void, avr, avr)
 DEF_HELPER_3(vpmsumb, void, avr, avr, avr)
 DEF_HELPER_3(vpmsumh, void, avr, avr, avr)
 DEF_HELPER_3(vpmsumw, void, avr, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index f397380..1e32549 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1185,282 +1185,6 @@ void helper_vbpermq(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 #undef VBPERMQ_INDEX
 #undef VBPERMQ_DW
 
-static const uint64_t VGBBD_MASKS[256] = {
-    0x0000000000000000ull, /* 00 */
-    0x0000000000000080ull, /* 01 */
-    0x0000000000008000ull, /* 02 */
-    0x0000000000008080ull, /* 03 */
-    0x0000000000800000ull, /* 04 */
-    0x0000000000800080ull, /* 05 */
-    0x0000000000808000ull, /* 06 */
-    0x0000000000808080ull, /* 07 */
-    0x0000000080000000ull, /* 08 */
-    0x0000000080000080ull, /* 09 */
-    0x0000000080008000ull, /* 0A */
-    0x0000000080008080ull, /* 0B */
-    0x0000000080800000ull, /* 0C */
-    0x0000000080800080ull, /* 0D */
-    0x0000000080808000ull, /* 0E */
-    0x0000000080808080ull, /* 0F */
-    0x0000008000000000ull, /* 10 */
-    0x0000008000000080ull, /* 11 */
-    0x0000008000008000ull, /* 12 */
-    0x0000008000008080ull, /* 13 */
-    0x0000008000800000ull, /* 14 */
-    0x0000008000800080ull, /* 15 */
-    0x0000008000808000ull, /* 16 */
-    0x0000008000808080ull, /* 17 */
-    0x0000008080000000ull, /* 18 */
-    0x0000008080000080ull, /* 19 */
-    0x0000008080008000ull, /* 1A */
-    0x0000008080008080ull, /* 1B */
-    0x0000008080800000ull, /* 1C */
-    0x0000008080800080ull, /* 1D */
-    0x0000008080808000ull, /* 1E */
-    0x0000008080808080ull, /* 1F */
-    0x0000800000000000ull, /* 20 */
-    0x0000800000000080ull, /* 21 */
-    0x0000800000008000ull, /* 22 */
-    0x0000800000008080ull, /* 23 */
-    0x0000800000800000ull, /* 24 */
-    0x0000800000800080ull, /* 25 */
-    0x0000800000808000ull, /* 26 */
-    0x0000800000808080ull, /* 27 */
-    0x0000800080000000ull, /* 28 */
-    0x0000800080000080ull, /* 29 */
-    0x0000800080008000ull, /* 2A */
-    0x0000800080008080ull, /* 2B */
-    0x0000800080800000ull, /* 2C */
-    0x0000800080800080ull, /* 2D */
-    0x0000800080808000ull, /* 2E */
-    0x0000800080808080ull, /* 2F */
-    0x0000808000000000ull, /* 30 */
-    0x0000808000000080ull, /* 31 */
-    0x0000808000008000ull, /* 32 */
-    0x0000808000008080ull, /* 33 */
-    0x0000808000800000ull, /* 34 */
-    0x0000808000800080ull, /* 35 */
-    0x0000808000808000ull, /* 36 */
-    0x0000808000808080ull, /* 37 */
-    0x0000808080000000ull, /* 38 */
-    0x0000808080000080ull, /* 39 */
-    0x0000808080008000ull, /* 3A */
-    0x0000808080008080ull, /* 3B */
-    0x0000808080800000ull, /* 3C */
-    0x0000808080800080ull, /* 3D */
-    0x0000808080808000ull, /* 3E */
-    0x0000808080808080ull, /* 3F */
-    0x0080000000000000ull, /* 40 */
-    0x0080000000000080ull, /* 41 */
-    0x0080000000008000ull, /* 42 */
-    0x0080000000008080ull, /* 43 */
-    0x0080000000800000ull, /* 44 */
-    0x0080000000800080ull, /* 45 */
-    0x0080000000808000ull, /* 46 */
-    0x0080000000808080ull, /* 47 */
-    0x0080000080000000ull, /* 48 */
-    0x0080000080000080ull, /* 49 */
-    0x0080000080008000ull, /* 4A */
-    0x0080000080008080ull, /* 4B */
-    0x0080000080800000ull, /* 4C */
-    0x0080000080800080ull, /* 4D */
-    0x0080000080808000ull, /* 4E */
-    0x0080000080808080ull, /* 4F */
-    0x0080008000000000ull, /* 50 */
-    0x0080008000000080ull, /* 51 */
-    0x0080008000008000ull, /* 52 */
-    0x0080008000008080ull, /* 53 */
-    0x0080008000800000ull, /* 54 */
-    0x0080008000800080ull, /* 55 */
-    0x0080008000808000ull, /* 56 */
-    0x0080008000808080ull, /* 57 */
-    0x0080008080000000ull, /* 58 */
-    0x0080008080000080ull, /* 59 */
-    0x0080008080008000ull, /* 5A */
-    0x0080008080008080ull, /* 5B */
-    0x0080008080800000ull, /* 5C */
-    0x0080008080800080ull, /* 5D */
-    0x0080008080808000ull, /* 5E */
-    0x0080008080808080ull, /* 5F */
-    0x0080800000000000ull, /* 60 */
-    0x0080800000000080ull, /* 61 */
-    0x0080800000008000ull, /* 62 */
-    0x0080800000008080ull, /* 63 */
-    0x0080800000800000ull, /* 64 */
-    0x0080800000800080ull, /* 65 */
-    0x0080800000808000ull, /* 66 */
-    0x0080800000808080ull, /* 67 */
-    0x0080800080000000ull, /* 68 */
-    0x0080800080000080ull, /* 69 */
-    0x0080800080008000ull, /* 6A */
-    0x0080800080008080ull, /* 6B */
-    0x0080800080800000ull, /* 6C */
-    0x0080800080800080ull, /* 6D */
-    0x0080800080808000ull, /* 6E */
-    0x0080800080808080ull, /* 6F */
-    0x0080808000000000ull, /* 70 */
-    0x0080808000000080ull, /* 71 */
-    0x0080808000008000ull, /* 72 */
-    0x0080808000008080ull, /* 73 */
-    0x0080808000800000ull, /* 74 */
-    0x0080808000800080ull, /* 75 */
-    0x0080808000808000ull, /* 76 */
-    0x0080808000808080ull, /* 77 */
-    0x0080808080000000ull, /* 78 */
-    0x0080808080000080ull, /* 79 */
-    0x0080808080008000ull, /* 7A */
-    0x0080808080008080ull, /* 7B */
-    0x0080808080800000ull, /* 7C */
-    0x0080808080800080ull, /* 7D */
-    0x0080808080808000ull, /* 7E */
-    0x0080808080808080ull, /* 7F */
-    0x8000000000000000ull, /* 80 */
-    0x8000000000000080ull, /* 81 */
-    0x8000000000008000ull, /* 82 */
-    0x8000000000008080ull, /* 83 */
-    0x8000000000800000ull, /* 84 */
-    0x8000000000800080ull, /* 85 */
-    0x8000000000808000ull, /* 86 */
-    0x8000000000808080ull, /* 87 */
-    0x8000000080000000ull, /* 88 */
-    0x8000000080000080ull, /* 89 */
-    0x8000000080008000ull, /* 8A */
-    0x8000000080008080ull, /* 8B */
-    0x8000000080800000ull, /* 8C */
-    0x8000000080800080ull, /* 8D */
-    0x8000000080808000ull, /* 8E */
-    0x8000000080808080ull, /* 8F */
-    0x8000008000000000ull, /* 90 */
-    0x8000008000000080ull, /* 91 */
-    0x8000008000008000ull, /* 92 */
-    0x8000008000008080ull, /* 93 */
-    0x8000008000800000ull, /* 94 */
-    0x8000008000800080ull, /* 95 */
-    0x8000008000808000ull, /* 96 */
-    0x8000008000808080ull, /* 97 */
-    0x8000008080000000ull, /* 98 */
-    0x8000008080000080ull, /* 99 */
-    0x8000008080008000ull, /* 9A */
-    0x8000008080008080ull, /* 9B */
-    0x8000008080800000ull, /* 9C */
-    0x8000008080800080ull, /* 9D */
-    0x8000008080808000ull, /* 9E */
-    0x8000008080808080ull, /* 9F */
-    0x8000800000000000ull, /* A0 */
-    0x8000800000000080ull, /* A1 */
-    0x8000800000008000ull, /* A2 */
-    0x8000800000008080ull, /* A3 */
-    0x8000800000800000ull, /* A4 */
-    0x8000800000800080ull, /* A5 */
-    0x8000800000808000ull, /* A6 */
-    0x8000800000808080ull, /* A7 */
-    0x8000800080000000ull, /* A8 */
-    0x8000800080000080ull, /* A9 */
-    0x8000800080008000ull, /* AA */
-    0x8000800080008080ull, /* AB */
-    0x8000800080800000ull, /* AC */
-    0x8000800080800080ull, /* AD */
-    0x8000800080808000ull, /* AE */
-    0x8000800080808080ull, /* AF */
-    0x8000808000000000ull, /* B0 */
-    0x8000808000000080ull, /* B1 */
-    0x8000808000008000ull, /* B2 */
-    0x8000808000008080ull, /* B3 */
-    0x8000808000800000ull, /* B4 */
-    0x8000808000800080ull, /* B5 */
-    0x8000808000808000ull, /* B6 */
-    0x8000808000808080ull, /* B7 */
-    0x8000808080000000ull, /* B8 */
-    0x8000808080000080ull, /* B9 */
-    0x8000808080008000ull, /* BA */
-    0x8000808080008080ull, /* BB */
-    0x8000808080800000ull, /* BC */
-    0x8000808080800080ull, /* BD */
-    0x8000808080808000ull, /* BE */
-    0x8000808080808080ull, /* BF */
-    0x8080000000000000ull, /* C0 */
-    0x8080000000000080ull, /* C1 */
-    0x8080000000008000ull, /* C2 */
-    0x8080000000008080ull, /* C3 */
-    0x8080000000800000ull, /* C4 */
-    0x8080000000800080ull, /* C5 */
-    0x8080000000808000ull, /* C6 */
-    0x8080000000808080ull, /* C7 */
-    0x8080000080000000ull, /* C8 */
-    0x8080000080000080ull, /* C9 */
-    0x8080000080008000ull, /* CA */
-    0x8080000080008080ull, /* CB */
-    0x8080000080800000ull, /* CC */
-    0x8080000080800080ull, /* CD */
-    0x8080000080808000ull, /* CE */
-    0x8080000080808080ull, /* CF */
-    0x8080008000000000ull, /* D0 */
-    0x8080008000000080ull, /* D1 */
-    0x8080008000008000ull, /* D2 */
-    0x8080008000008080ull, /* D3 */
-    0x8080008000800000ull, /* D4 */
-    0x8080008000800080ull, /* D5 */
-    0x8080008000808000ull, /* D6 */
-    0x8080008000808080ull, /* D7 */
-    0x8080008080000000ull, /* D8 */
-    0x8080008080000080ull, /* D9 */
-    0x8080008080008000ull, /* DA */
-    0x8080008080008080ull, /* DB */
-    0x8080008080800000ull, /* DC */
-    0x8080008080800080ull, /* DD */
-    0x8080008080808000ull, /* DE */
-    0x8080008080808080ull, /* DF */
-    0x8080800000000000ull, /* E0 */
-    0x8080800000000080ull, /* E1 */
-    0x8080800000008000ull, /* E2 */
-    0x8080800000008080ull, /* E3 */
-    0x8080800000800000ull, /* E4 */
-    0x8080800000800080ull, /* E5 */
-    0x8080800000808000ull, /* E6 */
-    0x8080800000808080ull, /* E7 */
-    0x8080800080000000ull, /* E8 */
-    0x8080800080000080ull, /* E9 */
-    0x8080800080008000ull, /* EA */
-    0x8080800080008080ull, /* EB */
-    0x8080800080800000ull, /* EC */
-    0x8080800080800080ull, /* ED */
-    0x8080800080808000ull, /* EE */
-    0x8080800080808080ull, /* EF */
-    0x8080808000000000ull, /* F0 */
-    0x8080808000000080ull, /* F1 */
-    0x8080808000008000ull, /* F2 */
-    0x8080808000008080ull, /* F3 */
-    0x8080808000800000ull, /* F4 */
-    0x8080808000800080ull, /* F5 */
-    0x8080808000808000ull, /* F6 */
-    0x8080808000808080ull, /* F7 */
-    0x8080808080000000ull, /* F8 */
-    0x8080808080000080ull, /* F9 */
-    0x8080808080008000ull, /* FA */
-    0x8080808080008080ull, /* FB */
-    0x8080808080800000ull, /* FC */
-    0x8080808080800080ull, /* FD */
-    0x8080808080808000ull, /* FE */
-    0x8080808080808080ull, /* FF */
-};
-
-void helper_vgbbd(ppc_avr_t *r, ppc_avr_t *b)
-{
-    int i;
-    uint64_t t[2] = { 0, 0 };
-
-    VECTOR_FOR_INORDER_I(i, u8) {
-#if defined(HOST_WORDS_BIGENDIAN)
-        t[i >> 3] |= VGBBD_MASKS[b->u8[i]] >> (i & 7);
-#else
-        t[i >> 3] |= VGBBD_MASKS[b->u8[i]] >> (7 - (i & 7));
-#endif
-    }
-
-    r->u64[0] = t[0];
-    r->u64[1] = t[1];
-}
-
 #define PMSUM(name, srcfld, trgfld, trgtyp)                   \
 void helper_##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)  \
 {                                                             \
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 7d0c824..a3467a5 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -755,6 +755,81 @@ static void trans_vsr(DisasContext *ctx)
     tcg_temp_free_i64(tmp);
 }
 
+/*
+ * vgbbd VRT,VRB - Vector Gather Bits by Bytes by Doubleword
+ *
+ * The ith bits (i in range 1 to 8) of all bytes of a doubleword element in the
+ * source register are concatenated and placed into the ith byte of the
+ * corresponding doubleword element in the destination register.
+ *
+ * The following is done for both doubleword elements of the source register
+ * in parallel, in order to reduce the number of instructions needed (that's
+ * why arrays are used):
+ * First, both doubleword elements of source register vB are placed in the
+ * corresponding elements of array avr. Bits are gathered in an initial step
+ * followed by two for loops of 7 iterations each. In the initial step, bit 1
+ * of byte 1, bit 2 of byte 2, ... bit 8 of byte 8 are already in their final
+ * spots, so avr[j], j={0,1} can simply be and-ed with tcg_mask. In iteration i
+ * of the first loop, avr[j] is shifted right by i * 7 places and the mask by
+ * i * 8 places; for i = 1 this puts bit 1 of byte 2, bit 2 of byte 3, ...
+ * bit 7 of byte 8 in their final spots, so the shifted avr value (saved in
+ * tmp) can be and-ed with the shifted mask and or-ed into the result.
+ * After the first loop, all the first bits are in their final places, all
+ * second bits except the one from the first byte are in their places, ...
+ * and only one eighth bit (from the eighth byte) is in its place. The second
+ * loop does the same with left shifts to place the other half of the bits.
+ * Results are kept in result[0] and result[1] and finally stored in the
+ * corresponding doubleword elements of destination register vD.
+ */
+static void trans_vgbbd(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    uint64_t mask = 0x8040201008040201ULL;
+    int i, j;
+
+    TCGv_i64 result[2];
+    result[0] = tcg_temp_new_i64();
+    result[1] = tcg_temp_new_i64();
+    TCGv_i64 avr[2];
+    avr[0] = tcg_temp_new_i64();
+    avr[1] = tcg_temp_new_i64();
+    TCGv_i64 tcg_mask = tcg_temp_new_i64();
+
+    tcg_gen_movi_i64(tcg_mask, mask);
+    for (j = 0; j < 2; j++) {
+        get_avr64(avr[j], VB, j);
+        tcg_gen_and_i64(result[j], avr[j], tcg_mask);
+    }
+    for (i = 1; i < 8; i++) {
+        tcg_gen_movi_i64(tcg_mask, mask >> (i * 8));
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shri_i64(tmp, avr[j], i * 7);
+            tcg_gen_and_i64(tmp, tmp, tcg_mask);
+            tcg_gen_or_i64(result[j], result[j], tmp);
+        }
+    }
+    for (i = 1; i < 8; i++) {
+        tcg_gen_movi_i64(tcg_mask, mask << (i * 8));
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shli_i64(tmp, avr[j], i * 7);
+            tcg_gen_and_i64(tmp, tmp, tcg_mask);
+            tcg_gen_or_i64(result[j], result[j], tmp);
+        }
+    }
+    for (j = 0; j < 2; j++) {
+        set_avr64(VT, result[j], j);
+    }
+
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(tcg_mask);
+    tcg_temp_free_i64(result[0]);
+    tcg_temp_free_i64(result[1]);
+    tcg_temp_free_i64(avr[0]);
+    tcg_temp_free_i64(avr[1]);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -1299,7 +1374,7 @@ GEN_VXFORM_DUAL(vclzd, PPC_NONE, PPC2_ALTIVEC_207, \
                 vpopcntd, PPC_NONE, PPC2_ALTIVEC_207)
 GEN_VXFORM(vbpermd, 6, 23);
 GEN_VXFORM(vbpermq, 6, 21);
-GEN_VXFORM_NOA(vgbbd, 6, 20);
+GEN_VXFORM_TRANS(vgbbd, 6, 20);
 GEN_VXFORM(vpmsumb, 4, 16)
 GEN_VXFORM(vpmsumh, 4, 17)
 GEN_VXFORM(vpmsumw, 4, 18)
-- 
2.7.4
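
Note on the vgbbd translation in the patch above: the mask-and-shift scheme
amounts to an 8x8 bit-matrix transpose of each doubleword. The following is
only a minimal host-side sketch of that idea in plain C, assuming ordinary
uint64_t arithmetic in place of TCG ops. The function names
gather_bits_by_bytes and gather_bits_naive are hypothetical and do not exist
in QEMU; the naive version is there purely as a cross-check.

```c
#include <stdint.h>

/*
 * Hypothetical plain-C model of the per-doubleword gather done by
 * trans_vgbbd: bits on the diagonal stay put, every other diagonal is
 * moved by a multiple of 7 bit positions and selected with a shifted mask.
 */
static uint64_t gather_bits_by_bytes(uint64_t src)
{
    const uint64_t mask = 0x8040201008040201ULL; /* main diagonal */
    uint64_t result = src & mask;                /* already in place */
    int i;

    for (i = 1; i < 8; i++) {
        /* diagonals on one side of the main diagonal */
        result |= (src >> (i * 7)) & (mask >> (i * 8));
        /* the symmetric diagonals on the other side */
        result |= (src << (i * 7)) & (mask << (i * 8));
    }
    return result;
}

/* Naive cross-check: bit k of byte b moves to bit b of byte k. */
static uint64_t gather_bits_naive(uint64_t src)
{
    uint64_t result = 0;
    int b, k;

    for (b = 0; b < 8; b++) {
        for (k = 0; k < 8; k++) {
            if ((src >> (8 * b + k)) & 1) {
                result |= 1ULL << (8 * k + b);
            }
        }
    }
    return result;
}
```

Since a transpose is its own inverse, applying gather_bits_by_bytes twice
should return the original value, which makes for a cheap sanity check.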




* [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
                   ` (3 preceding siblings ...)
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction Stefan Brankovic
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize Altivec instruction vclzd (Vector Count Leading Zeros Doubleword).
This instruction counts the number of leading zeros of each doubleword element
in source register and places result in the appropriate doubleword element of
destination register.

The implementation uses the TCG count-leading-zeros operation twice (once
for each doubleword element of source register vB) and places each result
in the corresponding doubleword element of destination register vD.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
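
Note (not part of the commit): a minimal plain-C sketch of the intended
semantics, assuming that the last argument of tcg_gen_clzi_i64 is the value
produced for a zero input, so passing 64 yields 64 for an all-zero
doubleword. The names clz64_or and vclzd_model are hypothetical, used only
for illustration.

```c
#include <stdint.h>

/* Count leading zeros of v; return zero_val when v is 0. */
static uint64_t clz64_or(uint64_t v, uint64_t zero_val)
{
    uint64_t n = 0;

    if (v == 0) {
        return zero_val;
    }
    while ((v & (1ULL << 63)) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical model of vclzd: one count per doubleword element. */
static void vclzd_model(uint64_t dst[2], const uint64_t src[2])
{
    dst[0] = clz64_or(src[0], 64);
    dst[1] = clz64_or(src[1], 64);
}
```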
 target/ppc/helper.h                 |  1 -
 target/ppc/int_helper.c             |  3 ---
 target/ppc/translate/vmx-impl.inc.c | 28 +++++++++++++++++++++++++++-
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 0aa1e05..3b92e3b 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -306,7 +306,6 @@ DEF_HELPER_4(vctsxs, void, env, avr, avr, i32)
 DEF_HELPER_2(vclzb, void, avr, avr)
 DEF_HELPER_2(vclzh, void, avr, avr)
 DEF_HELPER_2(vclzw, void, avr, avr)
-DEF_HELPER_2(vclzd, void, avr, avr)
 DEF_HELPER_2(vctzb, void, avr, avr)
 DEF_HELPER_2(vctzh, void, avr, avr)
 DEF_HELPER_2(vctzw, void, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 1e32549..3134798 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1803,17 +1803,14 @@ VUPK(lsw, s64, s32, UPKLO)
 #define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
 #define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
 #define clzw(v) clz32((v))
-#define clzd(v) clz64((v))
 
 VGENERIC_DO(clzb, u8)
 VGENERIC_DO(clzh, u16)
 VGENERIC_DO(clzw, u32)
-VGENERIC_DO(clzd, u64)
 
 #undef clzb
 #undef clzh
 #undef clzw
-#undef clzd
 
 #define ctzb(v) ((v) ? ctz32(v) : 8)
 #define ctzh(v) ((v) ? ctz32(v) : 16)
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index a3467a5..c5bebfb 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -830,6 +830,32 @@ static void trans_vgbbd(DisasContext *ctx)
     tcg_temp_free_i64(avr[1]);
 }
 
+/*
+ * vclzd VRT,VRB - Vector Count Leading Zeros Doubleword
+ *
+ * Counting the number of leading zero bits of each doubleword element in source
+ * register and placing result in appropriate doubleword element of destination
+ * register.
+ */
+static void trans_vclzd(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+
+    /* high doubleword */
+    get_avr64(avr, VB, true);
+    tcg_gen_clzi_i64(avr, avr, 64);
+    set_avr64(VT, avr, true);
+
+    /* low doubleword */
+    get_avr64(avr, VB, false);
+    tcg_gen_clzi_i64(avr, avr, 64);
+    set_avr64(VT, avr, false);
+
+    tcg_temp_free_i64(avr);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -1346,7 +1372,7 @@ GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 GEN_VXFORM_NOA(vclzb, 1, 28)
 GEN_VXFORM_NOA(vclzh, 1, 29)
 GEN_VXFORM_NOA(vclzw, 1, 30)
-GEN_VXFORM_NOA(vclzd, 1, 31)
+GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
 GEN_VXFORM_NOA_2(vnegd, 1, 24, 7)
 GEN_VXFORM_NOA_2(vextsb2w, 1, 24, 16)
-- 
2.7.4




* [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
                   ` (4 preceding siblings ...)
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-26 15:38   ` Richard Henderson
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize Altivec instruction vclzw (Vector Count Leading Zeros Word).
This instruction counts the number of leading zeros of each word element
in source register and places result in the appropriate word element of
destination register.

Counting is performed in four iterations of a for loop (one for each
word element of source register vB). Every iteration consists of loading
the appropriate word element from the source register, counting leading
zeros with tcg_gen_clzi_i32, and saving the result in the corresponding
word element of the destination register.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
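
Note (not part of the commit): a rough plain-C sketch of the four-iteration
loop, assuming the register can be modelled as a 16-byte buffer accessed at
byte offsets i * 4. The names clz32_or and vclzw_model are hypothetical.
Because each word is loaded from and stored to the same offset, the sketch
does not depend on how the words are ordered within the buffer.

```c
#include <stdint.h>
#include <string.h>

/* Count leading zeros of v; return zero_val when v is 0. */
static uint32_t clz32_or(uint32_t v, uint32_t zero_val)
{
    uint32_t n = 0;

    if (v == 0) {
        return zero_val;
    }
    while ((v & (1U << 31)) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical model of vclzw on a 16-byte register image. */
static void vclzw_model(uint8_t dst[16], const uint8_t src[16])
{
    int i;

    for (i = 0; i < 4; i++) {
        uint32_t w;

        memcpy(&w, src + i * 4, sizeof(w));   /* load word element i */
        w = clz32_or(w, 32);                  /* 32 for a zero word */
        memcpy(dst + i * 4, &w, sizeof(w));   /* store at same offset */
    }
}
```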
 target/ppc/helper.h                 |  1 -
 target/ppc/int_helper.c             |  3 ---
 target/ppc/translate/vmx-impl.inc.c | 28 +++++++++++++++++++++++++++-
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 3b92e3b..595241c 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -305,7 +305,6 @@ DEF_HELPER_4(vctsxs, void, env, avr, avr, i32)
 
 DEF_HELPER_2(vclzb, void, avr, avr)
 DEF_HELPER_2(vclzh, void, avr, avr)
-DEF_HELPER_2(vclzw, void, avr, avr)
 DEF_HELPER_2(vctzb, void, avr, avr)
 DEF_HELPER_2(vctzh, void, avr, avr)
 DEF_HELPER_2(vctzw, void, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 3134798..82cb12e 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1802,15 +1802,12 @@ VUPK(lsw, s64, s32, UPKLO)
 
 #define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
 #define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
-#define clzw(v) clz32((v))
 
 VGENERIC_DO(clzb, u8)
 VGENERIC_DO(clzh, u16)
-VGENERIC_DO(clzw, u32)
 
 #undef clzb
 #undef clzh
-#undef clzw
 
 #define ctzb(v) ((v) ? ctz32(v) : 8)
 #define ctzh(v) ((v) ? ctz32(v) : 16)
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index c5bebfb..2588a9e 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -831,6 +831,32 @@ static void trans_vgbbd(DisasContext *ctx)
 }
 
 /*
+ * vclzw VRT,VRB - Vector Count Leading Zeros Word
+ *
+ * Counting the number of leading zero bits of each word element in source
+ * register and placing result in appropriate word element of destination
+ * register.
+ */
+static void trans_vclzw(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i32 tmp = tcg_temp_new_i32();
+    int i;
+
+    /* Perform count for every word element using tcg_gen_clzi_i32. */
+    for (i = 0; i < 4; i++) {
+        tcg_gen_ld_i32(tmp, cpu_env,
+            offsetof(CPUPPCState, vsr[32 + VB].u64[0]) + i * 4);
+        tcg_gen_clzi_i32(tmp, tmp, 32);
+        tcg_gen_st_i32(tmp, cpu_env,
+            offsetof(CPUPPCState, vsr[32 + VT].u64[0]) + i * 4);
+    }
+
+    tcg_temp_free_i32(tmp);
+}
+
+/*
  * vclzd VRT,VRB - Vector Count Leading Zeros Doubleword
  *
  * Counting the number of leading zero bits of each doubleword element in source
@@ -1371,7 +1397,7 @@ GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 
 GEN_VXFORM_NOA(vclzb, 1, 28)
 GEN_VXFORM_NOA(vclzh, 1, 29)
-GEN_VXFORM_NOA(vclzw, 1, 30)
+GEN_VXFORM_TRANS(vclzw, 1, 30)
 GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
 GEN_VXFORM_NOA_2(vnegd, 1, 24, 7)
-- 
2.7.4




* [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
                   ` (5 preceding siblings ...)
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
  2019-06-19 11:47 ` [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions no-reply
  8 siblings, 0 replies; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
This instruction counts the number of leading zeros of each halfword element
in source register and places result in the appropriate halfword element of
destination register.

Each iteration of the outer for loop performs the count on one
doubleword element of source register vB. In the first iteration, the
higher doubleword element of vB is placed in variable avr, and the count
for every halfword element is then performed using tcg_gen_clzi_i64.
Since that operation counts leading zeros over 64 bits, the ith halfword
element has to be moved to the highest 16 bits of tmp and or-ed with mask
(in order to get all ones in the lowest 48 bits) before tcg_gen_clzi_i64
is applied; its result is then moved into the appropriate halfword
element of result. This is done in the inner for loop. After the
operation is finished, the result is saved in the appropriate doubleword
element of destination register vD. The same sequence of operations is
then applied to the lower doubleword element of vB.

Optimize Altivec instruction vclzb (Vector Count Leading Zeros Byte).
This instruction counts the number of leading zeros of each byte element
in source register and places result in the appropriate byte element of
destination register.

Each iteration of the outer for loop performs the count on one
doubleword element of source register vB. In the first iteration, the
higher doubleword element of vB is placed in variable avr, and the count
for every byte element is then performed using tcg_gen_clzi_i64. Since
that operation counts leading zeros over 64 bits, the ith byte element
has to be moved to the highest 8 bits of variable tmp and or-ed with mask
(in order to get all ones in the lowest 56 bits) before tcg_gen_clzi_i64
is applied; its result is then moved into the appropriate byte element of
result. This is done in the inner for loop. After the operation is
finished, the result is saved in the appropriate doubleword element of
destination register vD. The same sequence of operations is then applied
to the lower doubleword element of vB.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
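
Note (not part of the commit): a minimal plain-C sketch of the
shift / or-with-mask / count / place sequence described above, assuming
uint64_t arithmetic in place of TCG ops. The names clz64_nonzero and
clz_elems are hypothetical; width is 8 for vclzb and 16 for vclzh. The
sketch or-s each count into a zeroed result, whereas the patch uses
tcg_gen_deposit_i64, which should be equivalent here.

```c
#include <stdint.h>

/* Count leading zeros of v; returns 64 for v == 0. */
static uint64_t clz64_nonzero(uint64_t v)
{
    uint64_t n = 0;

    if (v == 0) {
        return 64;
    }
    while ((v & (1ULL << 63)) == 0) {
        v <<= 1;
        n++;
    }
    return n;
}

/*
 * Hypothetical model: count leading zeros in every width-bit element
 * of one doubleword (width = 8 or 16).
 */
static uint64_t clz_elems(uint64_t avr, unsigned width)
{
    const unsigned n = 64 / width;
    const uint64_t mask = ~0ULL >> width; /* ones below the top element */
    uint64_t result = 0;
    unsigned j;

    for (j = 0; j < n; j++) {
        /* Move element j to the top, fill everything below with ones. */
        uint64_t tmp = (avr << ((n - 1 - j) * width)) | mask;

        /* The count is at most width, so it fits in element j. */
        result |= clz64_nonzero(tmp) << (j * width);
    }
    return result;
}
```

The all-ones fill below the top element is what caps each count at the
element width, so a zero element yields 8 (or 16) rather than 64.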
 target/ppc/helper.h                 |   2 -
 target/ppc/int_helper.c             |   9 ---
 target/ppc/translate/vmx-impl.inc.c | 122 +++++++++++++++++++++++++++++++++++-
 3 files changed, 120 insertions(+), 13 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 595241c..17b4b06 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -303,8 +303,6 @@ DEF_HELPER_4(vcfsx, void, env, avr, avr, i32)
 DEF_HELPER_4(vctuxs, void, env, avr, avr, i32)
 DEF_HELPER_4(vctsxs, void, env, avr, avr, i32)
 
-DEF_HELPER_2(vclzb, void, avr, avr)
-DEF_HELPER_2(vclzh, void, avr, avr)
 DEF_HELPER_2(vctzb, void, avr, avr)
 DEF_HELPER_2(vctzh, void, avr, avr)
 DEF_HELPER_2(vctzw, void, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 82cb12e..264b5e7 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1800,15 +1800,6 @@ VUPK(lsw, s64, s32, UPKLO)
         }                                                               \
     }
 
-#define clzb(v) ((v) ? clz32((uint32_t)(v) << 24) : 8)
-#define clzh(v) ((v) ? clz32((uint32_t)(v) << 16) : 16)
-
-VGENERIC_DO(clzb, u8)
-VGENERIC_DO(clzh, u16)
-
-#undef clzb
-#undef clzh
-
 #define ctzb(v) ((v) ? ctz32(v) : 8)
 #define ctzh(v) ((v) ? ctz32(v) : 16)
 #define ctzw(v) ctz32((v))
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 2588a9e..81569a8 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -831,6 +831,124 @@ static void trans_vgbbd(DisasContext *ctx)
 }
 
 /*
+ * vclzb VRT,VRB - Vector Count Leading Zeros Byte
+ *
+ * Counting the number of leading zero bits of each byte element in source
+ * register and placing result in appropriate byte element of destination
+ * register.
+ */
+static void trans_vclzb(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every byte element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros over 64 bits, we have to move the
+         * ith byte element to the highest 8 bits of tmp, or it with mask (so
+         * we get all ones in the lowest 56 bits), then perform
+         * tcg_gen_clzi_i64 and move its result into the appropriate byte
+         * element of result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 56);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 7; j++) {
+            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            set_avr64(VT, result, true);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
+ * vclzh VRT,VRB - Vector Count Leading Zeros Halfword
+ *
+ * Counting the number of leading zero bits of each halfword element in source
+ * register and placing result in appropriate halfword element of destination
+ * register.
+ */
+static void trans_vclzh(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword element of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword element of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every halfword element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros over 64 bits, we have to move the
+         * ith halfword element to the highest 16 bits of tmp, or it with mask
+         * (so we get all ones in the lowest 48 bits), then perform
+         * tcg_gen_clzi_i64 and move its result into the appropriate halfword
+         * element of result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 48);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 3; j++) {
+            tcg_gen_shli_i64(tmp, avr, (3 - j) * 16);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 16, 16);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 48, 16);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            set_avr64(VT, result, true);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
  * vclzw VRT,VRB - Vector Count Leading Zeros Word
  *
  * Counting the number of leading zero bits of each word element in source
@@ -1395,8 +1513,8 @@ GEN_VAFORM_PAIRED(vmsumshm, vmsumshs, 20)
 GEN_VAFORM_PAIRED(vsel, vperm, 21)
 GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 
-GEN_VXFORM_NOA(vclzb, 1, 28)
-GEN_VXFORM_NOA(vclzh, 1, 29)
+GEN_VXFORM_TRANS(vclzb, 1, 28)
+GEN_VXFORM_TRANS(vclzh, 1, 29)
 GEN_VXFORM_TRANS(vclzw, 1, 30)
 GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
-- 
2.7.4




* [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
                   ` (6 preceding siblings ...)
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  2019-06-19 11:47 ` [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions no-reply
  8 siblings, 0 replies; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Since I found these two instructions already implemented with TCG, I
refactored them so they are consistent with the other similar
implementations introduced in this series.

Also, a new dual macro GEN_VXFORM_TRANS_DUAL is added. This macro is
used when one instruction is realized with direct translation and the
second one with a helper.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 81569a8..f052dcb 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -350,6 +350,28 @@ static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
     }                                                                  \
 }
 
+/*
+ * We use this macro if one instruction is realized with direct
+ * translation, and second one with helper.
+ */
+#define GEN_VXFORM_TRANS_DUAL(name0, flg0, flg2_0, name1, flg1, flg2_1)\
+static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
+{                                                                      \
+    if ((Rc(ctx->opcode) == 0) &&                                      \
+        ((ctx->insns_flags & flg0) || (ctx->insns_flags2 & flg2_0))) { \
+        if (unlikely(!ctx->altivec_enabled)) {                         \
+            gen_exception(ctx, POWERPC_EXCP_VPU);                      \
+            return;                                                    \
+        }                                                              \
+        trans_##name0(ctx);                                            \
+    } else if ((Rc(ctx->opcode) == 1) &&                               \
+        ((ctx->insns_flags & flg1) || (ctx->insns_flags2 & flg2_1))) { \
+        gen_##name1(ctx);                                              \
+    } else {                                                           \
+        gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);            \
+    }                                                                  \
+}
+
 /* Adds support to provide invalid mask */
 #define GEN_VXFORM_DUAL_EXT(name0, flg0, flg2_0, inval0,                \
                             name1, flg1, flg2_1, inval1)                \
@@ -431,20 +453,13 @@ GEN_VXFORM(vmrglb, 6, 4);
 GEN_VXFORM(vmrglh, 6, 5);
 GEN_VXFORM(vmrglw, 6, 6);
 
-static void gen_vmrgew(DisasContext *ctx)
+static void trans_vmrgew(DisasContext *ctx)
 {
-    TCGv_i64 tmp;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    tmp = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(avr, VB, true);
     tcg_gen_shri_i64(tmp, avr, 32);
@@ -462,21 +477,14 @@ static void gen_vmrgew(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
-static void gen_vmrgow(DisasContext *ctx)
+static void trans_vmrgow(DisasContext *ctx)
 {
-    TCGv_i64 t0, t1;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    t0 = tcg_temp_new_i64();
-    t1 = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(t0, VB, true);
     get_avr64(t1, VA, true);
@@ -1142,14 +1150,14 @@ GEN_VXFORM_ENV(vminfp, 5, 17);
 GEN_VXFORM_HETRO(vextublx, 6, 24)
 GEN_VXFORM_HETRO(vextuhlx, 6, 25)
 GEN_VXFORM_HETRO(vextuwlx, 6, 26)
-GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
+GEN_VXFORM_TRANS_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwlx, PPC_NONE, PPC2_ISA300)
 GEN_VXFORM_HETRO(vextubrx, 6, 28)
 GEN_VXFORM_HETRO(vextuhrx, 6, 29)
 GEN_VXFORM_HETRO(vextuwrx, 6, 30)
 GEN_VXFORM_TRANS(lvsl, 6, 31)
 GEN_VXFORM_TRANS(lvsr, 6, 32)
-GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \
+GEN_VXFORM_TRANS_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwrx, PPC_NONE, PPC2_ISA300)
 
 #define GEN_VXRFORM1(opname, name, str, opc2, opc3)                     \
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
                   ` (7 preceding siblings ...)
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
@ 2019-06-19 11:47 ` no-reply
  8 siblings, 0 replies; 18+ messages in thread
From: no-reply @ 2019-06-19 11:47 UTC (permalink / raw)
  To: stefan.brankovic; +Cc: qemu-devel, david

Patchew URL: https://patchew.org/QEMU/1560942225-24728-1-git-send-email-stefan.brankovic@rt-rk.com/



Hi,

This series failed build test on s390x host. Please find the details below.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
# Testing script will be invoked under the git checkout with
# HEAD pointing to a commit that has the patches applied on top of "base"
# branch
set -e
CC=$HOME/bin/cc
INSTALL=$PWD/install
BUILD=$PWD/build
mkdir -p $BUILD $INSTALL
SRC=$PWD
cd $BUILD
$SRC/configure --cc=$CC --prefix=$INSTALL
make -j4
# XXX: we need reliable clean up
# make check -j4 V=1
make install

echo
echo "=== ENV ==="
env

echo
echo "=== PACKAGES ==="
rpm -qa
=== TEST SCRIPT END ===

  CC      riscv64-softmmu/accel/tcg/tcg-all.o
In file included from /var/tmp/patchew-tester-tmp-wp6edlgx/src/target/ppc/translate.c:6826:
/var/tmp/patchew-tester-tmp-wp6edlgx/src/target/ppc/translate/vmx-impl.inc.c: In function ‘trans_lvsl’:
/var/tmp/patchew-tester-tmp-wp6edlgx/src/target/ppc/translate/vmx-impl.inc.c:520:29: error: passing argument 2 of ‘tcg_gen_ext_i32_i64’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  520 |     tcg_gen_ext_i32_i64(sh, EA);
      |                             ^~
      |                             |
---
      |                                        ~~~~~~~~~^~~
In file included from /var/tmp/patchew-tester-tmp-wp6edlgx/src/target/ppc/translate.c:6826:
/var/tmp/patchew-tester-tmp-wp6edlgx/src/target/ppc/translate/vmx-impl.inc.c: In function ‘trans_lvsr’:
/var/tmp/patchew-tester-tmp-wp6edlgx/src/target/ppc/translate/vmx-impl.inc.c:559:29: error: passing argument 2 of ‘tcg_gen_ext_i32_i64’ from incompatible pointer type [-Werror=incompatible-pointer-types]
  559 |     tcg_gen_ext_i32_i64(sh, EA);
      |                             ^~
      |                             |


The full log is available at
http://patchew.org/logs/1560942225-24728-1-git-send-email-stefan.brankovic@rt-rk.com/testing.s390x/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
@ 2019-06-26 15:28   ` Richard Henderson
  2019-06-26 15:48   ` Richard Henderson
  1 sibling, 0 replies; 18+ messages in thread
From: Richard Henderson @ 2019-06-26 15:28 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/19/19 1:03 PM, Stefan Brankovic wrote:
> Adding simple macro that is calling tcg implementation of appropriate
> instruction if altivec support is active.
> 
> Optimization of altivec instruction lvsl (Load Vector for Shift Left).
> Place bytes sh:sh+15 of value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F
> in destination register. Sh is calculated by adding 2 source registers and
> getting bits 60-63 of result.
> 
> First, the bits [28-31] are placed from EA to variable sh. After that,
> the bytes are created in the following way:
> sh:(sh+7) of X(from description) by multiplying sh with 0x0101010101010101
> followed by addition of the result with 0x0001020304050607. Value obtained
> is placed in higher doubleword element of vD.
> (sh+8):(sh+15) by adding the result of previous multiplication with
> 0x08090a0b0c0d0e0f. Value obtained is placed in lower doubleword element
> of vD.
> 
> Optimization of altivec instruction lvsr (Load Vector for Shift Right).
> Place bytes 16-sh:31-sh of value 0x00 || 0x01 || 0x02 || ... || 0x1E ||
> 0x1F in destination register. Sh is calculated by adding 2 source
> registers and getting bits 60-63 of result.
> 
> First, the bits [28-31] are placed from EA to variable sh. After that,
> the bytes are created in the following way:
> sh:(sh+7) of X(from description) by multiplying sh with 0x0101010101010101
> followed by subtraction of the result from 0x1011121314151617. Value
> obtained is placed in higher doubleword element of vD.
> (sh+8):(sh+15) by subtracting the result of previous multiplication from
> 0x18191a1b1c1d1e1f. Value obtained is placed in lower doubleword element
> of vD.
> 
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |   2 -
>  target/ppc/int_helper.c             |  18 ------
>  target/ppc/translate/vmx-impl.inc.c | 120 ++++++++++++++++++++++++++----------
>  3 files changed, 88 insertions(+), 52 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions Stefan Brankovic
@ 2019-06-26 15:34   ` Richard Henderson
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Henderson @ 2019-06-26 15:34 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/19/19 1:03 PM, Stefan Brankovic wrote:
> Optimization of altivec instructions vsl and vsr (Vector Shift Left/Right).
> Perform shift operation (left and right respectively) on 128 bit value of
> register vA by value specified in bits 125-127 of register vB. Lowest 3
> bits in each byte element of register vB must be identical or result is
> undefined.
> 
> For vsl instruction, the first step is bits 125-127 of register vB have
> to be saved in variable sh. Then, the highest sh bits of the lower
> doubleword element of register vA are saved in variable shifted,
> in order not to lose those bits when shift operation is performed on
> the lower doubleword element of register vA, which is the next
> step. After shifting the lower doubleword element shift operation
> is performed on higher doubleword element of vA, with replacement of
> the lowest sh bits(that are now 0) with bits saved in shifted.
> 
> For vsr instruction, firstly, the bits 125-127 of register vB have
> to be saved in variable sh. Then, the lowest sh bits of the higher
> doubleword element of register vA are saved in variable shifted,
> in order not to lose those bits when the shift operation is
> performed on the higher doubleword element of register vA, which is
> the next step. After shifting higher doubleword element, shift operation
> is performed on lower doubleword element of vA, with replacement of
> highest sh bits(that are now 0) with bits saved in shifted.
> 
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |  2 -
>  target/ppc/int_helper.c             | 35 -------------
>  target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++-
>  3 files changed, 97 insertions(+), 39 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
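
For readers following the commit message quoted above, the two-halves
shift can be modelled in plain C roughly as follows. This is only a
sketch, not the TCG code from the patch: the names vsl_model and
vsr_model are made up for illustration, and sh is passed in directly
instead of being extracted from bits 125-127 of vB.

```c
#include <stdint.h>

/*
 * Hypothetical plain-C model (not QEMU code): shift the 128-bit value
 * hi:lo left by sh bits, with sh limited to 0..7 as for vsl.
 */
static void vsl_model(uint64_t *hi, uint64_t *lo, unsigned sh)
{
    sh &= 7;
    /*
     * Highest sh bits of the low half, moved down to bit 0. The shift
     * is split in two so that sh == 0 never shifts by 64.
     */
    uint64_t carried = (*lo >> 1) >> (63 - sh);
    *hi = (*hi << sh) | carried;
    *lo <<= sh;
}

/* Same idea for vsr: the lowest sh bits of hi move into the top of lo. */
static void vsr_model(uint64_t *hi, uint64_t *lo, unsigned sh)
{
    sh &= 7;
    uint64_t carried = (*hi << 1) << (63 - sh);
    *lo = (*lo >> sh) | carried;
    *hi >>= sh;
}
```

The split shift is a choice made for this C model only; it sidesteps the
undefined 64-bit shift that a single "lo >> (64 - sh)" would hit when sh
is zero.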


r~



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction Stefan Brankovic
@ 2019-06-26 15:37   ` Richard Henderson
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Henderson @ 2019-06-26 15:37 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/19/19 1:03 PM, Stefan Brankovic wrote:
> Optimize altivec instruction vgbbd (Vector Gather Bits by Bytes by Doubleword).
> All ith bits (i in range 1 to 8) of each byte of doubleword element in
> source register are concatenated and placed into ith byte of appropriate
> doubleword element in destination register.
> 
> Following solution is done for both doubleword elements of source register
> in parallel, in order to reduce the number of instructions needed(that's why
> arrays are used):
> First, both doubleword elements of source register vB are placed in
> appropriate element of array avr. Bits are gathered in 2x8 iterations(2 for
> loops). In first iteration bit 1 of byte 1, bit 2 of byte 2,... bit 8 of
> byte 8 are in their final spots so avr[i], i={0,1} can be and-ed with
> tcg_mask. For every following iteration, both avr[i] and tcg_mask variables
> have to be shifted right for 7 and 8 places, respectively, in order to get
> bit 1 of byte 2, bit 2 of byte 3.. bit 7 of byte 8 in their final spots so
> shifted avr values(saved in tmp) can be and-ed with new value of tcg_mask...
> After the first 8 iterations (first loop), all the first bits are in their
> final places, all second bits but the second bit from the eighth byte are in
> their places... (only 1 eighth bit, from the eighth byte, is in its place).
> In the second loop we do all
> operations symmetrically, in order to get other half of bits in their final
> spots. Results for first and second doubleword elements are saved in
> result[0] and result[1] respectively. In the end those results are saved in
> appropriate doubleword element of destination register vD.
> 
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |   1 -
>  target/ppc/int_helper.c             | 276 ------------------------------------
>  target/ppc/translate/vmx-impl.inc.c |  77 +++++++++-
>  3 files changed, 76 insertions(+), 278 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
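
The gathering described in the quoted commit message amounts to
transposing each doubleword viewed as an 8x8 bit matrix. A plain-C
sketch of the diagonal-mask idea is given below. It is not the TCG code
from the patch: the function name is hypothetical, it handles a single
doubleword, and it masks first and shifts second, whereas the patch is
described as shifting the value and the mask in step.

```c
#include <stdint.h>

/*
 * Hypothetical plain-C model (not QEMU code) of vgbbd on one doubleword.
 * Byte 0 and bit 0 are the most significant ones, as in the Power ISA
 * numbering, so bit c of byte r sits at LSB-based position 63 - (8r + c).
 * Transposing moves that bit by 7 * (r - c) positions, which is constant
 * along each diagonal of the 8x8 matrix.
 */
static uint64_t vgbbd_dword_model(uint64_t x)
{
    const uint64_t diag = 0x8040201008040201ull; /* main diagonal, r == c */
    uint64_t result = x & diag;                  /* already in place */

    for (int d = 1; d < 8; d++) {
        /* Diagonal with r - c == d: move its bits up by 7 * d. */
        result |= (x & (diag >> (8 * d))) << (7 * d);
        /* Diagonal with c - r == d: move its bits down by 7 * d. */
        result |= (x & (diag << (8 * d))) >> (7 * d);
    }
    return result;
}
```

Because a transpose is its own inverse, applying the function twice
should return the original value, which makes a convenient sanity check.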


r~



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction Stefan Brankovic
@ 2019-06-26 15:38   ` Richard Henderson
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Henderson @ 2019-06-26 15:38 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/19/19 1:03 PM, Stefan Brankovic wrote:
> Optimize Altivec instruction vclzw (Vector Count Leading Zeros Word).
> This instruction counts the number of leading zeros of each word element
> in source register and places result in the appropriate word element of
> destination register.
> 
> Counting is to be performed in four iterations of a for loop (one for each
> word element of source register vB). Every iteration consists of loading
> appropriate word element from source register, counting leading zeros
> with tcg_gen_clzi_i32, and saving the result in appropriate word element
> of destination register.
> 
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/helper.h                 |  1 -
>  target/ppc/int_helper.c             |  3 ---
>  target/ppc/translate/vmx-impl.inc.c | 28 +++++++++++++++++++++++++++-
>  3 files changed, 27 insertions(+), 5 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
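
As a reference point for the per-word loop described in the quoted
commit message, a plain-C model could look like this. It is a sketch
only: the helper names are invented here, the vector is represented as
an array of four words, and it assumes a zero word yields 32, which is
what vclzw requires.

```c
#include <stdint.h>

/* Hypothetical helper: leading zeros of a 32-bit word, 32 for zero. */
static uint32_t clz32_model(uint32_t x)
{
    uint32_t n = 0;

    if (x == 0) {
        return 32;
    }
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical model of vclzw: one count per word element. */
static void vclzw_model(const uint32_t src[4], uint32_t dst[4])
{
    for (int i = 0; i < 4; i++) {
        dst[i] = clz32_model(src[i]);
    }
}
```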


r~



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-19 11:03 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
  2019-06-26 15:28   ` Richard Henderson
@ 2019-06-26 15:48   ` Richard Henderson
  1 sibling, 0 replies; 18+ messages in thread
From: Richard Henderson @ 2019-06-26 15:48 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/19/19 1:03 PM, Stefan Brankovic wrote:
> +    /* Get sh(from description) by anding EA with 0xf. */
> +    gen_addr_reg_index(ctx, EA);
> +    tcg_gen_ext_i32_i64(sh, EA);

From patchew, this fails to build one of the ppc*-linux-user configurations.

You need to use tcg_gen_extu_tl_i64().


r~


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-06 16:46   ` Richard Henderson
@ 2019-06-17 11:31     ` Stefan Brankovic
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:31 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 18:46, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> +    tcg_gen_addi_i64(result, sh, 7);
>> +    for (i = 7; i >= 1; i--) {
>> +        tcg_gen_shli_i64(tmp, sh, i * 8);
>> +        tcg_gen_or_i64(result, result, tmp);
>> +        tcg_gen_addi_i64(sh, sh, 1);
>> +    }
> Better to replicate sh into the 8 positions and then use one add.
>
>      tcg_gen_muli_i64(sh, sh, 0x0101010101010101ull);
>      tcg_gen_addi_i64(hi_result, sh, 0x0001020304050607ull);
>      tcg_gen_addi_i64(lo_result, sh, 0x08090a0b0c0d0e0full);
>
> and
>
>      tcg_gen_subfi_i64(hi_result, 0x1011121314151617ull, sh);
>      tcg_gen_subfi_i64(lo_result, 0x18191a1b1c1d1e1full, sh);
>
> for lvsr.
>
I think you are right, this is definitely a better way of implementing it.
I will adopt your approach in v2.

Kind Regards,

Stefan

> r~


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
@ 2019-06-06 16:46   ` Richard Henderson
  2019-06-17 11:31     ` Stefan Brankovic
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2019-06-06 16:46 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> +    tcg_gen_addi_i64(result, sh, 7);
> +    for (i = 7; i >= 1; i--) {
> +        tcg_gen_shli_i64(tmp, sh, i * 8);
> +        tcg_gen_or_i64(result, result, tmp);
> +        tcg_gen_addi_i64(sh, sh, 1);
> +    }

Better to replicate sh into the 8 positions and then use one add.

    tcg_gen_muli_i64(sh, sh, 0x0101010101010101ull);
    tcg_gen_addi_i64(hi_result, sh, 0x0001020304050607ull);
    tcg_gen_addi_i64(lo_result, sh, 0x08090a0b0c0d0e0full);

and

    tcg_gen_subfi_i64(hi_result, 0x1011121314151617ull, sh);
    tcg_gen_subfi_i64(lo_result, 0x18191a1b1c1d1e1full, sh);

for lvsr.
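
To see why a single multiply and add is enough, here is a plain-C model
of the arithmetic. It is a sketch with made-up function names, not TCG
code. Since sh is at most 15, every byte sum stays at or below 0x1e and
every byte difference stays at or above 0x01, so no carry or borrow
crosses a byte boundary.

```c
#include <stdint.h>

/* Hypothetical model of lvsl: bytes sh..sh+15 of 0x00 || 0x01 || ... || 0x1f. */
static void lvsl_model(uint64_t ea, uint64_t *hi, uint64_t *lo)
{
    uint64_t sh = ea & 0xf;
    uint64_t rep = sh * 0x0101010101010101ull; /* sh in each of 8 bytes */

    *hi = rep + 0x0001020304050607ull;
    *lo = rep + 0x08090a0b0c0d0e0full;
}

/* Hypothetical model of lvsr: bytes 16-sh..31-sh of the same constant. */
static void lvsr_model(uint64_t ea, uint64_t *hi, uint64_t *lo)
{
    uint64_t sh = ea & 0xf;
    uint64_t rep = sh * 0x0101010101010101ull;

    *hi = 0x1011121314151617ull - rep;
    *lo = 0x18191a1b1c1d1e1full - rep;
}
```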


r~


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 16:46   ` Richard Henderson
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Adding simple macro that is calling tcg implementation of appropriate
instruction if altivec support is active.

Optimization of altivec instruction lvsl (Load Vector for Shift Left).
Place bytes sh:sh+15 of value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F
in destination register. Sh is calculated by adding 2 source registers and
getting bits 60-63 of result.

First we place bits [28-31] of EA to variable sh. After that
we create bytes sh:(sh+7) of X(from description) in for loop
(by incrementing sh in each iteration and placing it in
appropriate byte of variable result) and save them in higher
doubleword element of vD. We repeat this once again for lower
doubleword element of vD by creating bytes (sh+8):(sh+15) in
a for loop and saving result.

Optimization of altivec instruction lvsr (Load Vector for Shift Right).
Place bytes 16-sh:31-sh of value 0x00 || 0x01 || 0x02 || ... || 0x1E ||
0x1F in destination register. Sh is calculated by adding 2 source
registers and getting bits 60-63 of result.

First we place bits [28-31] of EA to variable sh. After that
we create bytes (16-sh):(23-sh) of X(from description) in for loop
(by incrementing sh in each iteration and placing it in
appropriate byte of variable result) and save them in higher
doubleword element of vD. We repeat this once again for lower
doubleword element of vD by creating bytes (24-sh):(31-sh) in
a for loop and saving result.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 143 ++++++++++++++++++++++++++++--------
 1 file changed, 111 insertions(+), 32 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index bd3ff40..140bb05 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -142,38 +142,6 @@ GEN_VR_STVE(bx, 0x07, 0x04, 1);
 GEN_VR_STVE(hx, 0x07, 0x05, 2);
 GEN_VR_STVE(wx, 0x07, 0x06, 4);
 
-static void gen_lvsl(DisasContext *ctx)
-{
-    TCGv_ptr rd;
-    TCGv EA;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    EA = tcg_temp_new();
-    gen_addr_reg_index(ctx, EA);
-    rd = gen_avr_ptr(rD(ctx->opcode));
-    gen_helper_lvsl(rd, EA);
-    tcg_temp_free(EA);
-    tcg_temp_free_ptr(rd);
-}
-
-static void gen_lvsr(DisasContext *ctx)
-{
-    TCGv_ptr rd;
-    TCGv EA;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    EA = tcg_temp_new();
-    gen_addr_reg_index(ctx, EA);
-    rd = gen_avr_ptr(rD(ctx->opcode));
-    gen_helper_lvsr(rd, EA);
-    tcg_temp_free(EA);
-    tcg_temp_free_ptr(rd);
-}
-
 static void gen_mfvscr(DisasContext *ctx)
 {
     TCGv_i32 t;
@@ -316,6 +284,16 @@ static void glue(gen_, name)(DisasContext *ctx)                         \
     tcg_temp_free_ptr(rd);                                              \
 }
 
+#define GEN_VXFORM_TRANS(name, opc2, opc3)                              \
+static void glue(gen_, name)(DisasContext *ctx)                         \
+{                                                                       \
+    if (unlikely(!ctx->altivec_enabled)) {                              \
+        gen_exception(ctx, POWERPC_EXCP_VPU);                           \
+        return;                                                         \
+    }                                                                   \
+    trans_##name(ctx);                                                  \
+}
+
 #define GEN_VXFORM_ENV(name, opc2, opc3)                                \
 static void glue(gen_, name)(DisasContext *ctx)                         \
 {                                                                       \
@@ -515,6 +493,105 @@ static void gen_vmrgow(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
+/*
+ * lvsl VRT,RA,RB - Load Vector for Shift Left
+ *
+ * Let the EA be the sum (rA|0)+(rB). Let sh=EA[28–31].
+ * Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F.
+ * Bytes sh:sh+15 of X are placed into vD.
+ */
+static void trans_lvsl(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv_i64 EA = tcg_temp_new();
+    int i;
+
+    /* Get sh(from description) by anding EA with 0xf. */
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_andi_i64(sh, EA, 0xfULL);
+    /*
+     * Create bytes sh:sh+7 of X(from description) and place them in
+     * higher doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 7);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+        tcg_gen_addi_i64(sh, sh, 1);
+    }
+    set_avr64(VT, result, true);
+    /*
+     * Create bytes sh+8:sh+15 of X(from description) and place them in
+     * lower doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 8);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_addi_i64(sh, sh, 1);
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+    }
+    set_avr64(VT, result, false);
+
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free(EA);
+}
+
+/*
+ * lvsr VRT,RA,RB - Load Vector for Shift Right
+ *
+ * Let the EA be the sum (rA|0)+(rB). Let sh=EA[28–31].
+ * Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F.
+ * Bytes (16-sh):(31-sh) of X are placed into vD.
+ */
+static void trans_lvsr(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv_i64 EA = tcg_temp_new();
+    int i;
+
+    /* Get sh(from description) by anding EA with 0xf. */
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_andi_i64(sh, EA, 0xfULL);
+    /* Make (16-sh) and save it in sh. */
+    tcg_gen_subi_i64(sh, sh, 0x10ULL);
+    tcg_gen_neg_i64(sh, sh);
+    /*
+     * Create bytes (16-sh):(23-sh) of X(from description) and place them in
+     * higher doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 7);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+        tcg_gen_addi_i64(sh, sh, 1);
+    }
+    set_avr64(VT, result, true);
+    /*
+     * Create bytes (24-sh):(32-sh) of X(from description) and place them in
+     * lower doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 8);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_addi_i64(sh, sh, 1);
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+    }
+    set_avr64(VT, result, false);
+
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free(EA);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -657,6 +734,8 @@ GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
 GEN_VXFORM_HETRO(vextubrx, 6, 28)
 GEN_VXFORM_HETRO(vextuhrx, 6, 29)
 GEN_VXFORM_HETRO(vextuwrx, 6, 30)
+GEN_VXFORM_TRANS(lvsl, 6, 31)
+GEN_VXFORM_TRANS(lvsr, 6, 32)
 GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \
                 vextuwrx, PPC_NONE, PPC2_ISA300)
 
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-06-26 18:46 UTC | newest]

Thread overview: 18+ messages
2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
2019-06-19 11:03 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
2019-06-26 15:28   ` Richard Henderson
2019-06-26 15:48   ` Richard Henderson
2019-06-19 11:03 ` [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions Stefan Brankovic
2019-06-26 15:34   ` Richard Henderson
2019-06-19 11:03 ` [Qemu-devel] [PATCH 3/8] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
2019-06-19 11:03 ` [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction Stefan Brankovic
2019-06-26 15:37   ` Richard Henderson
2019-06-19 11:03 ` [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction Stefan Brankovic
2019-06-19 11:03 ` [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction Stefan Brankovic
2019-06-26 15:38   ` Richard Henderson
2019-06-19 11:03 ` [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
2019-06-19 11:03 ` [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
2019-06-19 11:47 ` [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions no-reply
  -- strict thread matches above, loose matches on Subject: below --
2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
2019-06-06 10:15 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
2019-06-06 16:46   ` Richard Henderson
2019-06-17 11:31     ` Stefan Brankovic
