* [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl,
@ 2019-06-06 10:15 Stefan Brankovic
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

This series builds on and complements the recent work of Thomas Murta, Mark
Cave-Ayland and Richard Henderson in the same area. It is based on devising
TCG translations for the selected instructions rather than calling
out-of-line helpers. The selected instructions are mostly idiosyncratic to
the ppc platform, so a direct mapping to a single host instruction is not
possible; a relatively complex inline TCG translation is still the best
option, and that is the approach taken in this series. The performance
improvements are significant in all cases.

Stefan Brankovic (8):
  target/ppc: Optimize emulation of lvsl and lvsr instructions
  target/ppc: Optimize emulation of vsl and vsr instructions
  target/ppc: Optimize emulation of vpkpx instruction
  target/ppc: Optimize emulation of vgbbd instruction
  target/ppc: Optimize emulation of vclzd instruction
  target/ppc: Optimize emulation of vclzw instruction
  target/ppc: Optimize emulation of vclzh and vclzb instructions
  target/ppc: Refactor emulation of vmrgew and vmrgow instructions

 target/ppc/translate/vmx-impl.inc.c | 705 ++++++++++++++++++++++++++++++++----
 1 file changed, 636 insertions(+), 69 deletions(-)

-- 
2.7.4




* [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 16:46   ` Richard Henderson
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions Stefan Brankovic
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david


Add a simple macro that calls the TCG implementation of the appropriate
instruction if Altivec support is active.

Optimize the Altivec instruction lvsl (Load Vector for Shift Left).
Bytes sh:sh+15 of the value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F
are placed in the destination register. sh is obtained by adding the two
source registers and taking bits 60-63 of the result.

First we place the low four bits of the EA (bits 60-63) in the variable
sh. Then we create bytes sh:(sh+7) of X (from the description above) in a
for loop, by incrementing sh in each iteration and placing it in the
appropriate byte of the variable result, and save them in the high
doubleword element of vD. We repeat this for the low doubleword element
of vD by creating bytes (sh+8):(sh+15) in a for loop and saving the
result.

Optimize the Altivec instruction lvsr (Load Vector for Shift Right).
Bytes (16-sh):(31-sh) of the value 0x00 || 0x01 || 0x02 || ... || 0x1E ||
0x1F are placed in the destination register. sh is obtained by adding the
two source registers and taking bits 60-63 of the result.

First we place the low four bits of the EA (bits 60-63) in the variable
sh. Then we create bytes (16-sh):(23-sh) of X (from the description
above) in a for loop, by incrementing sh in each iteration and placing it
in the appropriate byte of the variable result, and save them in the high
doubleword element of vD. We repeat this for the low doubleword element
of vD by creating bytes (24-sh):(31-sh) in a for loop and saving the
result.
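
For reference, the architectural effect can be sketched in plain C as
follows (illustration only; the helper names and the byte-array view of vD
are hypothetical and not part of the patch; element 0 is the most
significant byte of vD):

    #include <stdint.h>

    /* Scalar sketch of lvsl: vd gets bytes sh .. sh+15 of X, where X[k] = k. */
    static void lvsl_ref(uint8_t vd[16], uint64_t ea)
    {
        unsigned sh = ea & 0xf;            /* EA bits 60-63 */
        for (int i = 0; i < 16; i++) {
            vd[i] = sh + i;
        }
    }

    /* Scalar sketch of lvsr: vd gets bytes (16-sh) .. (31-sh) of X. */
    static void lvsr_ref(uint8_t vd[16], uint64_t ea)
    {
        unsigned sh = ea & 0xf;
        for (int i = 0; i < 16; i++) {
            vd[i] = 16 - sh + i;
        }
    }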

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 143 ++++++++++++++++++++++++++++--------
 1 file changed, 111 insertions(+), 32 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index bd3ff40..140bb05 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -142,38 +142,6 @@ GEN_VR_STVE(bx, 0x07, 0x04, 1);
 GEN_VR_STVE(hx, 0x07, 0x05, 2);
 GEN_VR_STVE(wx, 0x07, 0x06, 4);
 
-static void gen_lvsl(DisasContext *ctx)
-{
-    TCGv_ptr rd;
-    TCGv EA;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    EA = tcg_temp_new();
-    gen_addr_reg_index(ctx, EA);
-    rd = gen_avr_ptr(rD(ctx->opcode));
-    gen_helper_lvsl(rd, EA);
-    tcg_temp_free(EA);
-    tcg_temp_free_ptr(rd);
-}
-
-static void gen_lvsr(DisasContext *ctx)
-{
-    TCGv_ptr rd;
-    TCGv EA;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    EA = tcg_temp_new();
-    gen_addr_reg_index(ctx, EA);
-    rd = gen_avr_ptr(rD(ctx->opcode));
-    gen_helper_lvsr(rd, EA);
-    tcg_temp_free(EA);
-    tcg_temp_free_ptr(rd);
-}
-
 static void gen_mfvscr(DisasContext *ctx)
 {
     TCGv_i32 t;
@@ -316,6 +284,16 @@ static void glue(gen_, name)(DisasContext *ctx)                         \
     tcg_temp_free_ptr(rd);                                              \
 }
 
+#define GEN_VXFORM_TRANS(name, opc2, opc3)                              \
+static void glue(gen_, name)(DisasContext *ctx)                         \
+{                                                                       \
+    if (unlikely(!ctx->altivec_enabled)) {                              \
+        gen_exception(ctx, POWERPC_EXCP_VPU);                           \
+        return;                                                         \
+    }                                                                   \
+    trans_##name(ctx);                                                  \
+}
+
 #define GEN_VXFORM_ENV(name, opc2, opc3)                                \
 static void glue(gen_, name)(DisasContext *ctx)                         \
 {                                                                       \
@@ -515,6 +493,105 @@ static void gen_vmrgow(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
+/*
+ * lvsl VRT,RA,RB - Load Vector for Shift Left
+ *
+ * Let the EA be the sum (rA|0)+(rB). Let sh=EA[28–31].
+ * Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F.
+ * Bytes sh:sh+15 of X are placed into vD.
+ */
+static void trans_lvsl(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv EA = tcg_temp_new();
+    int i;
+
+    /* Get sh(from description) by anding EA with 0xf. */
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_andi_i64(sh, EA, 0xfULL);
+    /*
+     * Create bytes sh:sh+7 of X(from description) and place them in
+     * higher doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 7);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+        tcg_gen_addi_i64(sh, sh, 1);
+    }
+    set_avr64(VT, result, true);
+    /*
+     * Create bytes sh+8:sh+15 of X(from description) and place them in
+     * lower doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 8);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_addi_i64(sh, sh, 1);
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+    }
+    set_avr64(VT, result, false);
+
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free(EA);
+}
+
+/*
+ * lvsr VRT,RA,RB - Load Vector for Shift Right
+ *
+ * Let the EA be the sum (rA|0)+(rB). Let sh=EA[28–31].
+ * Let X be the 32-byte value 0x00 || 0x01 || 0x02 || ... || 0x1E || 0x1F.
+ * Bytes (16-sh):(31-sh) of X are placed into vD.
+ */
+static void trans_lvsr(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv EA = tcg_temp_new();
+    int i;
+
+    /* Get sh(from description) by anding EA with 0xf. */
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_andi_i64(sh, EA, 0xfULL);
+    /* Make (16-sh) and save it in sh. */
+    tcg_gen_subi_i64(sh, sh, 0x10ULL);
+    tcg_gen_neg_i64(sh, sh);
+    /*
+     * Create bytes (16-sh):(23-sh) of X(from description) and place them in
+     * higher doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 7);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+        tcg_gen_addi_i64(sh, sh, 1);
+    }
+    set_avr64(VT, result, true);
+    /*
+     * Create bytes (24-sh):(31-sh) of X(from description) and place them in
+     * lower doubleword of vD.
+     */
+    tcg_gen_addi_i64(result, sh, 8);
+    for (i = 7; i >= 1; i--) {
+        tcg_gen_addi_i64(sh, sh, 1);
+        tcg_gen_shli_i64(tmp, sh, i * 8);
+        tcg_gen_or_i64(result, result, tmp);
+    }
+    set_avr64(VT, result, false);
+
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free(EA);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -657,6 +734,8 @@ GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
 GEN_VXFORM_HETRO(vextubrx, 6, 28)
 GEN_VXFORM_HETRO(vextuhrx, 6, 29)
 GEN_VXFORM_HETRO(vextuwrx, 6, 30)
+GEN_VXFORM_TRANS(lvsl, 6, 31)
+GEN_VXFORM_TRANS(lvsr, 6, 32)
 GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \
                 vextuwrx, PPC_NONE, PPC2_ISA300)
 
-- 
2.7.4




* [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 17:03   ` Richard Henderson
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 3/8] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instructions vsl and vsr (Vector Shift Left/Right).
They shift the 128-bit value of register vA left or right, respectively,
by the amount specified in bits 125-127 of register vB. The lowest 3 bits
in each byte element of register vB must be identical or the result is
undefined.

For the vsl instruction we first save bits 125-127 of register vB in the
variable sh. Then we save the highest sh bits of the low doubleword
element of register vA in the variable shifted, so that we don't lose
those bits when we shift the low doubleword element of register vA, which
is our next step. After shifting the low doubleword element we shift the
high doubleword element of vA and replace its lowest sh bits (which are
now 0) with the bits saved in shifted.

For the vsr instruction we first save bits 125-127 of register vB in the
variable sh. Then we save the lowest sh bits of the high doubleword
element of register vA in the variable shifted, so that we don't lose
those bits when we shift the high doubleword element of register vA,
which is our next step. After shifting the high doubleword element we
shift the low doubleword element of vA and replace its highest sh bits
(which are now 0) with the bits saved in shifted.
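
The two 64-bit halves are combined the same way a scalar 128-bit shift
would be. A minimal C sketch of that composition (illustration only; hi/lo
stand for the high/low doublewords of vA and sh for vB bits 125-127; the
sh == 0 case is special-cased only because shifting by 64 is undefined in
C):

    #include <stdint.h>

    /* Sketch of vsl: shift the 128-bit value hi:lo left by sh (0..7). */
    static void vsl_ref(uint64_t *hi, uint64_t *lo, unsigned sh)
    {
        uint64_t carry = sh ? *lo >> (64 - sh) : 0;  /* top sh bits of lo */
        *hi = (*hi << sh) | carry;
        *lo <<= sh;
    }

    /* Sketch of vsr: shift the 128-bit value hi:lo right by sh (0..7). */
    static void vsr_ref(uint64_t *hi, uint64_t *lo, unsigned sh)
    {
        uint64_t carry = sh ? *hi << (64 - sh) : 0;  /* low sh bits of hi */
        *lo = (*lo >> sh) | carry;
        *hi >>= sh;
    }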

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 101 +++++++++++++++++++++++++++++++++++-
 1 file changed, 99 insertions(+), 2 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 140bb05..6bd072a 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -592,6 +592,103 @@ static void trans_lvsr(DisasContext *ctx)
     tcg_temp_free(EA);
 }
 
+/*
+ * vsl VRT,VRA,VRB - Vector Shift Left
+ *
+ * Shifting left 128 bit value of vA by value specified in bits 125-127 of vB.
+ * Lowest 3 bits in each byte element of register vB must be identical or
+ * result is undefined.
+ */
+static void trans_vsl(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avrA = tcg_temp_new_i64();
+    TCGv_i64 avrB = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+
+    /* Place bits 125-127 of vB in sh. */
+    get_avr64(avrB, VB, false);
+    tcg_gen_andi_i64(sh, avrB, 0x07ULL);
+
+    /*
+     * Save highest sh bits of lower doubleword element of vA in variable
+     * shifted and perform shift on lower doubleword.
+     */
+    get_avr64(avrA, VA, false);
+    tcg_gen_subi_i64(tmp, sh, 64);
+    tcg_gen_neg_i64(tmp, tmp);
+    tcg_gen_shr_i64(shifted, avrA, tmp);
+    tcg_gen_shl_i64(avrA, avrA, sh);
+    set_avr64(VT, avrA, false);
+
+    /*
+     * Perform shift on higher doubleword element of vA and replace lowest
+     * sh bits with shifted.
+     */
+    get_avr64(avrA, VA, true);
+    tcg_gen_shl_i64(avrA, avrA, sh);
+    tcg_gen_or_i64(avrA, avrA, shifted);
+    set_avr64(VT, avrA, true);
+
+    tcg_temp_free_i64(avrA);
+    tcg_temp_free_i64(avrB);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(tmp);
+}
+
+/*
+ * vsr VRT,VRA,VRB - Vector Shift Right
+ *
+ * Shifting right 128 bit value of vA by value specified in bits 125-127 of vB.
+ * Lowest 3 bits in each byte element of register vB must be identical or
+ * result is undefined.
+ */
+static void trans_vsr(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avrA = tcg_temp_new_i64();
+    TCGv_i64 avrB = tcg_temp_new_i64();
+    TCGv_i64 sh = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+
+    /* Place bits 125-127 of vB in sh. */
+    get_avr64(avrB, VB, false);
+    tcg_gen_andi_i64(sh, avrB, 0x07ULL);
+
+    /*
+     * Save lowest sh bits of higher doubleword element of vA in variable
+     * shifted and perform shift on higher doubleword.
+     */
+    get_avr64(avrA, VA, true);
+    tcg_gen_subi_i64(tmp, sh, 64);
+    tcg_gen_neg_i64(tmp, tmp);
+    tcg_gen_shl_i64(shifted, avrA, tmp);
+    tcg_gen_shr_i64(avrA, avrA, sh);
+    set_avr64(VT, avrA, true);
+    /*
+     * Perform shift on lower doubleword element of vA and replace highest
+     * sh bits with shifted.
+     */
+    get_avr64(avrA, VA, false);
+    tcg_gen_shr_i64(avrA, avrA, sh);
+    tcg_gen_or_i64(avrA, avrA, shifted);
+    set_avr64(VT, avrA, false);
+
+    tcg_temp_free_i64(avrA);
+    tcg_temp_free_i64(avrB);
+    tcg_temp_free_i64(sh);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(tmp);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -699,11 +796,11 @@ GEN_VXFORM(vrld, 2, 3);
 GEN_VXFORM(vrldmi, 2, 3);
 GEN_VXFORM_DUAL(vrld, PPC_NONE, PPC2_ALTIVEC_207, \
                 vrldmi, PPC_NONE, PPC2_ISA300)
-GEN_VXFORM(vsl, 2, 7);
+GEN_VXFORM_TRANS(vsl, 2, 7);
 GEN_VXFORM(vrldnm, 2, 7);
 GEN_VXFORM_DUAL(vsl, PPC_ALTIVEC, PPC_NONE, \
                 vrldnm, PPC_NONE, PPC2_ISA300)
-GEN_VXFORM(vsr, 2, 11);
+GEN_VXFORM_TRANS(vsr, 2, 11);
 GEN_VXFORM_ENV(vpkuhum, 7, 0);
 GEN_VXFORM_ENV(vpkuwum, 7, 1);
 GEN_VXFORM_ENV(vpkudum, 7, 17);
-- 
2.7.4




* [Qemu-devel] [PATCH 3/8] target/ppc: Optimize emulation of vpkpx instruction
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction Stefan Brankovic
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instruction vpkpx (Vector Pack Pixel).
It packs 8 pixels (4 from each source register) into the 6-5-5 pattern,
forming a contiguous array of bits in the destination register.

Each iteration of the outer loop performs the 6-5-5 pack for the 2 pixels
of one doubleword element of one source register. The first thing done in
the outer loop is choosing which doubleword element of which register is
used in the current iteration and placing it in the variable avr. Then
the 6-5-5 pack of the pixels in avr is performed in the inner for loop
(2 iterations, 1 per pixel) and the result is saved in the variable tmp.
At the end of the outer loop, tmp is merged into the variable result,
which is saved in the appropriate doubleword element of vD once a whole
doubleword is finished (every second iteration). The outer loop has 4
iterations.
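
Per pixel, the packed halfword keeps the least significant bit of the most
significant byte plus the top 5 bits of each of the other three bytes of
the 32-bit source word. A scalar sketch of that single-pixel step,
matching the shift/mask pairs used in the patch below (illustration only;
the helper name is hypothetical):

    #include <stdint.h>

    /* Pack one 32-bit pixel into a 6-5-5 halfword:
     * result bits 15-10 <- source bits 24-19,
     * result bits  9-5  <- source bits 15-11,
     * result bits  4-0  <- source bits  7-3. */
    static uint16_t pack_pixel_ref(uint32_t w)
    {
        return ((w >> 9) & 0xfc00) |
               ((w >> 6) & 0x03e0) |
               ((w >> 3) & 0x001f);
    }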

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 93 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 92 insertions(+), 1 deletion(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 6bd072a..87f69dc 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -593,6 +593,97 @@ static void trans_lvsr(DisasContext *ctx)
 }
 
 /*
+ * vpkpx VRT,VRA,VRB - Vector Pack Pixel
+ *
+ * Rearranges 8 pixels coded in 6-5-5 pattern (4 from each source register)
+ * into contiguous array of bits in the destination register.
+ */
+static void trans_vpkpx(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    int64_t mask1 = 0x1fULL;
+    int64_t mask2 = 0x1fULL << 5;
+    int64_t mask3 = 0x3fULL << 10;
+    int i, j;
+    /*
+     * In each iteration do the 6-5-5 pack for 2 pixels of each doubleword
+     * element of each source register.
+     */
+    for (i = 0; i < 4; i++) {
+        switch (i) {
+        case 0:
+            /*
+             * Get high doubleword of vA to perform 6-5-5 pack of pixels
+             * 1 and 2.
+             */
+            get_avr64(avr, VA, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 1:
+            /*
+             * Get low doubleword of vA to perform 6-5-5 pack of pixels
+             * 3 and 4.
+             */
+            get_avr64(avr, VA, false);
+            break;
+        case 2:
+            /*
+             * Get high doubleword of vB to perform 6-5-5 pack of pixels
+             * 5 and 6.
+             */
+            get_avr64(avr, VB, true);
+            tcg_gen_movi_i64(result, 0x0ULL);
+            break;
+        case 3:
+            /*
+             * Get low doubleword of vB to perform 6-5-5 pack of pixels
+             * 7 and 8.
+             */
+            get_avr64(avr, VB, false);
+            break;
+        }
+        /* Perform the packing for 2 pixels (1 for each iteration). */
+        tcg_gen_movi_i64(tmp, 0x0ULL);
+        for (j = 0; j < 2; j++) {
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 3));
+            tcg_gen_andi_i64(shifted, shifted, mask1 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 6));
+            tcg_gen_andi_i64(shifted, shifted, mask2 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+
+            tcg_gen_shri_i64(shifted, avr, (j * 16 + 9));
+            tcg_gen_andi_i64(shifted, shifted, mask3 << (j * 16));
+            tcg_gen_or_i64(tmp, tmp, shifted);
+        }
+        if ((i == 0) || (i == 2)) {
+            tcg_gen_shli_i64(tmp, tmp, 32);
+        }
+        tcg_gen_or_i64(result, result, tmp);
+        if (i == 1) {
+            /* Place packed pixels 1:4 to high doubleword of vD. */
+            set_avr64(VT, result, true);
+        }
+        if (i == 3) {
+            /* Place packed pixels 5:8 to low doubleword of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+}
+
+/*
  * vsl VRT,VRA,VRB - Vector Shift Left
  *
  * Shifting left 128 bit value of vA by value specified in bits 125-127 of vB.
@@ -813,7 +904,7 @@ GEN_VXFORM_ENV(vpksdus, 7, 21);
 GEN_VXFORM_ENV(vpkshss, 7, 6);
 GEN_VXFORM_ENV(vpkswss, 7, 7);
 GEN_VXFORM_ENV(vpksdss, 7, 23);
-GEN_VXFORM(vpkpx, 7, 12);
+GEN_VXFORM_TRANS(vpkpx, 7, 12);
 GEN_VXFORM_ENV(vsum4ubs, 4, 24);
 GEN_VXFORM_ENV(vsum4sbs, 4, 28);
 GEN_VXFORM_ENV(vsum4shs, 4, 25);
-- 
2.7.4




* [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
                   ` (2 preceding siblings ...)
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 3/8] target/ppc: Optimize emulation of vpkpx instruction Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 18:19   ` Richard Henderson
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction Stefan Brankovic
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instruction vgbbd (Vector Gather Bits by Bytes by
Doubleword).
For each doubleword element of the source register, the i-th bits (i in
the range 1 to 8) of all eight bytes are concatenated and placed into the
i-th byte of the corresponding doubleword element of the destination
register.

The following is done for every doubleword element of the source register
(kept in the variable shifted):
We gather the bits in 2x8 iterations.
In the first iteration bit 1 of byte 1, bit 2 of byte 2, ... bit 8 of
byte 8 are already in their final spots, so we just AND avr with mask. In
every following iteration we shift shifted right by 7 places and mask
right by 8 places, so that bit 1 of byte 2, bit 2 of byte 3, ... bit 7 of
byte 8 land in the right places, and we AND shifted with the new value of
mask. After the first 8 iterations (the first for loop) all of the first
bits are in their final place, all of the second bits except the second
bit of the eighth byte are in place, ..., and only one eighth bit (the
one from the eighth byte) is in place; we therefore AND result1 with
mask1 to keep exactly the bits that already sit in the right place. In
the second loop we do the symmetric operations (shifting left), which
puts the other half of the bits in their final spots, and save the result
in result2. The OR of result1 and result2 is placed in the appropriate
doubleword element of vD. This is repeated for both doubleword elements.
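
In other words, each doubleword is treated as an 8x8 bit matrix that gets
transposed. A scalar reference of the intended result, useful for checking
the TCG sequence (illustration only; the function name is hypothetical,
and bits and bytes are numbered from the most significant end):

    #include <stdint.h>

    /* Reference for vgbbd on one doubleword: output byte i collects bit i
     * of every input byte, i.e. bit j of output byte i equals bit i of
     * input byte j. */
    static uint64_t vgbbd_ref(uint64_t x)
    {
        uint64_t r = 0;
        for (int i = 0; i < 8; i++) {        /* output byte, MSB first */
            for (int j = 0; j < 8; j++) {    /* input byte, MSB first */
                uint64_t bit = (x >> (63 - 8 * j - i)) & 1;
                r |= bit << (63 - 8 * i - j);
            }
        }
        return r;
    }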

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 98 insertions(+), 1 deletion(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 87f69dc..010f337 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -780,6 +780,103 @@ static void trans_vsr(DisasContext *ctx)
     tcg_temp_free_i64(tmp);
 }
 
+/*
+ * vgbbd VRT,VRB - Vector Gather Bits by Bytes by Doubleword
+ *
+ * All ith bits (i in range 1 to 8) of each byte of doubleword element in source
+ * register are concatenated and placed into ith byte of appropriate doubleword
+ * element in destination register.
+ *
+ * Following solution is done for every doubleword element of source register
+ * (placed in shifted variable):
+ * We gather bits in 2x8 iterations.
+ * In first iteration bit 1 of byte 1, bit 2 of byte 2,... bit 8 of byte 8 are
+ * in their final spots so we just and avr with mask. For every next iteration,
+ * we have to shift right both shifted(7 places) and mask(8 places), so we get
+ * bit 1 of byte 2, bit 2 of byte 3.. bit 7 of byte 8 in right places so we and
+ * shifted with new value of mask... After first 8 iteration(first for loop) we
+ * have all first bits in their final place all second bits but second bit from
+ * eight byte in their place,... only 1 eight bit from eight byte is in it's
+ * place), so we and result1 with mask1 to save those bits that are at right
+ * place and save them in result1. In second loop we do all operations
+ * symetrical, so we get other half of bits on their final spots, and save
+ * result in result2. Or of result1 and result2 is placed in appropriate
+ * doubleword element of vD. We repeat this 2 times.
+ */
+static void trans_vgbbd(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 shifted = tcg_temp_new_i64();
+    TCGv_i64 result1 = tcg_temp_new_i64();
+    TCGv_i64 result2 = tcg_temp_new_i64();
+    uint64_t mask = 0x8040201008040201ULL;
+    uint64_t mask1 = 0x80c0e0f0f8fcfeffULL;
+    uint64_t mask2 = 0x7f3f1f0f07030100ULL;
+    int i;
+
+    get_avr64(avr, VB, true);
+    tcg_gen_movi_i64(result1, 0x0ULL);
+    tcg_gen_mov_i64(shifted, avr);
+    for (i = 0; i < 8; i++) {
+        tcg_gen_andi_i64(tmp, shifted, mask);
+        tcg_gen_or_i64(result1, result1, tmp);
+
+        tcg_gen_shri_i64(shifted, shifted, 7);
+        mask = mask >> 8;
+    }
+    tcg_gen_andi_i64(result1, result1, mask1);
+
+    mask = 0x8040201008040201ULL;
+    tcg_gen_movi_i64(result2, 0x0ULL);
+    for (i = 0; i < 8; i++) {
+        tcg_gen_andi_i64(tmp, avr, mask);
+        tcg_gen_or_i64(result2, result2, tmp);
+
+        tcg_gen_shli_i64(avr, avr, 7);
+        mask = mask << 8;
+    }
+    tcg_gen_andi_i64(result2, result2, mask2);
+
+    tcg_gen_or_i64(result2, result2, result1);
+    set_avr64(VT, result2, true);
+
+    mask = 0x8040201008040201ULL;
+    get_avr64(avr, VB, false);
+    tcg_gen_movi_i64(result1, 0x0ULL);
+    tcg_gen_mov_i64(shifted, avr);
+    for (i = 0; i < 8; i++) {
+        tcg_gen_andi_i64(tmp, shifted, mask);
+        tcg_gen_or_i64(result1, result1, tmp);
+
+        tcg_gen_shri_i64(shifted, shifted, 7);
+        mask = mask >> 8;
+    }
+    tcg_gen_andi_i64(result1, result1, mask1);
+
+    mask = 0x8040201008040201ULL;
+    tcg_gen_movi_i64(result2, 0x0ULL);
+    for (i = 0; i < 8; i++) {
+        tcg_gen_andi_i64(tmp, avr, mask);
+        tcg_gen_or_i64(result2, result2, tmp);
+
+        tcg_gen_shli_i64(avr, avr, 7);
+        mask = mask << 8;
+    }
+    tcg_gen_andi_i64(result2, result2, mask2);
+
+    tcg_gen_or_i64(result2, result2, result1);
+    set_avr64(VT, result2, false);
+
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(shifted);
+    tcg_temp_free_i64(result1);
+    tcg_temp_free_i64(result2);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -1319,7 +1416,7 @@ GEN_VXFORM_DUAL(vclzd, PPC_NONE, PPC2_ALTIVEC_207, \
                 vpopcntd, PPC_NONE, PPC2_ALTIVEC_207)
 GEN_VXFORM(vbpermd, 6, 23);
 GEN_VXFORM(vbpermq, 6, 21);
-GEN_VXFORM_NOA(vgbbd, 6, 20);
+GEN_VXFORM_TRANS(vgbbd, 6, 20);
 GEN_VXFORM(vpmsumb, 4, 16)
 GEN_VXFORM(vpmsumh, 4, 17)
 GEN_VXFORM(vpmsumw, 4, 18)
-- 
2.7.4




* [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
                   ` (3 preceding siblings ...)
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 18:26   ` Richard Henderson
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction Stefan Brankovic
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instruction vclzd (Vector Count Leading Zeros
Doubleword). This instruction counts the number of leading zeros of each
doubleword element in the source register and places the result in the
corresponding doubleword element of the destination register.

We use TCG's count-leading-zeros operation twice (once for each
doubleword element of the source register vB) and place each result in
the corresponding doubleword element of the destination register vD.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 010f337..1c34908 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -877,6 +877,32 @@ static void trans_vgbbd(DisasContext *ctx)
     tcg_temp_free_i64(result2);
 }
 
+/*
+ * vclzd VRT,VRB - Vector Count Leading Zeros Doubleword
+ *
+ * Counting the number of leading zero bits of each doubleword element in source
+ * register and placing result in appropriate doubleword element of destination
+ * register.
+ */
+static void trans_vclzd(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+
+    /* high doubleword */
+    get_avr64(avr, VB, true);
+    tcg_gen_clzi_i64(avr, avr, 64);
+    set_avr64(VT, avr, true);
+
+    /* low doubleword */
+    get_avr64(avr, VB, false);
+    tcg_gen_clzi_i64(avr, avr, 64);
+    set_avr64(VT, avr, false);
+
+    tcg_temp_free_i64(avr);
+}
+
 GEN_VXFORM(vmuloub, 4, 0);
 GEN_VXFORM(vmulouh, 4, 1);
 GEN_VXFORM(vmulouw, 4, 2);
@@ -1388,7 +1414,7 @@ GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 GEN_VXFORM_NOA(vclzb, 1, 28)
 GEN_VXFORM_NOA(vclzh, 1, 29)
 GEN_VXFORM_NOA(vclzw, 1, 30)
-GEN_VXFORM_NOA(vclzd, 1, 31)
+GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
 GEN_VXFORM_NOA_2(vnegd, 1, 24, 7)
 GEN_VXFORM_NOA_2(vextsb2w, 1, 24, 16)
-- 
2.7.4




* [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
                   ` (4 preceding siblings ...)
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 18:34   ` Richard Henderson
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instruction vclzw (Vector Count Leading Zeros Word).
This instruction counts the number of leading zeros of each word element
in the source register and places the result in the corresponding word
element of the destination register.

The counting is done in two iterations of a for loop (one for each
doubleword element of the source register vB). The first thing done in
the loop is placing the appropriate doubleword element of vB in the
variable avr. Then the counting is performed with TCG's
count-leading-zeros operation. Since it counts leading zeros over a
64-bit length, we have to move each word element to the highest 32 bits
of the variable tmp and OR it with a mask (so we get all ones in the
lowest 32 bits), then perform tcg_gen_clzi_i64 and move its result to the
appropriate word element of the variable result. At the end of each loop
iteration we save the variable result to the appropriate doubleword
element of the destination register vD.
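
The per-word trick can be checked against a scalar version: shifting the
word to the top 32 bits and forcing the low 32 bits to ones guarantees the
64-bit count never exceeds 32. A minimal sketch (illustration only;
clz64() is a hypothetical stand-in for tcg_gen_clzi_i64 with a default of
64):

    #include <stdint.h>

    static unsigned clz64(uint64_t x)
    {
        return x ? __builtin_clzll(x) : 64;   /* GCC/Clang builtin */
    }

    /* Leading zeros of a 32-bit word via a single 64-bit count. */
    static unsigned clz32_ref(uint32_t w)
    {
        uint64_t t = ((uint64_t)w << 32) | 0xffffffffULL;
        return clz64(t);                      /* 0..32; 32 iff w == 0 */
    }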

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 57 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 1c34908..7689739 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -878,6 +878,61 @@ static void trans_vgbbd(DisasContext *ctx)
 }
 
 /*
+ * vclzw VRT,VRB - Vector Count Leading Zeros Word
+ *
+ * Counting the number of leading zero bits of each word element in source
+ * register and placing result in appropriate word element of destination
+ * register.
+ */
+static void trans_vclzw(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffULL);
+    int i;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword element of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword element of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every word element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros over a 64-bit length, we move the ith
+         * word element to the highest 32 bits of tmp, OR it with mask (so we
+         * get all ones in the lowest 32 bits), then perform tcg_gen_clzi_i64
+         * and move its result to the appropriate word element of result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 32);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 32, 32);
+
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            set_avr64(VT, result, true);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
  * vclzd VRT,VRB - Vector Count Leading Zeros Doubleword
  *
  * Counting the number of leading zero bits of each doubleword element in source
@@ -1413,7 +1468,7 @@ GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 
 GEN_VXFORM_NOA(vclzb, 1, 28)
 GEN_VXFORM_NOA(vclzh, 1, 29)
-GEN_VXFORM_NOA(vclzw, 1, 30)
+GEN_VXFORM_TRANS(vclzw, 1, 30)
 GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
 GEN_VXFORM_NOA_2(vnegd, 1, 24, 7)
-- 
2.7.4




* [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
                   ` (5 preceding siblings ...)
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 20:38   ` Richard Henderson
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Optimize the Altivec instruction vclzh (Vector Count Leading Zeros
Halfword). This instruction counts the number of leading zeros of each
halfword element in the source register and places the result in the
corresponding halfword element of the destination register.

Each iteration of the outer for loop performs the count on one doubleword
element of the source register vB. In the first iteration the high
doubleword element of vB is placed in the variable avr, then the count is
performed for every halfword element using tcg_gen_clzi_i64. Since it
counts leading zeros over a 64-bit length, we have to move each halfword
element to the highest 16 bits of tmp and OR it with a mask (so we get
all ones in the lowest 48 bits), then perform tcg_gen_clzi_i64 and move
its result to the appropriate halfword element of result. This is done in
the inner for loop. After the operation is finished the result is saved
in the corresponding doubleword element of the destination register vD.
This is repeated once more for the low doubleword element of vB.

Optimize the Altivec instruction vclzb (Vector Count Leading Zeros Byte).
This instruction counts the number of leading zeros of each byte element
in the source register and places the result in the corresponding byte
element of the destination register.

Each iteration of the outer for loop performs the count on one doubleword
element of the source register vB. In the first iteration the high
doubleword element of vB is placed in the variable avr, then the count is
performed for every byte element using tcg_gen_clzi_i64. Since it counts
leading zeros over a 64-bit length, we have to move each byte element to
the highest 8 bits of the variable tmp and OR it with a mask (so we get
all ones in the lowest 56 bits), then perform tcg_gen_clzi_i64 and move
its result to the appropriate byte element of result. This is done in the
inner for loop. After the operation is finished the result is saved in
the corresponding doubleword element of the destination register vD. This
is repeated once more for the low doubleword element of vB.
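
The same trick as in the previous (vclzw) patch generalizes to any element
size: shift the element to the top of a 64-bit value and fill the rest
with ones, so the 64-bit count is capped at the element width. A sketch of
the idea with the width as a parameter (illustration only; this
parameterized helper does not appear in the patch):

    #include <stdint.h>

    /* Leading zeros of an n-bit element (n = 8 for vclzb, 16 for vclzh),
     * computed with a single 64-bit count. Assumes 0 < n < 64 and that
     * elem holds the element zero-extended to 64 bits. */
    static unsigned clz_elem_ref(uint64_t elem, unsigned n)
    {
        uint64_t t = (elem << (64 - n)) | (~0ULL >> n);
        return t ? __builtin_clzll(t) : 64;   /* result is in 0..n */
    }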

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 122 +++++++++++++++++++++++++++++++++++-
 1 file changed, 120 insertions(+), 2 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 7689739..8535a31 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -878,6 +878,124 @@ static void trans_vgbbd(DisasContext *ctx)
 }
 
 /*
+ * vclzb VRT,VRB - Vector Count Leading Zeros Byte
+ *
+ * Counting the number of leading zero bits of each byte element in source
+ * register and placing result in appropriate byte element of destination
+ * register.
+ */
+static void trans_vclzb(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every byte element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros over a 64-bit length, we move the ith
+         * byte element to the highest 8 bits of tmp, OR it with mask (so we
+         * get all ones in the lowest 56 bits), then perform tcg_gen_clzi_i64
+         * and move its result to the appropriate byte element of result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 56);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 7; j++) {
+            tcg_gen_shli_i64(tmp, avr, (7 - j) * 8);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 8, 8);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 56, 8);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            set_avr64(VT, result, true);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
+ * vclzh VRT,VRB - Vector Count Leading Zeros Halfword
+ *
+ * Counting the number of leading zero bits of each halfword element in source
+ * register and placing result in appropriate halfword element of destination
+ * register.
+ */
+static void trans_vclzh(DisasContext *ctx)
+{
+    int VT = rD(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 avr = tcg_temp_new_i64();
+    TCGv_i64 result = tcg_temp_new_i64();
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 mask = tcg_const_i64(0xffffffffffffULL);
+    int i, j;
+
+    for (i = 0; i < 2; i++) {
+        if (i == 0) {
+            /* Get high doubleword element of vB in avr. */
+            get_avr64(avr, VB, true);
+        } else {
+            /* Get low doubleword element of vB in avr. */
+            get_avr64(avr, VB, false);
+        }
+        /*
+         * Perform count for every halfword element using tcg_gen_clzi_i64.
+         * Since it counts leading zeros over a 64-bit length, we move the ith
+         * halfword element to the highest 16 bits of tmp, OR it with mask (so
+         * we get all ones in the lowest 48 bits), then tcg_gen_clzi_i64 and
+         * move its result to the appropriate halfword element of result.
+         */
+        tcg_gen_shli_i64(tmp, avr, 48);
+        tcg_gen_or_i64(tmp, tmp, mask);
+        tcg_gen_clzi_i64(result, tmp, 64);
+        for (j = 1; j < 3; j++) {
+            tcg_gen_shli_i64(tmp, avr, (3 - j) * 16);
+            tcg_gen_or_i64(tmp, tmp, mask);
+            tcg_gen_clzi_i64(tmp, tmp, 64);
+            tcg_gen_deposit_i64(result, result, tmp, j * 16, 16);
+        }
+        tcg_gen_or_i64(tmp, avr, mask);
+        tcg_gen_clzi_i64(tmp, tmp, 64);
+        tcg_gen_deposit_i64(result, result, tmp, 48, 16);
+        if (i == 0) {
+            /* Place result in high doubleword element of vD. */
+            set_avr64(VT, result, true);
+        } else {
+            /* Place result in low doubleword element of vD. */
+            set_avr64(VT, result, false);
+        }
+    }
+
+    tcg_temp_free_i64(avr);
+    tcg_temp_free_i64(result);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(mask);
+}
+
+/*
  * vclzw VRT,VRB - Vector Count Leading Zeros Word
  *
  * Counting the number of leading zero bits of each word element in source
@@ -1466,8 +1584,8 @@ GEN_VAFORM_PAIRED(vmsumshm, vmsumshs, 20)
 GEN_VAFORM_PAIRED(vsel, vperm, 21)
 GEN_VAFORM_PAIRED(vmaddfp, vnmsubfp, 23)
 
-GEN_VXFORM_NOA(vclzb, 1, 28)
-GEN_VXFORM_NOA(vclzh, 1, 29)
+GEN_VXFORM_TRANS(vclzb, 1, 28)
+GEN_VXFORM_TRANS(vclzh, 1, 29)
 GEN_VXFORM_TRANS(vclzw, 1, 30)
 GEN_VXFORM_TRANS(vclzd, 1, 31)
 GEN_VXFORM_NOA_2(vnegw, 1, 24, 6)
-- 
2.7.4




* [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
                   ` (6 preceding siblings ...)
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
@ 2019-06-06 10:15 ` Stefan Brankovic
  2019-06-06 20:43   ` Richard Henderson
  2019-06-06 17:13 ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Richard Henderson
  2019-06-07  3:51 ` Howard Spoelstra
  9 siblings, 1 reply; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-06 10:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Since these two instructions were already implemented with TCG, I
refactored them so that they are consistent with the other similar
implementations introduced earlier in this series.

This also required adding a new dual macro, GEN_VXFORM_TRANS_DUAL, which
is used when one instruction of a pair is implemented with direct
translation and the other with a helper.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 62 ++++++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 8535a31..46c6f34 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -350,6 +350,24 @@ static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
     }                                                                  \
 }
 
+/*
+ * We use this macro if one instruction of the pair is implemented with
+ * direct translation and the other with a helper.
+ */
+#define GEN_VXFORM_TRANS_DUAL(name0, flg0, flg2_0, name1, flg1, flg2_1)\
+static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
+{                                                                      \
+    if ((Rc(ctx->opcode) == 0) &&                                      \
+        ((ctx->insns_flags & flg0) || (ctx->insns_flags2 & flg2_0))) { \
+        trans_##name0(ctx);                                            \
+    } else if ((Rc(ctx->opcode) == 1) &&                               \
+        ((ctx->insns_flags & flg1) || (ctx->insns_flags2 & flg2_1))) { \
+        gen_##name1(ctx);                                              \
+    } else {                                                           \
+        gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);            \
+    }                                                                  \
+}
+
 /* Adds support to provide invalid mask */
 #define GEN_VXFORM_DUAL_EXT(name0, flg0, flg2_0, inval0,                \
                             name1, flg1, flg2_1, inval1)                \
@@ -431,20 +449,13 @@ GEN_VXFORM(vmrglb, 6, 4);
 GEN_VXFORM(vmrglh, 6, 5);
 GEN_VXFORM(vmrglw, 6, 6);
 
-static void gen_vmrgew(DisasContext *ctx)
+static void trans_vmrgew(DisasContext *ctx)
 {
-    TCGv_i64 tmp;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    tmp = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(avr, VB, true);
     tcg_gen_shri_i64(tmp, avr, 32);
@@ -462,21 +473,14 @@ static void gen_vmrgew(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
-static void gen_vmrgow(DisasContext *ctx)
+static void trans_vmrgow(DisasContext *ctx)
 {
-    TCGv_i64 t0, t1;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    t0 = tcg_temp_new_i64();
-    t1 = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(t0, VB, true);
     get_avr64(t1, VA, true);
@@ -1213,14 +1217,14 @@ GEN_VXFORM_ENV(vminfp, 5, 17);
 GEN_VXFORM_HETRO(vextublx, 6, 24)
 GEN_VXFORM_HETRO(vextuhlx, 6, 25)
 GEN_VXFORM_HETRO(vextuwlx, 6, 26)
-GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
+GEN_VXFORM_TRANS_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwlx, PPC_NONE, PPC2_ISA300)
 GEN_VXFORM_HETRO(vextubrx, 6, 28)
 GEN_VXFORM_HETRO(vextuhrx, 6, 29)
 GEN_VXFORM_HETRO(vextuwrx, 6, 30)
 GEN_VXFORM_TRANS(lvsl, 6, 31)
 GEN_VXFORM_TRANS(lvsr, 6, 32)
-GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \
+GEN_VXFORM_TRANS_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwrx, PPC_NONE, PPC2_ISA300)
 
 #define GEN_VXRFORM1(opname, name, str, opc2, opc3)                     \
-- 
2.7.4




* Re: [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions Stefan Brankovic
@ 2019-06-06 16:46   ` Richard Henderson
  2019-06-17 11:31     ` Stefan Brankovic
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 16:46 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> +    tcg_gen_addi_i64(result, sh, 7);
> +    for (i = 7; i >= 1; i--) {
> +        tcg_gen_shli_i64(tmp, sh, i * 8);
> +        tcg_gen_or_i64(result, result, tmp);
> +        tcg_gen_addi_i64(sh, sh, 1);
> +    }

Better to replicate sh into the 8 positions and then use one add.

    tcg_gen_muli_i64(sh, sh, 0x0101010101010101ull);
    tcg_gen_addi_i64(hi_result, sh, 0x0001020304050607ull);
    tcg_gen_addi_i64(lo_result, sh, 0x08090a0b0c0d0e0full);

and

    tcg_gen_subfi_i64(hi_result, 0x1011121314151617ull, sh);
    tcg_gen_subfi_i64(lo_result, 0x18191a1b1c1d1e1full, sh);

for lvsr.
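
The replicate-and-add form can be checked in plain C: since sh is at most
15 and every addend byte is at most 0x0f, no per-byte sum carries into the
next byte, so each byte of the result is exactly sh plus its offset. A
sketch of that check (illustration only; the check function is
hypothetical):

    #include <assert.h>
    #include <stdint.h>

    static void check_lvsl_constants(void)
    {
        for (uint64_t sh = 0; sh < 16; sh++) {
            uint64_t rep = sh * 0x0101010101010101ULL;  /* sh in every byte */
            uint64_t hi = rep + 0x0001020304050607ULL;  /* bytes sh..sh+7 */
            uint64_t lo = rep + 0x08090a0b0c0d0e0fULL;  /* bytes sh+8..sh+15 */
            for (int i = 0; i < 8; i++) {
                assert(((hi >> (56 - 8 * i)) & 0xff) == sh + i);
                assert(((lo >> (56 - 8 * i)) & 0xff) == sh + 8 + i);
            }
        }
    }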


r~



* Re: [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions Stefan Brankovic
@ 2019-06-06 17:03   ` Richard Henderson
  2019-06-17 11:36     ` Stefan Brankovic
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 17:03 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> +    tcg_gen_subi_i64(tmp, sh, 64);
> +    tcg_gen_neg_i64(tmp, tmp);

Better as

    tcg_gen_subfi_i64(tmp, 64, sh);


r~



* Re: [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl,
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
                   ` (7 preceding siblings ...)
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
@ 2019-06-06 17:13 ` Richard Henderson
  2019-06-12  7:31   ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
  2019-06-17 11:32   ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
  2019-06-07  3:51 ` Howard Spoelstra
  9 siblings, 2 replies; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 17:13 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> Stefan Brankovic (8):
>   target/ppc: Optimize emulation of lvsl and lvsr instructions
>   target/ppc: Optimize emulation of vsl and vsr instructions
>   target/ppc: Optimize emulation of vpkpx instruction
>   target/ppc: Optimize emulation of vgbbd instruction
>   target/ppc: Optimize emulation of vclzd instruction
>   target/ppc: Optimize emulation of vclzw instruction
>   target/ppc: Optimize emulation of vclzh and vclzb instructions
>   target/ppc: Refactor emulation of vmrgew and vmrgow instructions
> 
>  target/ppc/translate/vmx-impl.inc.c | 705 ++++++++++++++++++++++++++++++++----
>  1 file changed, 636 insertions(+), 69 deletions(-)

You should be removing the out-of-line helpers that are no longer used.


r~



* Re: [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction Stefan Brankovic
@ 2019-06-06 18:19   ` Richard Henderson
  2019-06-17 11:58     ` Stefan Brankovic
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 18:19 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> Optimize altivec instruction vgbbd (Vector Gather Bits by Bytes by Doubleword)
> All ith bits (i in range 1 to 8) of each byte of doubleword element in
> source register are concatenated and placed into ith byte of appropriate
> doubleword element in destination register.
> 
> Following solution is done for every doubleword element of source register
> (placed in shifted variable):
> We gather bits in 2x8 iterations.
> In first iteration bit 1 of byte 1, bit 2 of byte 2,... bit 8 of byte 8 are
> in their final spots so we just and avr with mask. For every next iteration,
> we have to shift right both shifted(7 places) and mask(8 places), so we get
> bit 1 of byte 2, bit 2 of byte 3.. bit 7 of byte 8 in right places so we and
> shifted with new value of mask... After first 8 iteration(first for loop) we
> have all first bits in their final place all second bits but second bit from
> eight byte in their place,... only 1 eight bit from eight byte is in it's
> place), so we and result1 with mask1 to save those bits that are at right
> place and save them in result1. In second loop we do all operations
> symetrical, so we get other half of bits on their final spots, and save
> result in result2. Or of result1 and result2 is placed in appropriate
> doubleword element of vD. We repeat this 2 times.
> 
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++-
>  1 file changed, 98 insertions(+), 1 deletion(-)
> 
> diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
> index 87f69dc..010f337 100644
> --- a/target/ppc/translate/vmx-impl.inc.c
> +++ b/target/ppc/translate/vmx-impl.inc.c
> @@ -780,6 +780,103 @@ static void trans_vsr(DisasContext *ctx)
>      tcg_temp_free_i64(tmp);
>  }
>  
> +/*
> + * vgbbd VRT,VRB - Vector Gather Bits by Bytes by Doubleword
> + *
> + * All ith bits (i in range 1 to 8) of each byte of doubleword element in source
> + * register are concatenated and placed into ith byte of appropriate doubleword
> + * element in destination register.
> + *
> + * Following solution is done for every doubleword element of source register
> + * (placed in shifted variable):
> + * We gather bits in 2x8 iterations.
> + * In first iteration bit 1 of byte 1, bit 2 of byte 2,... bit 8 of byte 8 are
> + * in their final spots so we just and avr with mask. For every next iteration,
> + * we have to shift right both shifted(7 places) and mask(8 places), so we get
> + * bit 1 of byte 2, bit 2 of byte 3.. bit 7 of byte 8 in right places so we and
> + * shifted with new value of mask... After first 8 iteration(first for loop) we
> + * have all first bits in their final place all second bits but second bit from
> + * eight byte in their place,... only 1 eight bit from eight byte is in it's
> + * place), so we and result1 with mask1 to save those bits that are at right
> + * place and save them in result1. In second loop we do all operations
> + * symetrical, so we get other half of bits on their final spots, and save
> + * result in result2. Or of result1 and result2 is placed in appropriate
> + * doubleword element of vD. We repeat this 2 times.
> + */
> +static void trans_vgbbd(DisasContext *ctx)
> +{
> +    int VT = rD(ctx->opcode);
> +    int VB = rB(ctx->opcode);
> +    TCGv_i64 tmp = tcg_temp_new_i64();
> +    TCGv_i64 avr = tcg_temp_new_i64();
> +    TCGv_i64 shifted = tcg_temp_new_i64();
> +    TCGv_i64 result1 = tcg_temp_new_i64();
> +    TCGv_i64 result2 = tcg_temp_new_i64();
> +    uint64_t mask = 0x8040201008040201ULL;
> +    uint64_t mask1 = 0x80c0e0f0f8fcfeffULL;
> +    uint64_t mask2 = 0x7f3f1f0f07030100ULL;
> +    int i;
> +
> +    get_avr64(avr, VB, true);
> +    tcg_gen_movi_i64(result1, 0x0ULL);
> +    tcg_gen_mov_i64(shifted, avr);
> +    for (i = 0; i < 8; i++) {
> +        tcg_gen_andi_i64(tmp, shifted, mask);
> +        tcg_gen_or_i64(result1, result1, tmp);
> +
> +        tcg_gen_shri_i64(shifted, shifted, 7);
> +        mask = mask >> 8;
> +    }
> +    tcg_gen_andi_i64(result1, result1, mask1);

This masking appears to be redundant with the masking within the loop.

> +
> +    mask = 0x8040201008040201ULL;
> +    tcg_gen_movi_i64(result2, 0x0ULL);
> +    for (i = 0; i < 8; i++) {
> +        tcg_gen_andi_i64(tmp, avr, mask);
> +        tcg_gen_or_i64(result2, result2, tmp);
> +
> +        tcg_gen_shli_i64(avr, avr, 7);
> +        mask = mask << 8;
> +    }
> +    tcg_gen_andi_i64(result2, result2, mask2);

Similarly.

Also, the first iteration of the second loop is redundant with the first
iteration of the first loop.

I will also note that these are large constants, not easily constructable.
Therefore it would be best to avoid needing to construct them twice.  You can
do this by processing the two doublewords simultaneously.  e.g.

	TCGv_i64 avr[2], out[2], tmp, tcg_mask;

	identity_mask = 0x8040201008040201ull;
	tcg_gen_movi_i64(tcg_mask, identity_mask);
	for (j = 0; j < 2; j++) {
	    get_avr64(avr[j], VB, j);
	    tcg_gen_and_i64(out[j], avr[j], tcg_mask);
	}
	for (i = 1; i < 8; i++) {
	    tcg_gen_movi_i64(tcg_mask, identity_mask >> (i * 8));
	    for (j = 0; j < 2; j++) {
	        tcg_gen_shri_i64(tmp, avr[j], i * 7);
	        tcg_gen_and_i64(tmp, tmp, tcg_mask);
	        tcg_gen_or_i64(out[j], out[j], tmp);
	    }
	}
	for (i = 1; i < 8; i++) {
	    tcg_gen_movi_i64(tcg_mask, identity_mask << (i * 8));
	    for (j = 0; j < 2; j++) {
	        tcg_gen_shli_i64(tmp, avr[j], i * 7);
	        tcg_gen_and_i64(tmp, tmp, tcg_mask);
	        tcg_gen_or_i64(out[j], out[j], tmp);
	    }
	}
	for (j = 0; j < 2; j++) {
	    set_avr64(VT, out[j], j);
	}

This should produce the same results with fewer operations.
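
For reference, the whole per-doubleword gather can also be modelled in plain C.
This is only a scalar sketch of the same mask-and-shift scheme (the helper name
is made up for illustration), not the TCG code itself:

    static uint64_t gbbd_dword(uint64_t x)
    {
        const uint64_t diag = 0x8040201008040201ULL;
        uint64_t out = x & diag;           /* diagonal bits are already in place */
        int i;

        for (i = 1; i < 8; i++) {
            out |= (x >> (i * 7)) & (diag >> (i * 8));  /* bits below the diagonal */
            out |= (x << (i * 7)) & (diag << (i * 8));  /* bits above the diagonal */
        }
        return out;
    }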


r~



* Re: [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 5/8] target/ppc: Optimize emulation of vclzd instruction Stefan Brankovic
@ 2019-06-06 18:26   ` Richard Henderson
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 18:26 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> Optimize Altivec instruction vclzd (Vector Count Leading Zeros Doubleword).
> This instruction counts the number of leading zeros of each doubleword element
> in source register and places result in the appropriate doubleword element of
> destination register.
> 
> Using tcg's count leading zeros instruction two times (once for each
> doubleword element of source register vB) and placing result in
> appropriate doubleword element of destination register vD.
> 
> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
> ---
>  target/ppc/translate/vmx-impl.inc.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
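
The description boils down to two uses of the 64-bit count; a sketch of the
per-doubleword translation (not the exact patch code; VT, VB and the avr
temporary are assumed to be set up as in the other translators in this file):

    get_avr64(avr, VB, true);
    tcg_gen_clzi_i64(avr, avr, 64);
    set_avr64(VT, avr, true);

    get_avr64(avr, VB, false);
    tcg_gen_clzi_i64(avr, avr, 64);
    set_avr64(VT, avr, false);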

Once the vclzd helper is removed,

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



* Re: [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction Stefan Brankovic
@ 2019-06-06 18:34   ` Richard Henderson
  2019-06-17 11:50     ` Stefan Brankovic
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 18:34 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> +    for (i = 0; i < 2; i++) {
> +        if (i == 0) {
> +            /* Get high doubleword element of vB in avr. */
> +            get_avr64(avr, VB, true);
> +        } else {
> +            /* Get low doubleword element of vB in avr. */
> +            get_avr64(avr, VB, false);
> +        }

Better as simply get_avr64(avr, VB, i);

> +        /*
> +         * Perform count for every word element using tcg_gen_clzi_i64.
> +         * Since it counts leading zeros on 64 bit length, we have to move
> +         * ith word element to highest 32 bits of tmp, OR it with mask (so we get
> +         * all ones in lowest 32 bits), then perform tcg_gen_clzi_i64 and move
> +         * its result in appropriate word element of result.
> +         */
> +        tcg_gen_shli_i64(tmp, avr, 32);
> +        tcg_gen_or_i64(tmp, tmp, mask);
> +        tcg_gen_clzi_i64(result, tmp, 64);
> +
> +        tcg_gen_or_i64(tmp, avr, mask);
> +        tcg_gen_clzi_i64(tmp, tmp, 64);

s/64/32.

> +        tcg_gen_deposit_i64(result, result, tmp, 32, 32);
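
The quoted sequence leans on the observation that padding the low half with
ones keeps the 64-bit count from running past bit 31 (and an all-zero word
yields 32). In plain C, with __builtin_clzll standing in for the TCG op, a
sketch of that identity:

    static uint32_t clz32_model(uint32_t x)
    {
        return __builtin_clzll(((uint64_t)x << 32) | 0xffffffffULL);
    }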

That said, it's probably better to treat this as 4 words, not 2 doublewords.

	for (i = 0; i < 4; i++) {
	    tcg_gen_ld_i32(tmp, cpu_env, avr_full_offset(VB) + i * 4);
	    tcg_gen_clzi_i32(tmp, tmp, 32);
	    tcg_gen_st_i32(tmp, cpu_env, avr_full_offset(VT) + i * 4);
	}


r~



* Re: [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions Stefan Brankovic
@ 2019-06-06 20:38   ` Richard Henderson
  2019-06-17 11:42     ` Stefan Brankovic
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 20:38 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
> This instruction counts the number of leading zeros of each halfword element
> in source register and places result in the appropriate halfword element of
> destination register.
For halfword, you're generating 32 operations.  A loop over the halfwords,
similar to the word loop I suggested for the last patch, does not reduce this
total, since one has to adjust the clz32 result.
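
The adjustment amounts to one extra operation per element: either subtract 16
from the clz32 of the zero-extended halfword, or pad the low half with ones as
in the word case. A scalar sketch of the latter (helper name made up):

    static uint8_t clz16_model(uint16_t x)
    {
        return __builtin_clz(((uint32_t)x << 16) | 0xffff);
    }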

For byte, you're generating 64 operations.

These expansions are so big that without host vector support it's probably best
to leave them out-of-line.

I can imagine a byte clz expansion like

	t0 = input >> 4;
	t1 = input << 4;
	cmp = t0 == 0 ? -1 : 0;
	input = cmp ? t1 : input;
	output = cmp & 4;

	t0 = input >> 6;
	t1 = input << 2;
	cmp = t0 == 0 ? -1 : 0;
	input = cmp ? t1 : input;
	t0 = cmp & 2;
	output += t0;

	t1 = input << 1;
	cmp = input >= 0 ? -1 : 0;
	output -= cmp;

	cmp = input == 0 ? -1 : 0;
	output -= cmp;

which would expand to 20 x86_64 vector instructions.  A halfword expansion
would require one more round and thus 25 instructions.
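
As a cross-check of the idea, the same bisection can be written per byte lane
in plain C. This mirrors the expansion above (the helper name is made up; the
real thing would operate on all 16 lanes at once):

    static uint8_t clz8_model(uint8_t input)
    {
        int8_t cmp;
        uint8_t output;

        cmp = (input >> 4) == 0 ? -1 : 0;   /* top nibble empty? */
        input = cmp ? input << 4 : input;
        output = cmp & 4;

        cmp = (input >> 6) == 0 ? -1 : 0;   /* top two bits empty? */
        input = cmp ? input << 2 : input;
        output += cmp & 2;

        cmp = (int8_t)input >= 0 ? -1 : 0;  /* top bit clear? */
        output -= cmp;

        cmp = input == 0 ? -1 : 0;          /* all-zero byte counts as 8 */
        output -= cmp;

        return output;
    }

A halfword version adds one more round of the same shape, which matches the
25-instruction estimate above.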

I'll also note that ARM, Power8, and S390 all support this as a native vector
operation; only x86_64 would require the above expansion.  It probably makes
sense to add this operation to tcg.


r~



* Re: [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-06-06 10:15 ` [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions Stefan Brankovic
@ 2019-06-06 20:43   ` Richard Henderson
  2019-06-17 11:43     ` Stefan Brankovic
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Henderson @ 2019-06-06 20:43 UTC (permalink / raw)
  To: Stefan Brankovic, qemu-devel; +Cc: david

On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> +/*
> + * We use this macro if one instruction is realized with direct
> + * translation, and second one with helper.
> + */
> +#define GEN_VXFORM_TRANS_DUAL(name0, flg0, flg2_0, name1, flg1, flg2_1)\
> +static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
> +{                                                                      \
> +    if ((Rc(ctx->opcode) == 0) &&                                      \
> +        ((ctx->insns_flags & flg0) || (ctx->insns_flags2 & flg2_0))) { \
> +        trans_##name0(ctx);                                            \
> +    } else if ((Rc(ctx->opcode) == 1) &&                               \
> +        ((ctx->insns_flags & flg1) || (ctx->insns_flags2 & flg2_1))) { \
> +        gen_##name1(ctx);                                              \
> +    } else {                                                           \
> +        gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);            \
> +    }                                                                  \
> +}
> +
>  /* Adds support to provide invalid mask */
>  #define GEN_VXFORM_DUAL_EXT(name0, flg0, flg2_0, inval0,                \
>                              name1, flg1, flg2_1, inval1)                \
> @@ -431,20 +449,13 @@ GEN_VXFORM(vmrglb, 6, 4);
>  GEN_VXFORM(vmrglh, 6, 5);
>  GEN_VXFORM(vmrglw, 6, 6);
>  
> -static void gen_vmrgew(DisasContext *ctx)
> +static void trans_vmrgew(DisasContext *ctx)
>  {
> -    TCGv_i64 tmp;
> -    TCGv_i64 avr;
> -    int VT, VA, VB;
> -    if (unlikely(!ctx->altivec_enabled)) {
> -        gen_exception(ctx, POWERPC_EXCP_VPU);
> -        return;
> -    }

This appears to drop the check for altivec_enabled.
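
Concretely, the direct-translation path still needs the check the old
gen_vmrgew() performed before touching any registers, e.g. (either in the
trans_ functions or in the dispatching macro):

    if (unlikely(!ctx->altivec_enabled)) {
        gen_exception(ctx, POWERPC_EXCP_VPU);
        return;
    }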


r~



* Re: [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl,
  2019-06-06 10:15 [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
                   ` (8 preceding siblings ...)
  2019-06-06 17:13 ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Richard Henderson
@ 2019-06-07  3:51 ` Howard Spoelstra
  9 siblings, 0 replies; 27+ messages in thread
From: Howard Spoelstra @ 2019-06-07  3:51 UTC (permalink / raw)
  To: qemu-devel qemu-devel

Hi,

This series gives me several compilation errors.
When compiled with --disable-werror, OSX 10.3 guest on qemu-system-ppc
shows corrupted desktop graphics.

Compiled with:
./configure --target-list="ppc-softmmu" --enable-sdl --enable-gtk  && make
-j8

gcc is:
[hsp@fedora30 qemu-master]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap
--enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr
--mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=
http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix
--enable-checking=release --enable-multilib --with-system-zlib
--enable-__cxa_atexit --disable-libunwind-exceptions
--enable-gnu-unique-object --enable-linker-build-id
--with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin
--enable-initfini-array --with-isl --enable-offload-targets=nvptx-none
--without-cuda-driver --enable-gnu-indirect-function --enable-cet
--with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.1.1 20190503 (Red Hat 9.1.1-1) (GCC)

Errors are:

/home/hsp/src/qemu-master/tcg/tcg-op.h:837:24: error: initialization of
‘TCGv_i64’ {aka ‘struct TCGv_i64_d *’} from incompatible pointer type
‘TCGv_i32’ {aka ‘struct TCGv_i32_d *’} [-Werror=incompatible-pointer-types]
  837 | #define tcg_temp_new() tcg_temp_new_i32()
      |                        ^~~~~~~~~~~~~~~~
/home/hsp/src/qemu-master/target/ppc/translate/vmx-impl.inc.c:513:19: note:
in expansion of macro ‘tcg_temp_new’
  513 |     TCGv_i64 EA = tcg_temp_new();
      |                   ^~~~~~~~~~~~
In file included from /home/hsp/src/qemu-master/target/ppc/translate.c:6826:
/home/hsp/src/qemu-master/target/ppc/translate/vmx-impl.inc.c:517:29:
error: passing argument 2 of ‘gen_addr_reg_index’ from incompatible pointer
type [-Werror=incompatible-pointer-types]
  517 |     gen_addr_reg_index(ctx, EA);
      |                             ^~
      |                             |
      |                             TCGv_i64 {aka struct TCGv_i64_d *}
/home/hsp/src/qemu-master/target/ppc/translate.c:2398:63: note: expected
‘TCGv_i32’ {aka ‘struct TCGv_i32_d *’} but argument is of type ‘TCGv_i64’
{aka ‘struct TCGv_i64_d *’}
 2398 | static inline void gen_addr_reg_index(DisasContext *ctx, TCGv EA)
In file included from /home/hsp/src/qemu-master/target/ppc/translate.c:6826:
/home/hsp/src/qemu-master/target/ppc/translate/vmx-impl.inc.c:545:19:
error: passing argument 1 of ‘tcg_temp_free_i32’ from incompatible pointer
type [-Werror=incompatible-pointer-types]
  545 |     tcg_temp_free(EA);
      |                   ^~
      |                   |
      |                   TCGv_i64 {aka struct TCGv_i64_d *}
In file included from /home/hsp/src/qemu-master/tcg/tcg-op.h:28,
                 from /home/hsp/src/qemu-master/target/ppc/translate.c:26:
/home/hsp/src/qemu-master/tcg/tcg.h:933:47: note: expected ‘TCGv_i32’ {aka
‘struct TCGv_i32_d *’} but argument is of type ‘TCGv_i64’ {aka ‘struct
TCGv_i64_d *’}
  933 | static inline void tcg_temp_free_i32(TCGv_i32 arg)
      |                                      ~~~~~~~~~^~~
In file included from /home/hsp/src/qemu-master/target/ppc/translate.c:26:
/home/hsp/src/qemu-master/target/ppc/translate/vmx-impl.inc.c: In function
‘trans_lvsr’:
/home/hsp/src/qemu-master/tcg/tcg-op.h:837:24: error: initialization of
‘TCGv_i64’ {aka ‘struct TCGv_i64_d *’} from incompatible pointer type
‘TCGv_i32’ {aka ‘struct TCGv_i32_d *’} [-Werror=incompatible-pointer-types]
  837 | #define tcg_temp_new() tcg_temp_new_i32()
      |                        ^~~~~~~~~~~~~~~~
/home/hsp/src/qemu-master/target/ppc/translate/vmx-impl.inc.c:561:19: note:
in expansion of macro ‘tcg_temp_new’
  561 |     TCGv_i64 EA = tcg_temp_new();
      |                   ^~~~~~~~~~~~
In file included from /home/hsp/src/qemu-master/target/ppc/translate.c:6826:
/home/hsp/src/qemu-master/target/ppc/translate/vmx-impl.inc.c:565:29:
error: passing argument 2 of ‘gen_addr_reg_index’ from incompatible pointer
type [-Werror=incompatible-pointer-types]
  565 |     gen_addr_reg_index(ctx, EA);
      |                             ^~
      |                             |
      |                             TCGv_i64 {aka struct TCGv_i64_d *}
/home/hsp/src/qemu-master/target/ppc/translate.c:2398:63: note: expected
‘TCGv_i32’ {aka ‘struct TCGv_i32_d *’} but argument is of type ‘TCGv_i64’
{aka ‘struct TCGv_i64_d *’}
 2398 | static inline void gen_addr_reg_index(DisasContext *ctx, TCGv EA)
In file included from /home/hsp/src/qemu-master/target/ppc/translate.c:6826:
/home/hsp/src/qemu-master/target/ppc/translate/vmx-impl.inc.c:596:19:
error: passing argument 1 of ‘tcg_temp_free_i32’ from incompatible pointer
type [-Werror=incompatible-pointer-types]
  596 |     tcg_temp_free(EA);
      |                   ^~
      |                   |
      |                   TCGv_i64 {aka struct TCGv_i64_d *}
In file included from /home/hsp/src/qemu-master/tcg/tcg-op.h:28,
                 from /home/hsp/src/qemu-master/target/ppc/translate.c:26:
/home/hsp/src/qemu-master/tcg/tcg.h:933:47: note: expected ‘TCGv_i32’ {aka
‘struct TCGv_i32_d *’} but argument is of type ‘TCGv_i64’ {aka ‘struct
TCGv_i64_d *’}
  933 | static inline void tcg_temp_free_i32(TCGv_i32 arg)
      |                                      ~~~~~~~~~^~~
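
All of the errors point at the same mismatch: EA is declared as TCGv_i64 while
tcg_temp_new() and gen_addr_reg_index() operate on the target-width TCGv, which
is 32-bit for ppc-softmmu. A guess at the kind of change that would make the
types line up (an assumption, not a tested fix):

    TCGv EA = tcg_temp_new();   /* target-width temp, as gen_addr_reg_index() expects */
    gen_addr_reg_index(ctx, EA);
    /* ... EA used as the effective address ... */
    tcg_temp_free(EA);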


Best,
Howard


* Re: [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl,
  2019-06-06 17:13 ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Richard Henderson
@ 2019-06-12  7:31   ` Stefan Brankovic
  2019-06-17 11:32   ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
  1 sibling, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-12  7:31 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, david


>
>
> -------- Original Message --------
> Subject: Re: [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl,
> Date: Thursday, June 6, 2019 19:13 CEST
> From: Richard Henderson <richard.henderson@linaro.org>
> To: Stefan Brankovic <stefan.brankovic@rt-rk.com>, qemu-devel@nongnu.org
> CC: david@gibson.dropbear.id.au
> References: <1559816130-17113-1-git-send-email-stefan.brankovic@rt-rk.com>
>
>
>
> > On 6/6/19 5:15 AM, Stefan Brankovic wrote:
> > > Stefan Brankovic (8):
> > > target/ppc: Optimize emulation of lvsl and lvsr instructions
> > > target/ppc: Optimize emulation of vsl and vsr instructions
> > > target/ppc: Optimize emulation of vpkpx instruction
> > > target/ppc: Optimize emulation of vgbbd instruction
> > > target/ppc: Optimize emulation of vclzd instruction
> > > target/ppc: Optimize emulation of vclzw instruction
> > > target/ppc: Optimize emulation of vclzh and vclzb instructions
> > > target/ppc: Refactor emulation of vmrgew and vmrgow instructions
> > >
> > > target/ppc/translate/vmx-impl.inc.c | 705 ++++++++++++++++++++++++++++++++----
> > > 1 file changed, 636 insertions(+), 69 deletions(-)
> >
> > You should be removing the out-of-line helpers that are no longer used.
> >

Thank you for taking the time to review my code. I think your suggestions
are all constructive and very useful. However, I was on a short leave this
week and couldn't respond promptly. I will respond with more details in the
next few days.

Kind Regards,
Stefan

> >
> > r~
>
>
>


* Re: [Qemu-devel] [PATCH 1/8] target/ppc: Optimize emulation of lvsl and lvsr instructions
  2019-06-06 16:46   ` Richard Henderson
@ 2019-06-17 11:31     ` Stefan Brankovic
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:31 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 18:46, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> +    tcg_gen_addi_i64(result, sh, 7);
>> +    for (i = 7; i >= 1; i--) {
>> +        tcg_gen_shli_i64(tmp, sh, i * 8);
>> +        tcg_gen_or_i64(result, result, tmp);
>> +        tcg_gen_addi_i64(sh, sh, 1);
>> +    }
> Better to replicate sh into the 8 positions and then use one add.
>
>      tcg_gen_muli_i64(sh, sh, 0x0101010101010101ull);
>      tcg_gen_addi_i64(hi_result, sh, 0x0001020304050607ull);
>      tcg_gen_addi_i64(lo_result, sh, 0x08090a0b0c0d0e0full);
>
> and
>
>      tcg_gen_subfi_i64(hi_result, 0x1011121314151617ull, sh);
>      tcg_gen_subfi_i64(lo_result, 0x18191a1b1c1d1e1full, sh);
>
> for lvsr.
>
I think you are right, this is definitely a better way of implementing it.
I will adopt your approach in v2.
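
As a quick sanity check of the trick (plain scalar arithmetic, not the TCG
code): replicating sh into every byte and then doing a single add or subtract
produces exactly the byte patterns lvsl/lvsr need, e.g. for sh == 3:

    uint64_t sh = 3;
    uint64_t rep = sh * 0x0101010101010101ULL;       /* 0x0303030303030303 */
    uint64_t lvsl_hi = rep + 0x0001020304050607ULL;  /* 0x030405060708090a, bytes 3..10  */
    uint64_t lvsr_hi = 0x1011121314151617ULL - rep;  /* 0x0d0e0f1011121314, bytes 13..20 */

No byte ever carries into its neighbour, since sh is at most 15.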

Kind Regards,

Stefan

> r~



* Re: [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl,
  2019-06-06 17:13 ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Richard Henderson
  2019-06-12  7:31   ` [Qemu-devel] [PATCH 0/8] Optimize emulation of ten Altivec instructions: lvsl, Stefan Brankovic
@ 2019-06-17 11:32   ` Stefan Brankovic
  1 sibling, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:32 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 19:13, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> Stefan Brankovic (8):
>>    target/ppc: Optimize emulation of lvsl and lvsr instructions
>>    target/ppc: Optimize emulation of vsl and vsr instructions
>>    target/ppc: Optimize emulation of vpkpx instruction
>>    target/ppc: Optimize emulation of vgbbd instruction
>>    target/ppc: Optimize emulation of vclzd instruction
>>    target/ppc: Optimize emulation of vclzw instruction
>>    target/ppc: Optimize emulation of vclzh and vclzb instructions
>>    target/ppc: Refactor emulation of vmrgew and vmrgow instructions
>>
>>   target/ppc/translate/vmx-impl.inc.c | 705 ++++++++++++++++++++++++++++++++----
>>   1 file changed, 636 insertions(+), 69 deletions(-)
> You should be removing the out-of-line helpers that are no longer used.
>
I agree. I will remove them in v2.

Kind Regards,

Stefan

> r~



* Re: [Qemu-devel] [PATCH 2/8] target/ppc: Optimize emulation of vsl and vsr instructions
  2019-06-06 17:03   ` Richard Henderson
@ 2019-06-17 11:36     ` Stefan Brankovic
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:36 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 19:03, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> +    tcg_gen_subi_i64(tmp, sh, 64);
>> +    tcg_gen_neg_i64(tmp, tmp);
> Better as
>
>      tcg_gen_subfi_i64(tmp, 64, sh);
>
I was aware there must be a way of doing it in a single tcg invocation,
but couldn't find the right tcg instruction. I will apply this in v2.

Kind Regards,

Stefan

> r~



* Re: [Qemu-devel] [PATCH 7/8] target/ppc: Optimize emulation of vclzh and vclzb instructions
  2019-06-06 20:38   ` Richard Henderson
@ 2019-06-17 11:42     ` Stefan Brankovic
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:42 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 22:38, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> Optimize Altivec instruction vclzh (Vector Count Leading Zeros Halfword).
>> This instruction counts the number of leading zeros of each halfword element
>> in source register and places result in the appropriate halfword element of
>> destination register.
> For halfword, you're generating 32 operations.  A loop over the halfwords,
> similar to the word loop I suggested for the last patch, does not reduce this
> total, since one has to adjust the clz32 result.
>
> For byte, you're generating 64 operations.
>
> These expansions are so big that without host vector support it's probably best
> to leave them out-of-line.
>
> I can imagine a byte clz expansion like
>
> 	t0 = input >> 4;
> 	t1 = input << 4;
> 	cmp = t0 == 0 ? -1 : 0;
> 	input = cmp ? t1 : input;
> 	output = cmp & 4;
>
> 	t0 = input >> 6;
> 	t1 = input << 2;
> 	cmp = t0 == 0 ? -1 : 0;
> 	input = cmp ? t1 : input;
> 	t0 = cmp & 2;
> 	output += t0;
>
> 	t1 = input << 1;
> 	cmp = input >= 0 ? -1 : 0;
> 	output -= cmp;
>
> 	cmp = input == 0 ? -1 : 0;
> 	output -= cmp;
>
> which would expand to 20 x86_64 vector instructions.  A halfword expansion
> would require one more round and thus 25 instructions.

I based this patch on performance results, and my measurements say that the
tcg implementation is still significantly faster than the helper
implementation, despite the somewhat large number of instructions.

I can attach both the performance measurement results and the disassembly of
the helper and tcg implementations, if you want me to.

>
> I'll also note that ARM, Power8, and S390 all support this as a native vector
> operation; only x86_64 would require the above expansion.  It probably makes
> sense to add this operation to tcg.

I agree with this, but currently we don't have this implemented in tcg, 
so I worked with what I have.

Kind Regards,

Stefan

> r~



* Re: [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-06-06 20:43   ` Richard Henderson
@ 2019-06-17 11:43     ` Stefan Brankovic
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:43 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 22:43, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> +/*
>> + * We use this macro if one instruction is realized with direct
>> + * translation, and second one with helper.
>> + */
>> +#define GEN_VXFORM_TRANS_DUAL(name0, flg0, flg2_0, name1, flg1, flg2_1)\
>> +static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
>> +{                                                                      \
>> +    if ((Rc(ctx->opcode) == 0) &&                                      \
>> +        ((ctx->insns_flags & flg0) || (ctx->insns_flags2 & flg2_0))) { \
>> +        trans_##name0(ctx);                                            \
>> +    } else if ((Rc(ctx->opcode) == 1) &&                               \
>> +        ((ctx->insns_flags & flg1) || (ctx->insns_flags2 & flg2_1))) { \
>> +        gen_##name1(ctx);                                              \
>> +    } else {                                                           \
>> +        gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);            \
>> +    }                                                                  \
>> +}
>> +
>>   /* Adds support to provide invalid mask */
>>   #define GEN_VXFORM_DUAL_EXT(name0, flg0, flg2_0, inval0,                \
>>                               name1, flg1, flg2_1, inval1)                \
>> @@ -431,20 +449,13 @@ GEN_VXFORM(vmrglb, 6, 4);
>>   GEN_VXFORM(vmrglh, 6, 5);
>>   GEN_VXFORM(vmrglw, 6, 6);
>>   
>> -static void gen_vmrgew(DisasContext *ctx)
>> +static void trans_vmrgew(DisasContext *ctx)
>>   {
>> -    TCGv_i64 tmp;
>> -    TCGv_i64 avr;
>> -    int VT, VA, VB;
>> -    if (unlikely(!ctx->altivec_enabled)) {
>> -        gen_exception(ctx, POWERPC_EXCP_VPU);
>> -        return;
>> -    }
> This appears to drop the check for altivec_enabled.
>
Thank you for spotting this, I will fix this bug in v2.

Kind Regards,

Stefan

> r~



* Re: [Qemu-devel] [PATCH 6/8] target/ppc: Optimize emulation of vclzw instruction
  2019-06-06 18:34   ` Richard Henderson
@ 2019-06-17 11:50     ` Stefan Brankovic
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:50 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 20:34, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> +    for (i = 0; i < 2; i++) {
>> +        if (i == 0) {
>> +            /* Get high doubleword element of vB in avr. */
>> +            get_avr64(avr, VB, true);
>> +        } else {
>> +            /* Get low doubleword element of vB in avr. */
>> +            get_avr64(avr, VB, false);
>> +        }
> Better as simply get_avr64(avr, VB, i);
Definitely a shorter way to do this.
>
>> +        /*
>> +         * Perform count for every word element using tcg_gen_clzi_i64.
> >> +         * Since it counts leading zeros on 64 bit length, we have to move
> >> +         * ith word element to highest 32 bits of tmp, OR it with mask (so we get
> >> +         * all ones in lowest 32 bits), then perform tcg_gen_clzi_i64 and move
> >> +         * its result in appropriate word element of result.
>> +         */
>> +        tcg_gen_shli_i64(tmp, avr, 32);
>> +        tcg_gen_or_i64(tmp, tmp, mask);
>> +        tcg_gen_clzi_i64(result, tmp, 64);
>> +
>> +        tcg_gen_or_i64(tmp, avr, mask);
>> +        tcg_gen_clzi_i64(tmp, tmp, 64);
> s/64/32.
>
>> +        tcg_gen_deposit_i64(result, result, tmp, 32, 32);
> That said, it's probably better to treat this as 4 words, not 2 doublewords.
>
> 	for (i = 0; i < 4; i++) {
> 	    tcg_gen_ld_i32(tmp, cpu_env, avr_full_offset(VB) + i * 4);
> 	    tcg_gen_clzi_i32(tmp, tmp, 32);
> 	    tcg_gen_st_i32(tmp, cpu_env, avr_full_offset(VT) + i * 4);
> 	}
>
I will use this way in v2.

Kind Regards,

Stefan

> r~



* Re: [Qemu-devel] [PATCH 4/8] target/ppc: Optimize emulation of vgbbd instruction
  2019-06-06 18:19   ` Richard Henderson
@ 2019-06-17 11:58     ` Stefan Brankovic
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-17 11:58 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: david


On 6.6.19. 20:19, Richard Henderson wrote:
> On 6/6/19 5:15 AM, Stefan Brankovic wrote:
>> Optimize altivec instruction vgbbd (Vector Gather Bits by Bytes by Doubleword)
>> All ith bits (i in range 1 to 8) of each byte of doubleword element in
>> source register are concatenated and placed into ith byte of appropriate
>> doubleword element in destination register.
>>
>> Following solution is done for every doubleword element of source register
>> (placed in shifted variable):
>> We gather bits in 2x8 iterations.
>> In first iteration bit 1 of byte 1, bit 2 of byte 2,... bit 8 of byte 8 are
>> in their final spots, so we just AND avr with mask. For every next iteration,
>> we have to shift right both shifted (7 places) and mask (8 places), so we get
>> bit 1 of byte 2, bit 2 of byte 3, ... bit 7 of byte 8 in the right places, so we AND
>> shifted with the new value of mask. After the first 8 iterations (first for loop) we
>> have all first bits in their final place, all second bits but the second bit of the
>> eighth byte in their place, ... and only one eighth bit (from the eighth byte) is in its
>> place, so we AND result1 with mask1 to save those bits that are in the right
>> place and save them in result1. In the second loop we do all operations
>> symmetrically, so we get the other half of the bits in their final spots, and save
>> the result in result2. The OR of result1 and result2 is placed in the appropriate
>> doubleword element of vD. We repeat this for both doubleword elements.
>>
>> Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
>> ---
>>   target/ppc/translate/vmx-impl.inc.c | 99 ++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 98 insertions(+), 1 deletion(-)
>>
>> diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
>> index 87f69dc..010f337 100644
>> --- a/target/ppc/translate/vmx-impl.inc.c
>> +++ b/target/ppc/translate/vmx-impl.inc.c
>> @@ -780,6 +780,103 @@ static void trans_vsr(DisasContext *ctx)
>>       tcg_temp_free_i64(tmp);
>>   }
>>   
>> +/*
>> + * vgbbd VRT,VRB - Vector Gather Bits by Bytes by Doubleword
>> + *
>> + * All ith bits (i in range 1 to 8) of each byte of doubleword element in source
>> + * register are concatenated and placed into ith byte of appropriate doubleword
>> + * element in destination register.
>> + *
>> + * Following solution is done for every doubleword element of source register
>> + * (placed in shifted variable):
>> + * We gather bits in 2x8 iterations.
>> + * In first iteration bit 1 of byte 1, bit 2 of byte 2,... bit 8 of byte 8 are
>> + * in their final spots, so we just AND avr with mask. For every next iteration,
>> + * we have to shift right both shifted (7 places) and mask (8 places), so we get
>> + * bit 1 of byte 2, bit 2 of byte 3, ... bit 7 of byte 8 in the right places, so we AND
>> + * shifted with the new value of mask. After the first 8 iterations (first for loop) we
>> + * have all first bits in their final place, all second bits but the second bit of the
>> + * eighth byte in their place, ... and only one eighth bit (from the eighth byte) is in its
>> + * place, so we AND result1 with mask1 to save those bits that are in the right
>> + * place and save them in result1. In the second loop we do all operations
>> + * symmetrically, so we get the other half of the bits in their final spots, and save
>> + * the result in result2. The OR of result1 and result2 is placed in the appropriate
>> + * doubleword element of vD. We repeat this for both doubleword elements.
>> + */
>> +static void trans_vgbbd(DisasContext *ctx)
>> +{
>> +    int VT = rD(ctx->opcode);
>> +    int VB = rB(ctx->opcode);
>> +    TCGv_i64 tmp = tcg_temp_new_i64();
>> +    TCGv_i64 avr = tcg_temp_new_i64();
>> +    TCGv_i64 shifted = tcg_temp_new_i64();
>> +    TCGv_i64 result1 = tcg_temp_new_i64();
>> +    TCGv_i64 result2 = tcg_temp_new_i64();
>> +    uint64_t mask = 0x8040201008040201ULL;
>> +    uint64_t mask1 = 0x80c0e0f0f8fcfeffULL;
>> +    uint64_t mask2 = 0x7f3f1f0f07030100ULL;
>> +    int i;
>> +
>> +    get_avr64(avr, VB, true);
>> +    tcg_gen_movi_i64(result1, 0x0ULL);
>> +    tcg_gen_mov_i64(shifted, avr);
>> +    for (i = 0; i < 8; i++) {
>> +        tcg_gen_andi_i64(tmp, shifted, mask);
>> +        tcg_gen_or_i64(result1, result1, tmp);
>> +
>> +        tcg_gen_shri_i64(shifted, shifted, 7);
>> +        mask = mask >> 8;
>> +    }
>> +    tcg_gen_andi_i64(result1, result1, mask1);
> This masking appears to be redundant with the masking within the loop.
>
>> +
>> +    mask = 0x8040201008040201ULL;
>> +    tcg_gen_movi_i64(result2, 0x0ULL);
>> +    for (i = 0; i < 8; i++) {
>> +        tcg_gen_andi_i64(tmp, avr, mask);
>> +        tcg_gen_or_i64(result2, result2, tmp);
>> +
>> +        tcg_gen_shli_i64(avr, avr, 7);
>> +        mask = mask << 8;
>> +    }
>> +    tcg_gen_andi_i64(result2, result2, mask2);
> Similarly.
>
> Also, the first iteration of the second loop is redundant with the first
> iteration of the first loop.
>
> I will also note that these are large constants, not easily constructable.
> Therefore it would be best to avoid needing to construct them twice.  You can
> do this by processing the two doublewords simultaneously.  e.g.
>
> 	TCGv_i64 avr[2], out[2], tmp, tcg_mask;
>
> 	identity_mask = 0x8040201008040201ull;
> 	tcg_gen_movi_i64(tcg_mask, identity_mask);
> 	for (j = 0; j < 2; j++) {
> 	    get_avr64(avr[j], VB, j);
> 	    tcg_gen_and_i64(out[j], avr[j], tcg_mask);
> 	}
> 	for (i = 1; i < 8; i++) {
> 	    tcg_gen_movi_i64(tcg_mask, identity_mask >> (i * 8));
> 	    for (j = 0; j < 2; j++) {
> 	        tcg_gen_shri_i64(tmp, avr[j], i * 7);
> 	        tcg_gen_and_i64(tmp, tmp, tcg_mask);
> 	        tcg_gen_or_i64(out[j], out[j], tmp);
> 	    }
> 	}
> 	for (i = 1; i < 8; i++) {
> 	    tcg_gen_movi_i64(tcg_mask, identity_mask << (i * 8));
> 	    for (j = 0; j < 2; j++) {
> 	        tcg_gen_shli_i64(tmp, avr[j], i * 7);
> 	        tcg_gen_and_i64(tmp, tmp, tcg_mask);
> 	        tcg_gen_or_i64(out[j], out[j], tmp);
> 	    }
> 	}
> 	for (j = 0; j < 2; j++) {
> 	    set_avr64(VT, out[j], j);
> 	}
>
> This should produce the same results with fewer operations.

I agree with you, this should produce the same result with fewer
instructions. I will implement this in v2.

Kind Regards,

Stefan

>
> r~



* [Qemu-devel] [PATCH 8/8] target/ppc: Refactor emulation of vmrgew and vmrgow instructions
  2019-06-19 11:03 [Qemu-devel] [PATCH 0/8] target/ppc: Optimize emulation of some Altivec instructions Stefan Brankovic
@ 2019-06-19 11:03 ` Stefan Brankovic
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Brankovic @ 2019-06-19 11:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: david

Since I found these two instructions already implemented with tcg, I
refactored them so that they are consistent with the other similar
implementations that I introduced in this series.

Also, a new dual macro GEN_VXFORM_TRANS_DUAL is added. This macro is
used if one instruction is realized with direct translation and the
second one with a helper.

Signed-off-by: Stefan Brankovic <stefan.brankovic@rt-rk.com>
---
 target/ppc/translate/vmx-impl.inc.c | 66 +++++++++++++++++++++----------------
 1 file changed, 37 insertions(+), 29 deletions(-)

diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 81569a8..f052dcb 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -350,6 +350,28 @@ static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
     }                                                                  \
 }
 
+/*
+ * We use this macro if one instruction is realized with direct
+ * translation, and second one with helper.
+ */
+#define GEN_VXFORM_TRANS_DUAL(name0, flg0, flg2_0, name1, flg1, flg2_1)\
+static void glue(gen_, name0##_##name1)(DisasContext *ctx)             \
+{                                                                      \
+    if ((Rc(ctx->opcode) == 0) &&                                      \
+        ((ctx->insns_flags & flg0) || (ctx->insns_flags2 & flg2_0))) { \
+        if (unlikely(!ctx->altivec_enabled)) {                         \
+            gen_exception(ctx, POWERPC_EXCP_VPU);                      \
+            return;                                                    \
+        }                                                              \
+        trans_##name0(ctx);                                            \
+    } else if ((Rc(ctx->opcode) == 1) &&                               \
+        ((ctx->insns_flags & flg1) || (ctx->insns_flags2 & flg2_1))) { \
+        gen_##name1(ctx);                                              \
+    } else {                                                           \
+        gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);            \
+    }                                                                  \
+}
+
 /* Adds support to provide invalid mask */
 #define GEN_VXFORM_DUAL_EXT(name0, flg0, flg2_0, inval0,                \
                             name1, flg1, flg2_1, inval1)                \
@@ -431,20 +453,13 @@ GEN_VXFORM(vmrglb, 6, 4);
 GEN_VXFORM(vmrglh, 6, 5);
 GEN_VXFORM(vmrglw, 6, 6);
 
-static void gen_vmrgew(DisasContext *ctx)
+static void trans_vmrgew(DisasContext *ctx)
 {
-    TCGv_i64 tmp;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    tmp = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(avr, VB, true);
     tcg_gen_shri_i64(tmp, avr, 32);
@@ -462,21 +477,14 @@ static void gen_vmrgew(DisasContext *ctx)
     tcg_temp_free_i64(avr);
 }
 
-static void gen_vmrgow(DisasContext *ctx)
+static void trans_vmrgow(DisasContext *ctx)
 {
-    TCGv_i64 t0, t1;
-    TCGv_i64 avr;
-    int VT, VA, VB;
-    if (unlikely(!ctx->altivec_enabled)) {
-        gen_exception(ctx, POWERPC_EXCP_VPU);
-        return;
-    }
-    VT = rD(ctx->opcode);
-    VA = rA(ctx->opcode);
-    VB = rB(ctx->opcode);
-    t0 = tcg_temp_new_i64();
-    t1 = tcg_temp_new_i64();
-    avr = tcg_temp_new_i64();
+    int VT = rD(ctx->opcode);
+    int VA = rA(ctx->opcode);
+    int VB = rB(ctx->opcode);
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    TCGv_i64 t1 = tcg_temp_new_i64();
+    TCGv_i64 avr = tcg_temp_new_i64();
 
     get_avr64(t0, VB, true);
     get_avr64(t1, VA, true);
@@ -1142,14 +1150,14 @@ GEN_VXFORM_ENV(vminfp, 5, 17);
 GEN_VXFORM_HETRO(vextublx, 6, 24)
 GEN_VXFORM_HETRO(vextuhlx, 6, 25)
 GEN_VXFORM_HETRO(vextuwlx, 6, 26)
-GEN_VXFORM_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
+GEN_VXFORM_TRANS_DUAL(vmrgow, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwlx, PPC_NONE, PPC2_ISA300)
 GEN_VXFORM_HETRO(vextubrx, 6, 28)
 GEN_VXFORM_HETRO(vextuhrx, 6, 29)
 GEN_VXFORM_HETRO(vextuwrx, 6, 30)
 GEN_VXFORM_TRANS(lvsl, 6, 31)
 GEN_VXFORM_TRANS(lvsr, 6, 32)
-GEN_VXFORM_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207, \
+GEN_VXFORM_TRANS_DUAL(vmrgew, PPC_NONE, PPC2_ALTIVEC_207,
                 vextuwrx, PPC_NONE, PPC2_ISA300)
 
 #define GEN_VXRFORM1(opname, name, str, opc2, opc3)                     \
-- 
2.7.4



