* [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4
@ 2016-09-28  5:31 Nikunj A Dadhania
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 1/9] target-ppc: Implement mfvsrld instruction Nikunj A Dadhania
                   ` (10 more replies)
  0 siblings, 11 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh

This series adds 7 new instructions for POWER9 (ISA 3.0). It also uses the
newer QEMU load/store TCG helpers to optimize stxvw4x and lxvw4x.

GCC was adding an epilogue for every VSX instruction, which changed the
behaviour under test. To test the load vector instructions, I used
mfvsrld/mfvsrd to move the VSR contents into a register. To test the store
vector instructions, I used mtvsrdd. This got rid of the epilogue added by
GCC. I tried adding the test cases to kvm-unit-tests, but executing VSX
instructions there results in a CPU exception. I will debug that later. I
will send the test code and the steps to run it as a reply to this email.

Patches:
    01:  mfvsrld: Move From VSR Lower Doubleword
    02:  mtvsrdd: Move To VSR Double Doubleword
    03:  mtvsrws: Move To VSR Word & Splat
    04:  lxvw4x: improve implementation
    05:  stxvw4x: improve implementation
    06:  lxvh8x: Load VSX Vector Halfword*8
    07:  stxvh8x: Store VSX Vector Halfword*8
    08:  lxvb16x: Load VSX Vector Byte*16
    09:  stxvb16x: Store VSX Vector Byte*16

Changelog:
v3:
* Added 3 new VSR instructions.
* Fixed all the vector load/store instructions for BE/LE.
* Added detailed commit messages to patches.
* Dropped deposit32x2 and implemented it using tcg ops

v2:
* Fix lxvw4x/stxvw4x translation, as the LE and BE paths were similar but
  one was done in TCG and the other as a helper
* Rename bswap32x2 to deposit32x2, as it does not need to
  swap the 32-bit content
* Fixed the stxvh8x bug that David pointed out.

v1: 
* More load/store cleanups in byte reverse routines
* ld64/st64 converted to newer macro and updated call sites
* Cleanup load with reservation and store conditional
* Return invalid random for darn instruction

v0:
* darn - read /dev/random to get the random number
* xxspltib - make it PPC64 only
* Consolidate load/store operations and use macros to generate qemu_st/ld
* Simplify load/store vsx endian manipulation

Nikunj A Dadhania (6):
  target-ppc: improve lxvw4x implementation
  target-ppc: improve stxvw4x implementation
  target-ppc: add lxvh8x instruction
  target-ppc: add stxvh8x instruction
  target-ppc: add lxvb16x instruction
  target-ppc: add stxvb16x instruction

Ravi Bangoria (3):
  target-ppc: Implement mfvsrld instruction
  target-ppc: Implement mtvsrdd instruction
  target-ppc: Implement mtvsrws instruction

 target-ppc/helper.h                 |   1 +
 target-ppc/mem_helper.c             |   6 +
 target-ppc/translate/vsx-impl.inc.c | 214 ++++++++++++++++++++++++++++++++----
 target-ppc/translate/vsx-ops.inc.c  |   7 ++
 4 files changed, 204 insertions(+), 24 deletions(-)

-- 
2.7.4


* [Qemu-devel] [PATCH v4 1/9] target-ppc: Implement mfvsrld instruction
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:03   ` Richard Henderson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction Nikunj A Dadhania
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh, Ravi Bangoria

From: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>

mfvsrld: Move From VSR Lower Doubleword

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 17 +++++++++++++++++
 target-ppc/translate/vsx-ops.inc.c  |  1 +
 2 files changed, 18 insertions(+)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index eee6052..b669e8c 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -217,6 +217,23 @@ static void gen_##name(DisasContext *ctx)                       \
 MV_VSRD(mfvsrd, cpu_gpr[rA(ctx->opcode)], cpu_vsrh(xS(ctx->opcode)))
 MV_VSRD(mtvsrd, cpu_vsrh(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)])
 
+static void gen_mfvsrld(DisasContext *ctx)
+{
+    if (xS(ctx->opcode) < 32) {
+        if (unlikely(!ctx->vsx_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VSXU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+
+    tcg_gen_mov_i64(cpu_gpr[rA(ctx->opcode)], cpu_vsrl(xS(ctx->opcode)));
+}
+
 #endif
 
 static void gen_xxpermdi(DisasContext *ctx)
diff --git a/target-ppc/translate/vsx-ops.inc.c b/target-ppc/translate/vsx-ops.inc.c
index 414b73b..3b296f8 100644
--- a/target-ppc/translate/vsx-ops.inc.c
+++ b/target-ppc/translate/vsx-ops.inc.c
@@ -22,6 +22,7 @@ GEN_HANDLER_E(mtvsrwz, 0x1F, 0x13, 0x07, 0x0000F800, PPC_NONE, PPC2_VSX207),
 #if defined(TARGET_PPC64)
 GEN_HANDLER_E(mfvsrd, 0x1F, 0x13, 0x01, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mtvsrd, 0x1F, 0x13, 0x05, 0x0000F800, PPC_NONE, PPC2_VSX207),
+GEN_HANDLER_E(mfvsrld, 0X1F, 0x13, 0x09, 0x0000F800, PPC_NONE, PPC2_ISA300),
 #endif
 
 #define GEN_XX1FORM(name, opc2, opc3, fl2)                              \
-- 
2.7.4


* [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 1/9] target-ppc: Implement mfvsrld instruction Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:01   ` Richard Henderson
  2016-09-29  1:29   ` David Gibson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 3/9] target-ppc: Implement mtvsrws instruction Nikunj A Dadhania
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh, Ravi Bangoria

From: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>

mtvsrdd: Move To VSR Double Doubleword
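
For readers unfamiliar with the instruction, the behaviour the patch below
implements can be modelled in plain C. This is only a sketch and not QEMU
code: `vsr_model` and `mtvsrdd_model` are hypothetical names, and the GPR
file is assumed to be a simple array. The point it shows is the special
case where RA is 0: the upper doubleword is then zero, not the contents
of GPR 0.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit VSR as two doublewords. */
typedef struct {
    uint64_t hi;   /* doubleword 0 (cpu_vsrh in the patch) */
    uint64_t lo;   /* doubleword 1 (cpu_vsrl in the patch) */
} vsr_model;

/* Sketch of mtvsrdd XT,RA,RB: hi = (RA == 0) ? 0 : GPR[RA], lo = GPR[RB]. */
static vsr_model mtvsrdd_model(const uint64_t gpr[32], unsigned ra, unsigned rb)
{
    vsr_model v;

    v.hi = (ra == 0) ? 0 : gpr[ra];   /* RA field 0 means the value zero */
    v.lo = gpr[rb];
    return v;
}
```

The translator makes the same decision at translation time with
`!rA(ctx->opcode)`, so no runtime branch is emitted.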

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 23 +++++++++++++++++++++++
 target-ppc/translate/vsx-ops.inc.c  |  1 +
 2 files changed, 24 insertions(+)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index b669e8c..f9db1d4 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -234,6 +234,29 @@ static void gen_mfvsrld(DisasContext *ctx)
     tcg_gen_mov_i64(cpu_gpr[rA(ctx->opcode)], cpu_vsrl(xS(ctx->opcode)));
 }
 
+static void gen_mtvsrdd(DisasContext *ctx)
+{
+    if (xT(ctx->opcode) < 32) {
+        if (unlikely(!ctx->vsx_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VSXU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+
+    if (!rA(ctx->opcode)) {
+        tcg_gen_movi_i64(cpu_vsrh(xT(ctx->opcode)), 0);
+    } else {
+        tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)]);
+    }
+
+    tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_gpr[rB(ctx->opcode)]);
+}
+
 #endif
 
 static void gen_xxpermdi(DisasContext *ctx)
diff --git a/target-ppc/translate/vsx-ops.inc.c b/target-ppc/translate/vsx-ops.inc.c
index 3b296f8..1287973 100644
--- a/target-ppc/translate/vsx-ops.inc.c
+++ b/target-ppc/translate/vsx-ops.inc.c
@@ -23,6 +23,7 @@ GEN_HANDLER_E(mtvsrwz, 0x1F, 0x13, 0x07, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mfvsrd, 0x1F, 0x13, 0x01, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mtvsrd, 0x1F, 0x13, 0x05, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mfvsrld, 0X1F, 0x13, 0x09, 0x0000F800, PPC_NONE, PPC2_ISA300),
+GEN_HANDLER_E(mtvsrdd, 0X1F, 0x13, 0x0D, 0x0, PPC_NONE, PPC2_ISA300),
 #endif
 
 #define GEN_XX1FORM(name, opc2, opc3, fl2)                              \
-- 
2.7.4


* [Qemu-devel] [PATCH v4 3/9] target-ppc: Implement mtvsrws instruction
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 1/9] target-ppc: Implement mfvsrld instruction Nikunj A Dadhania
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:04   ` Richard Henderson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation Nikunj A Dadhania
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh, Ravi Bangoria

From: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>

mtvsrws: Move To VSR Word & Splat
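
As an illustration of the splat, here is a plain C model of what the andi,
deposit and mov sequence in the patch below computes. It is a sketch under
assumptions, not QEMU code: `vsr_words` and `mtvsrws_model` are hypothetical
names.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit VSR as two doublewords. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} vsr_words;

/* Sketch of mtvsrws: copy the low 32 bits of RA into all four words. */
static vsr_words mtvsrws_model(uint64_t ra)
{
    uint64_t w = ra & 0xFFFFFFFFull;   /* tcg_gen_andi_i64(tmp1, ra, ...) */
    uint64_t d = (w << 32) | w;        /* deposit w into bits 63:32 of w  */
    vsr_words v = { d, d };            /* both doublewords get the pair   */

    return v;
}
```

With ra = 0x1122334455667788, both doublewords become 0x5566778855667788.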

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 23 +++++++++++++++++++++++
 target-ppc/translate/vsx-ops.inc.c  |  1 +
 2 files changed, 24 insertions(+)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index f9db1d4..74d0533 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -257,6 +257,29 @@ static void gen_mtvsrdd(DisasContext *ctx)
     tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_gpr[rB(ctx->opcode)]);
 }
 
+static void gen_mtvsrws(DisasContext *ctx)
+{
+    TCGv_i64 tmp1 = tcg_temp_new_i64();
+
+    if (xT(ctx->opcode) < 32) {
+        if (unlikely(!ctx->vsx_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VSXU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+
+    tcg_gen_andi_i64(tmp1, cpu_gpr[rA(ctx->opcode)], 0xFFFFFFFF);
+    tcg_gen_deposit_i64(cpu_vsrl(xT(ctx->opcode)), tmp1, tmp1, 32, 32);
+    tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_vsrl(xT(ctx->opcode)));
+
+    tcg_temp_free_i64(tmp1);
+}
+
 #endif
 
 static void gen_xxpermdi(DisasContext *ctx)
diff --git a/target-ppc/translate/vsx-ops.inc.c b/target-ppc/translate/vsx-ops.inc.c
index 1287973..d5f5b87 100644
--- a/target-ppc/translate/vsx-ops.inc.c
+++ b/target-ppc/translate/vsx-ops.inc.c
@@ -24,6 +24,7 @@ GEN_HANDLER_E(mfvsrd, 0x1F, 0x13, 0x01, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mtvsrd, 0x1F, 0x13, 0x05, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mfvsrld, 0X1F, 0x13, 0x09, 0x0000F800, PPC_NONE, PPC2_ISA300),
 GEN_HANDLER_E(mtvsrdd, 0X1F, 0x13, 0x0D, 0x0, PPC_NONE, PPC2_ISA300),
+GEN_HANDLER_E(mtvsrws, 0x1F, 0x13, 0x0C, 0x0000F800, PPC_NONE, PPC2_ISA300),
 #endif
 
 #define GEN_XX1FORM(name, opc2, opc3, fl2)                              \
-- 
2.7.4


* [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (2 preceding siblings ...)
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 3/9] target-ppc: Implement mtvsrws instruction Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:07   ` Richard Henderson
  2016-09-29  1:38   ` David Gibson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 5/9] target-ppc: improve stxvw4x implementation Nikunj A Dadhania
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh

Load 8 bytes at a time and manipulate them.

Big-Endian Storage
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+

Little-Endian Storage
+-------------+-------------+-------------+-------------+
| 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
+-------------+-------------+-------------+-------------+

Vector load results in:
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+
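
The little-endian path can be checked against the diagrams with a small C
model. This is a sketch, not the QEMU implementation: `load_le64` stands in
for a MO_LEQ load and `lxvw4x_le_dword` for the shri + deposit pair, and
both names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for a MO_LEQ load: 8 bytes read as one
 * little-endian 64-bit value. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Model of one doubleword of lxvw4x in LE mode: after the 64-bit LE load
 * each word is already in register byte order, but the two words are in
 * the wrong positions, so swap them. */
static uint64_t lxvw4x_le_dword(const uint8_t *p)
{
    uint64_t t0 = load_le64(p);
    uint64_t t1 = t0 >> 32;     /* tcg_gen_shri_i64(t1, t0, 32) */

    return (t0 << 32) | t1;     /* deposit low word of t0 above t1 */
}
```

Feeding it the first 8 bytes of the Little-Endian Storage row
(33 22 11 00 77 66 55 44) gives 0x0011223344556677, which is the upper
doubleword of the vector load result.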

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index 74d0533..1eca042 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -75,7 +75,6 @@ static void gen_lxvdsx(DisasContext *ctx)
 static void gen_lxvw4x(DisasContext *ctx)
 {
     TCGv EA;
-    TCGv_i64 tmp;
     TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
     TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
     if (unlikely(!ctx->vsx_enabled)) {
@@ -84,22 +83,28 @@ static void gen_lxvw4x(DisasContext *ctx)
     }
     gen_set_access_type(ctx, ACCESS_INT);
     EA = tcg_temp_new();
-    tmp = tcg_temp_new_i64();
 
     gen_addr_reg_index(ctx, EA);
-    gen_qemu_ld32u_i64(ctx, tmp, EA);
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_ld32u_i64(ctx, xth, EA);
-    tcg_gen_deposit_i64(xth, xth, tmp, 32, 32);
-
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_ld32u_i64(ctx, tmp, EA);
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_ld32u_i64(ctx, xtl, EA);
-    tcg_gen_deposit_i64(xtl, xtl, tmp, 32, 32);
-
+    if (ctx->le_mode) {
+        TCGv_i64 t0, t1;
+
+        t0 = tcg_temp_new_i64();
+        t1 = tcg_temp_new_i64();
+        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
+        tcg_gen_shri_i64(t1, t0, 32);
+        tcg_gen_deposit_i64(xth, t1, t0, 32, 32);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
+        tcg_gen_shri_i64(t1, t0, 32);
+        tcg_gen_deposit_i64(xtl, t1, t0, 32, 32);
+        tcg_temp_free_i64(t0);
+        tcg_temp_free_i64(t1);
+    } else {
+        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
+    }
     tcg_temp_free(EA);
-    tcg_temp_free_i64(tmp);
 }
 
 #define VSX_STORE_SCALAR(name, operation)                     \
-- 
2.7.4


* [Qemu-devel] [PATCH v4 5/9] target-ppc: improve stxvw4x implementation
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (3 preceding siblings ...)
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:08   ` Richard Henderson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction Nikunj A Dadhania
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh

Manipulate the data and store 8 bytes at a time instead of 4 bytes.

Vector:
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+

Store results in the following:

Big-Endian Storage
+-------------+-------------+-------------+-------------+
| 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
+-------------+-------------+-------------+-------------+

Little-Endian Storage
+-------------+-------------+-------------+-------------+
| 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
+-------------+-------------+-------------+-------------+

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index 1eca042..9fdab5f 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -147,7 +147,8 @@ static void gen_stxvd2x(DisasContext *ctx)
 
 static void gen_stxvw4x(DisasContext *ctx)
 {
-    TCGv_i64 tmp;
+    TCGv_i64 xsh = cpu_vsrh(xS(ctx->opcode));
+    TCGv_i64 xsl = cpu_vsrl(xS(ctx->opcode));
     TCGv EA;
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -156,21 +157,25 @@ static void gen_stxvw4x(DisasContext *ctx)
     gen_set_access_type(ctx, ACCESS_INT);
     EA = tcg_temp_new();
     gen_addr_reg_index(ctx, EA);
-    tmp = tcg_temp_new_i64();
-
-    tcg_gen_shri_i64(tmp, cpu_vsrh(xS(ctx->opcode)), 32);
-    gen_qemu_st32_i64(ctx, tmp, EA);
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_st32_i64(ctx, cpu_vsrh(xS(ctx->opcode)), EA);
-
-    tcg_gen_shri_i64(tmp, cpu_vsrl(xS(ctx->opcode)), 32);
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_st32_i64(ctx, tmp, EA);
-    tcg_gen_addi_tl(EA, EA, 4);
-    gen_qemu_st32_i64(ctx, cpu_vsrl(xS(ctx->opcode)), EA);
+    if (ctx->le_mode) {
+        TCGv_i64 t0 = tcg_temp_new_i64();
+        TCGv_i64 t1 = tcg_temp_new_i64();
 
+        tcg_gen_shri_i64(t0, xsh, 32);
+        tcg_gen_deposit_i64(t1, t0, xsh, 32, 32);
+        tcg_gen_qemu_st_i64(t1, EA, ctx->mem_idx, MO_LEQ);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_shri_i64(t0, xsl, 32);
+        tcg_gen_deposit_i64(t1, t0, xsl, 32, 32);
+        tcg_gen_qemu_st_i64(t1, EA, ctx->mem_idx, MO_LEQ);
+        tcg_temp_free_i64(t0);
+        tcg_temp_free_i64(t1);
+    } else {
+        tcg_gen_qemu_st_i64(xsh, EA, ctx->mem_idx, MO_BEQ);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);
+    }
     tcg_temp_free(EA);
-    tcg_temp_free_i64(tmp);
 }
 
 #define MV_VSRW(name, tcgop1, tcgop2, target, source)           \
-- 
2.7.4


* [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (4 preceding siblings ...)
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 5/9] target-ppc: improve stxvw4x implementation Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:12   ` Richard Henderson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 7/9] target-ppc: add stxvh8x instruction Nikunj A Dadhania
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh

lxvh8x:  Load VSX Vector Halfword*8

Big-Endian Storage
+-------+-------+-------+-------+-------+-------+-------+-------+
| 00 01 | 10 11 | 20 21 | 30 31 | 40 41 | 50 51 | 60 61 | 70 71 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Little-Endian Storage
+-------+-------+-------+-------+-------+-------+-------+-------+
| 01 00 | 11 10 | 21 20 | 31 30 | 41 40 | 51 50 | 61 60 | 71 70 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Vector load results in:
+-------+-------+-------+-------+-------+-------+-------+-------+
| 00 01 | 10 11 | 20 21 | 30 31 | 40 41 | 50 51 | 60 61 | 70 71 |
+-------+-------+-------+-------+-------+-------+-------+-------+
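
The little-endian case can be verified with a standalone C model. This is a
sketch only: `load_be64` stands in for a MO_BEQ load and `lxvh8x_le_dword`
is a hypothetical name. The `bswap16x4` body mirrors the helper added to
mem_helper.c in this patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for a MO_BEQ load: 8 bytes read as one
 * big-endian 64-bit value. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Swap the two bytes inside each of the four halfwords. */
static uint64_t bswap16x4(uint64_t x)
{
    uint64_t m = 0x00ff00ff00ff00ffull;

    return ((x & m) << 8) | ((x >> 8) & m);
}

/* Model of one doubleword of lxvh8x in LE mode: the BE load keeps the
 * halfwords in order, then each halfword is byte-swapped. */
static uint64_t lxvh8x_le_dword(const uint8_t *p)
{
    return bswap16x4(load_be64(p));
}
```

For the first 8 bytes of the Little-Endian Storage row
(01 00 11 10 21 20 31 30) this yields 0x0001101120213031, matching the
first four halfwords of the vector load result.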

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/helper.h                 |  1 +
 target-ppc/mem_helper.c             |  6 ++++++
 target-ppc/translate/vsx-impl.inc.c | 28 ++++++++++++++++++++++++++++
 target-ppc/translate/vsx-ops.inc.c  |  1 +
 4 files changed, 36 insertions(+)

diff --git a/target-ppc/helper.h b/target-ppc/helper.h
index a1c2962..9689000 100644
--- a/target-ppc/helper.h
+++ b/target-ppc/helper.h
@@ -298,6 +298,7 @@ DEF_HELPER_2(mtvscr, void, env, avr)
 DEF_HELPER_3(lvebx, void, env, avr, tl)
 DEF_HELPER_3(lvehx, void, env, avr, tl)
 DEF_HELPER_3(lvewx, void, env, avr, tl)
+DEF_HELPER_1(bswap16x4, i64, i64)
 DEF_HELPER_3(stvebx, void, env, avr, tl)
 DEF_HELPER_3(stvehx, void, env, avr, tl)
 DEF_HELPER_3(stvewx, void, env, avr, tl)
diff --git a/target-ppc/mem_helper.c b/target-ppc/mem_helper.c
index 6548715..29c7b5b 100644
--- a/target-ppc/mem_helper.c
+++ b/target-ppc/mem_helper.c
@@ -285,6 +285,12 @@ STVE(stvewx, cpu_stl_data_ra, bswap32, u32)
 #undef I
 #undef LVE
 
+uint64_t helper_bswap16x4(uint64_t x)
+{
+    uint64_t m = 0x00ff00ff00ff00ffull;
+    return ((x & m) << 8) | ((x >> 8) & m);
+}
+
 #undef HI_IDX
 #undef LO_IDX
 
diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index 9fdab5f..51f3dcb 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -107,6 +107,34 @@ static void gen_lxvw4x(DisasContext *ctx)
     tcg_temp_free(EA);
 }
 
+static void gen_lxvh8x(DisasContext *ctx)
+{
+    TCGv EA;
+    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
+    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
+
+    if (unlikely(!ctx->vsx_enabled)) {
+        gen_exception(ctx, POWERPC_EXCP_VSXU);
+        return;
+    }
+    gen_set_access_type(ctx, ACCESS_INT);
+    EA = tcg_temp_new();
+    gen_addr_reg_index(ctx, EA);
+
+    if (ctx->le_mode) {
+        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
+        gen_helper_bswap16x4(xth, xth);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
+        gen_helper_bswap16x4(xtl, xtl);
+    } else {
+        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
+    }
+    tcg_temp_free(EA);
+}
+
 #define VSX_STORE_SCALAR(name, operation)                     \
 static void gen_##name(DisasContext *ctx)                     \
 {                                                             \
diff --git a/target-ppc/translate/vsx-ops.inc.c b/target-ppc/translate/vsx-ops.inc.c
index d5f5b87..c52e6ff 100644
--- a/target-ppc/translate/vsx-ops.inc.c
+++ b/target-ppc/translate/vsx-ops.inc.c
@@ -7,6 +7,7 @@ GEN_HANDLER_E(lxsspx, 0x1F, 0x0C, 0x10, 0, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(lxvd2x, 0x1F, 0x0C, 0x1A, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(lxvdsx, 0x1F, 0x0C, 0x0A, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(lxvw4x, 0x1F, 0x0C, 0x18, 0, PPC_NONE, PPC2_VSX),
+GEN_HANDLER_E(lxvh8x, 0x1F, 0x0C, 0x19, 0, PPC_NONE,  PPC2_ISA300),
 
 GEN_HANDLER_E(stxsdx, 0x1F, 0xC, 0x16, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(stxsibx, 0x1F, 0xD, 0x1C, 0, PPC_NONE, PPC2_ISA300),
-- 
2.7.4


* [Qemu-devel] [PATCH v4 7/9] target-ppc: add stxvh8x instruction
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (5 preceding siblings ...)
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:13   ` Richard Henderson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 8/9] target-ppc: add lxvb16x instruction Nikunj A Dadhania
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh

stxvh8x:  Store VSX Vector Halfword*8

Vector:
+-------+-------+-------+-------+-------+-------+-------+-------+
| 00 01 | 10 11 | 20 21 | 30 31 | 40 41 | 50 51 | 60 61 | 70 71 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Store results in the following:

Big-Endian Storage
+-------+-------+-------+-------+-------+-------+-------+-------+
| 00 01 | 10 11 | 20 21 | 30 31 | 40 41 | 50 51 | 60 61 | 70 71 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Little-Endian Storage
+-------+-------+-------+-------+-------+-------+-------+-------+
| 01 00 | 11 10 | 21 20 | 31 30 | 41 40 | 51 50 | 61 60 | 71 70 |
+-------+-------+-------+-------+-------+-------+-------+-------+

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 27 +++++++++++++++++++++++++++
 target-ppc/translate/vsx-ops.inc.c  |  1 +
 2 files changed, 28 insertions(+)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index 51f3dcb..35168af 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -206,6 +206,33 @@ static void gen_stxvw4x(DisasContext *ctx)
     tcg_temp_free(EA);
 }
 
+static void gen_stxvh8x(DisasContext *ctx)
+{
+    TCGv_i64 xsh = cpu_vsrh(xS(ctx->opcode));
+    TCGv_i64 xsl = cpu_vsrl(xS(ctx->opcode));
+    TCGv EA;
+
+    if (unlikely(!ctx->vsx_enabled)) {
+        gen_exception(ctx, POWERPC_EXCP_VSXU);
+        return;
+    }
+    gen_set_access_type(ctx, ACCESS_INT);
+    EA = tcg_temp_new();
+    gen_addr_reg_index(ctx, EA);
+    if (ctx->le_mode) {
+        gen_helper_bswap16x4(xsh, xsh);
+        tcg_gen_qemu_st_i64(xsh, EA, ctx->mem_idx, MO_BEQ);
+        tcg_gen_addi_tl(EA, EA, 8);
+        gen_helper_bswap16x4(xsl, xsl);
+        tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);
+    } else {
+        tcg_gen_qemu_st_i64(xsh, EA, ctx->mem_idx, MO_BEQ);
+        tcg_gen_addi_tl(EA, EA, 8);
+        tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);
+    }
+    tcg_temp_free(EA);
+}
+
 #define MV_VSRW(name, tcgop1, tcgop2, target, source)           \
 static void gen_##name(DisasContext *ctx)                       \
 {                                                               \
diff --git a/target-ppc/translate/vsx-ops.inc.c b/target-ppc/translate/vsx-ops.inc.c
index c52e6ff..17975ec 100644
--- a/target-ppc/translate/vsx-ops.inc.c
+++ b/target-ppc/translate/vsx-ops.inc.c
@@ -16,6 +16,7 @@ GEN_HANDLER_E(stxsiwx, 0x1F, 0xC, 0x04, 0, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(stxsspx, 0x1F, 0xC, 0x14, 0, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(stxvd2x, 0x1F, 0xC, 0x1E, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(stxvw4x, 0x1F, 0xC, 0x1C, 0, PPC_NONE, PPC2_VSX),
+GEN_HANDLER_E(stxvh8x, 0x1F, 0x0C, 0x1D, 0, PPC_NONE,  PPC2_ISA300),
 
 GEN_HANDLER_E(mfvsrwz, 0x1F, 0x13, 0x03, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mtvsrwa, 0x1F, 0x13, 0x06, 0x0000F800, PPC_NONE, PPC2_VSX207),
-- 
2.7.4


* [Qemu-devel] [PATCH v4 8/9] target-ppc: add lxvb16x instruction
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (6 preceding siblings ...)
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 7/9] target-ppc: add stxvh8x instruction Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:13   ` Richard Henderson
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 9/9] target-ppc: add stxvb16x instruction Nikunj A Dadhania
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh

lxvb16x: Load VSX Vector Byte*16

Little/Big-endian Storage
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Vector load results in:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
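
Because the elements are single bytes, no swapping is needed in either
mode: two big-endian 64-bit loads place memory byte i into vector byte i.
A minimal C model of that, with the hypothetical names `load_be64_bytes`
and `lxvb16x_model` (neither is QEMU code):

```c
#include <stdint.h>

/* Hypothetical stand-in for a MO_BEQ load. */
static uint64_t load_be64_bytes(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Model of lxvb16x: the same two loads are used for LE and BE guests. */
static void lxvb16x_model(const uint8_t mem[16], uint64_t *hi, uint64_t *lo)
{
    *hi = load_be64_bytes(mem);       /* bytes 0..7  -> upper doubleword */
    *lo = load_be64_bytes(mem + 8);   /* bytes 8..15 -> lower doubleword */
}
```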

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index 35168af..54e0c1e 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -135,6 +135,25 @@ static void gen_lxvh8x(DisasContext *ctx)
     tcg_temp_free(EA);
 }
 
+static void gen_lxvb16x(DisasContext *ctx)
+{
+    TCGv EA;
+    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
+    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
+
+    if (unlikely(!ctx->vsx_enabled)) {
+        gen_exception(ctx, POWERPC_EXCP_VSXU);
+        return;
+    }
+    gen_set_access_type(ctx, ACCESS_INT);
+    EA = tcg_temp_new();
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
+    tcg_gen_addi_tl(EA, EA, 8);
+    tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
+    tcg_temp_free(EA);
+}
+
 #define VSX_STORE_SCALAR(name, operation)                     \
 static void gen_##name(DisasContext *ctx)                     \
 {                                                             \
-- 
2.7.4


* [Qemu-devel] [PATCH v4 9/9] target-ppc: add stxvb16x instruction
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (7 preceding siblings ...)
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 8/9] target-ppc: add lxvb16x instruction Nikunj A Dadhania
@ 2016-09-28  5:31 ` Nikunj A Dadhania
  2016-09-28 16:13   ` Richard Henderson
  2016-09-28  5:38 ` [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
  2016-09-28  9:28 ` [Qemu-devel] [Qemu-ppc] " Thomas Huth
  10 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:31 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, nikunj, benh

stxvb16x: Store VSX Vector Byte*16

Vector:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Store results in the following:

Little/Big-endian Storage
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
---
 target-ppc/translate/vsx-impl.inc.c | 19 +++++++++++++++++++
 target-ppc/translate/vsx-ops.inc.c  |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
index 54e0c1e..31b3d45 100644
--- a/target-ppc/translate/vsx-impl.inc.c
+++ b/target-ppc/translate/vsx-impl.inc.c
@@ -252,6 +252,25 @@ static void gen_stxvh8x(DisasContext *ctx)
     tcg_temp_free(EA);
 }
 
+static void gen_stxvb16x(DisasContext *ctx)
+{
+    TCGv_i64 xsh = cpu_vsrh(xS(ctx->opcode));
+    TCGv_i64 xsl = cpu_vsrl(xS(ctx->opcode));
+    TCGv EA;
+
+    if (unlikely(!ctx->vsx_enabled)) {
+        gen_exception(ctx, POWERPC_EXCP_VSXU);
+        return;
+    }
+    gen_set_access_type(ctx, ACCESS_INT);
+    EA = tcg_temp_new();
+    gen_addr_reg_index(ctx, EA);
+    tcg_gen_qemu_st_i64(xsh, EA, ctx->mem_idx, MO_BEQ);
+    tcg_gen_addi_tl(EA, EA, 8);
+    tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);
+    tcg_temp_free(EA);
+}
+
 #define MV_VSRW(name, tcgop1, tcgop2, target, source)           \
 static void gen_##name(DisasContext *ctx)                       \
 {                                                               \
diff --git a/target-ppc/translate/vsx-ops.inc.c b/target-ppc/translate/vsx-ops.inc.c
index 17975ec..10eb4b9 100644
--- a/target-ppc/translate/vsx-ops.inc.c
+++ b/target-ppc/translate/vsx-ops.inc.c
@@ -8,6 +8,7 @@ GEN_HANDLER_E(lxvd2x, 0x1F, 0x0C, 0x1A, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(lxvdsx, 0x1F, 0x0C, 0x0A, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(lxvw4x, 0x1F, 0x0C, 0x18, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(lxvh8x, 0x1F, 0x0C, 0x19, 0, PPC_NONE,  PPC2_ISA300),
+GEN_HANDLER_E(lxvb16x, 0x1F, 0x0C, 0x1B, 0, PPC_NONE, PPC2_ISA300),
 
 GEN_HANDLER_E(stxsdx, 0x1F, 0xC, 0x16, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(stxsibx, 0x1F, 0xD, 0x1C, 0, PPC_NONE, PPC2_ISA300),
@@ -17,6 +18,7 @@ GEN_HANDLER_E(stxsspx, 0x1F, 0xC, 0x14, 0, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(stxvd2x, 0x1F, 0xC, 0x1E, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(stxvw4x, 0x1F, 0xC, 0x1C, 0, PPC_NONE, PPC2_VSX),
 GEN_HANDLER_E(stxvh8x, 0x1F, 0x0C, 0x1D, 0, PPC_NONE,  PPC2_ISA300),
+GEN_HANDLER_E(stxvb16x, 0x1F, 0x0C, 0x1F, 0, PPC_NONE, PPC2_ISA300),
 
 GEN_HANDLER_E(mfvsrwz, 0x1F, 0x13, 0x03, 0x0000F800, PPC_NONE, PPC2_VSX207),
 GEN_HANDLER_E(mtvsrwa, 0x1F, 0x13, 0x06, 0x0000F800, PPC_NONE, PPC2_VSX207),
-- 
2.7.4

* Re: [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (8 preceding siblings ...)
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 9/9] target-ppc: add stxvb16x instruction Nikunj A Dadhania
@ 2016-09-28  5:38 ` Nikunj A Dadhania
  2016-09-28  9:28 ` [Qemu-devel] [Qemu-ppc] " Thomas Huth
  10 siblings, 0 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28  5:38 UTC (permalink / raw)
  To: qemu-ppc, david, rth; +Cc: qemu-devel, benh

[-- Attachment #1: Type: text/plain, Size: 1677 bytes --]

Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> writes:

> This series contains 7 new instructions for POWER9 ISA3.0
> Use newer qemu load/store tcg helpers and optimize stxvw4x and lxvw4x.
>
> GCC was adding epilogue for every VSX instructions causing change in 
> behaviour. For testing the load vector instructions used mfvsrld/mfvsrd 
> for loading vsr to register. And for testing store vector, used mtvsrdd 
> instructions. This helped in getting rid of the epilogue added by gcc. Tried 
> adding the test cases to kvm-unit-tests, but executing vsx instructions 
> results in cpu exception. Will debug that later. I will send the test code 
> and steps to execute as reply to this email.

The source code for stxv_x.c and lxv_x.c is attached; the steps to use
them follow:

Compile using the IBM Advance Toolchain[1]:
===========================================
/opt/at10.0/bin/powerpc64-linux-gnu-gcc -static -O3 lxv_x.c -o be_lxv_x
/opt/at10.0/bin/powerpc64-linux-gnu-gcc -static -O3 stxv_x.c -o be_stxv_x
/opt/at10.0/bin/powerpc64le-linux-gnu-gcc -static -O3 lxv_x.c -o le_lxv_x
/opt/at10.0/bin/powerpc64le-linux-gnu-gcc -static -O3 stxv_x.c -o le_stxv_x

Run the following to test the instructions:
===========================================

for i in lxv_x stxv_x
do
    echo "Running ... $i"
    echo ">>>>>>>>>>>>>>>> LE LE LE >>>>>>>>>>>>>>"
    ../qemu/ppc64le-linux-user/qemu-ppc64le   -cpu POWER9 le_${i}
    echo ">>>>>>>>>>>>>>>> BE BE BE >>>>>>>>>>>>>>"
    ../qemu/ppc64-linux-user/qemu-ppc64   -cpu POWER9 be_${i}
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>"
done

Regards
Nikunj

1. ftp://ftp.unicamp.br/pub/linuxpatch/toolchain/at/redhat/Fedora22


[-- Attachment #2: stxv_x.c --]
[-- Type: text/plain, Size: 814 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

static void print16x1(uint8_t *p)
{
  int i;
  for(i = 0; i < 16; i++)
    printf(" %02X ", p[i]);
  printf("\n");
}

int main(void) {
  __vector uint8_t vrt8;
  uint8_t rb8[16];
  unsigned long hi = 0x0001020310111213;
  unsigned long lo = 0x2021222330313233;

  /* "b" keeps hi out of r0, which mtvsrdd reads as zero in the RA slot;
     the "memory" clobber tells gcc that the asm writes rb8. */
  asm volatile("mtvsrdd %x0, %2, %3;"
               "stxvw4x %x0, 0, %1;"
               : "=ws"(vrt8) : "r"(&rb8), "b"(hi), "r"(lo) : "memory");
  print16x1(rb8);

  asm volatile("mtvsrdd %x0, %2, %3;"
               "stxvh8x %x0, 0, %1;"
               : "=ws"(vrt8) : "r"(&rb8), "b"(hi), "r"(lo) : "memory");
  print16x1(rb8);

  asm volatile("mtvsrdd %x0, %2, %3;"
               "stxvb16x %x0, 0, %1;"
               : "=ws"(vrt8) : "r"(&rb8), "b"(hi), "r"(lo) : "memory");
  print16x1(rb8);

  return EXIT_SUCCESS;
}

[-- Attachment #3: lxv_x.c --]
[-- Type: text/plain, Size: 1563 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

int main(void) {
  __vector uint8_t vrt8;
  unsigned long lo, hi;

#if __BYTE_ORDER == __LITTLE_ENDIAN
  uint8_t rb32[16] = {0x03, 0x02, 0x01, 0x00, 0x13, 0x12, 0x11, 0x10,
                      0x23, 0x22, 0x21, 0x20, 0x33, 0x32, 0x31, 0x30};
  uint8_t rb16[16] = {0x01, 0x00, 0x11, 0x10, 0x21, 0x20, 0x31, 0x30,
                      0x41, 0x40, 0x51, 0x50, 0x61, 0x60, 0x71, 0x70};
#else
  uint8_t rb32[16] = {0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
                      0x20, 0x21, 0x22, 0x23, 0x30, 0x31, 0x32, 0x33};
  uint8_t rb16[16] = {0x00, 0x01, 0x10, 0x11, 0x20, 0x21, 0x30, 0x31,
                      0x40, 0x41, 0x50, 0x51, 0x60, 0x61, 0x70, 0x71};
#endif

  uint8_t rb8[16] = {0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7,
                     0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7};

  /* hi and lo are written by mfvsrd/mfvsrld, so they are outputs; the
     "memory" clobber tells gcc that the asm reads the source array. */
  asm volatile("lxvw4x %x0, 0, %3;"
               "mfvsrd %1, %x0;"
               "mfvsrld %2, %x0;"
               : "=ws"(vrt8), "=r"(hi), "=r"(lo) : "r"(&rb32) : "memory");
  printf("lxvw4x:  hi %016lx lo %016lx \n", hi, lo);

  asm volatile("lxvh8x %x0, 0, %3;"
               "mfvsrd %1, %x0;"
               "mfvsrld %2, %x0;"
               : "=ws"(vrt8), "=r"(hi), "=r"(lo) : "r"(&rb16) : "memory");
  printf("lxvh8x:  hi %016lx lo %016lx \n", hi, lo);

  asm volatile("lxvb16x %x0, 0, %3;"
               "mfvsrd %1, %x0;"
               "mfvsrld %2, %x0;"
               : "=ws"(vrt8), "=r"(hi), "=r"(lo) : "r"(&rb8) : "memory");
  printf("lxvb16x: hi %016lx lo %016lx \n", hi, lo);

  return EXIT_SUCCESS;
}


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v4 0/9] POWER9 TCG enablements - part4
  2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
                   ` (9 preceding siblings ...)
  2016-09-28  5:38 ` [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
@ 2016-09-28  9:28 ` Thomas Huth
  2016-09-28 11:34   ` Nikunj A Dadhania
  10 siblings, 1 reply; 33+ messages in thread
From: Thomas Huth @ 2016-09-28  9:28 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david, rth; +Cc: qemu-devel

On 28.09.2016 07:31, Nikunj A Dadhania wrote:
> This series contains 7 new instructions for POWER9 ISA3.0
> Use newer qemu load/store tcg helpers and optimize stxvw4x and lxvw4x.
> 
> GCC was adding epilogue for every VSX instructions causing change in 
> behaviour. For testing the load vector instructions used mfvsrld/mfvsrd 
> for loading vsr to register. And for testing store vector, used mtvsrdd 
> instructions. This helped in getting rid of the epilogue added by gcc. Tried 
> adding the test cases to kvm-unit-tests, but executing vsx instructions 
> results in cpu exception. Will debug that later. I will send the test code 
> and steps to execute as reply to this email.

Did you enable the VEC bit in the MSR before trying to run your
instruction in a kvm-unit-test? If not, that might be the cause.

Alternatively, there is also a tests/tcg/ folder in QEMU ... you could
add a tests/tcg/ppc64 subfolder there.

 Thomas
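
A minimal sketch of what enabling the vector facilities could look like
in a bare-metal test, offered as an illustration rather than as
kvm-unit-tests code. The bit positions follow the numbering used in the
Linux powerpc headers (an assumption to verify against the ISA document),
the helper names are hypothetical, and the sketch sets MSR[VSX] and
MSR[FP] alongside MSR[VEC] on the assumption that VSX instructions
targeting VSR 0-31 are gated by the VSX bit:

```c
#include <stdint.h>

/* MSR facility-enable bits, numbered as in the Linux powerpc headers
 * (an assumption here; check them against your ISA reference). */
#define MSR_FP  (UINT64_C(1) << 13)  /* floating point available */
#define MSR_VSX (UINT64_C(1) << 23)  /* VSX available */
#define MSR_VEC (UINT64_C(1) << 25)  /* Altivec/VMX available */

/* Pure helper: the MSR value with FP, VEC and VSX all enabled. */
static uint64_t msr_enable_vector(uint64_t msr)
{
    return msr | MSR_FP | MSR_VEC | MSR_VSX;
}

#if defined(__powerpc64__)
/* Hypothetical helper for a bare-metal test; needs supervisor state. */
static void enable_vector_facilities(void)
{
    uint64_t msr;

    asm volatile("mfmsr %0" : "=r"(msr));
    msr = msr_enable_vector(msr);
    asm volatile("mtmsrd %0; isync" : : "r"(msr) : "memory");
}
#endif
```

Keeping the bit arithmetic in a pure function makes it checkable on any
host; only the mfmsr/mtmsrd wrapper is PowerPC-specific.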

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v4 0/9] POWER9 TCG enablements - part4
  2016-09-28  9:28 ` [Qemu-devel] [Qemu-ppc] " Thomas Huth
@ 2016-09-28 11:34   ` Nikunj A Dadhania
  0 siblings, 0 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28 11:34 UTC (permalink / raw)
  To: Thomas Huth, qemu-ppc, david, rth; +Cc: qemu-devel

Thomas Huth <thuth@redhat.com> writes:

> On 28.09.2016 07:31, Nikunj A Dadhania wrote:
>> This series contains 7 new instructions for POWER9 ISA3.0
>> Use newer qemu load/store tcg helpers and optimize stxvw4x and lxvw4x.
>> 
>> GCC was adding epilogue for every VSX instructions causing change in 
>> behaviour. For testing the load vector instructions used mfvsrld/mfvsrd 
>> for loading vsr to register. And for testing store vector, used mtvsrdd 
>> instructions. This helped in getting rid of the epilogue added by gcc. Tried 
>> adding the test cases to kvm-unit-tests, but executing vsx instructions 
>> results in cpu exception. Will debug that later. I will send the test code 
>> and steps to execute as reply to this email.
>
> Did you enable the VEC bit in the MSR before trying to run your
> instruction in a kvm-unit-test? If not, that might be the cause.

Yes, that did the trick, thanks :-)

> Alternatively, there is also a tests/tcg/ folder in QEMU ... you could
> add a tests/tcg/ppc64 subfolder there.

Haven't looked at it yet.

Regards
Nikunj

* Re: [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction Nikunj A Dadhania
@ 2016-09-28 16:01   ` Richard Henderson
  2016-09-28 17:06     ` Nikunj A Dadhania
  2016-09-29  1:29   ` David Gibson
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:01 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh, Ravi Bangoria

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> +    if (!rA(ctx->opcode)) {
> +        tcg_gen_movi_i64(cpu_vsrh(xT(ctx->opcode)), 0);
> +    } else {
> +       tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)]);
> +    }

Indentation.  Otherwise,

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

* Re: [Qemu-devel] [PATCH v4 1/9] target-ppc: Implement mfvsrld instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 1/9] target-ppc: Implement mfvsrld instruction Nikunj A Dadhania
@ 2016-09-28 16:03   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:03 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh, Ravi Bangoria

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> From: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
> 
> mfvsrld: Move From VSR Lower Doubleword
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 17 +++++++++++++++++
>  target-ppc/translate/vsx-ops.inc.c  |  1 +
>  2 files changed, 18 insertions(+)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

* Re: [Qemu-devel] [PATCH v4 3/9] target-ppc: Implement mtvsrws instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 3/9] target-ppc: Implement mtvsrws instruction Nikunj A Dadhania
@ 2016-09-28 16:04   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:04 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh, Ravi Bangoria

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> +    tcg_gen_andi_i64(tmp1, cpu_gpr[rA(ctx->opcode)], 0xFFFFFFFF);
> +    tcg_gen_deposit_i64(cpu_vsrl(xT(ctx->opcode)), tmp1, tmp1, 32, 32);
> +    tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_vsrl(xT(ctx->opcode)));

The andi is not necessary; the deposit handles all masking.  Otherwise,


Reviewed-by: Richard Henderson <rth@twiddle.net>


r~
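
To see why the andi is redundant, here is a plain-C model of the deposit
operation, a sketch assuming the usual semantics (replace the len-bit
field at pos in the first operand with the low len bits of the second).
The names deposit64_model, splat_word and splat_word_with_andi are
hypothetical and are not QEMU functions:

```c
#include <stdint.h>

/* Plain-C model of a 64-bit deposit: replace the len-bit field at
 * bit position pos in base with the low len bits of field. */
static uint64_t deposit64_model(uint64_t base, uint64_t field,
                                unsigned pos, unsigned len)
{
    uint64_t mask = (len == 64 ? ~UINT64_C(0)
                               : ((UINT64_C(1) << len) - 1)) << pos;

    return (base & ~mask) | ((field << pos) & mask);
}

/* Doubleword as computed in the quoted hunk, with the andi. */
static uint64_t splat_word_with_andi(uint64_t ra)
{
    uint64_t tmp = ra & 0xFFFFFFFFu;

    return deposit64_model(tmp, tmp, 32, 32);
}

/* Same doubleword without the andi: the deposit overwrites bits 32..63
 * of its base operand, so whatever RA held there never survives. */
static uint64_t splat_word(uint64_t ra)
{
    return deposit64_model(ra, ra, 32, 32);
}
```

Both variants return the low word of RA replicated into both halves of
the doubleword, which is then copied to the other half of the VSR.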

* Re: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation Nikunj A Dadhania
@ 2016-09-28 16:07   ` Richard Henderson
  2016-09-29  1:38   ` David Gibson
  1 sibling, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:07 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> Load 8byte at a time and manipulate.
> 
> Big-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Little-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> +-------------+-------------+-------------+-------------+
> 
> Vector load results in:
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

* Re: [Qemu-devel] [PATCH v4 5/9] target-ppc: improve stxvw4x implementation
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 5/9] target-ppc: improve stxvw4x implementation Nikunj A Dadhania
@ 2016-09-28 16:08   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:08 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> Manipulate data and store 8bytes instead of 4bytes.
> 
> Vector:
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Store results in following:
> 
> Big-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Little-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> +-------------+-------------+-------------+-------------+
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

* Re: [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction Nikunj A Dadhania
@ 2016-09-28 16:12   ` Richard Henderson
  2016-09-28 17:11     ` Nikunj A Dadhania
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:12 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> +DEF_HELPER_1(bswap16x4, i64, i64)

DEF_HELPER_FLAGS_1(bswap16x4, TCG_CALL_NO_RWG_SE, i64, i64)

> +    uint64_t m = 0x00ff00ff00ff00ffull;
> +    return ((x & m) << 8) | ((x >> 8) & m);

... although I suppose this is only 5 instructions, and could reasonably be
done inline too.  Especially if you shared the one 64-bit constant across the
two bswaps.


> +    if (ctx->le_mode) {
> +        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
> +        gen_helper_bswap16x4(xth, xth);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
> +        gen_helper_bswap16x4(xtl, xtl);
> +    } else {
> +        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
> +    }

Better to not duplicate this.

  tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
  tcg_gen_addi_tl(EA, EA, 8);
  tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
  if (ctx->le_mode) {
    gen_helper_bswap16x4(xth, xth);
    gen_helper_bswap16x4(xtl, xtl);
  }


r~
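
The restructured load path can be modelled in plain C, as a sketch and
not as the TCG code itself. load_be64 stands in for the MO_BEQ load, and
lxvh8x_dword_model is a hypothetical name for one doubleword of the
instruction; the byte values are the rb16 arrays from the lxv_x.c test:

```c
#include <stdint.h>

/* Swap the two bytes of each 16-bit lane in a 64-bit value. */
static uint64_t bswap16x4(uint64_t x)
{
    const uint64_t m = UINT64_C(0x00ff00ff00ff00ff);

    return ((x & m) << 8) | ((x >> 8) & m);
}

/* Read 8 bytes as a big-endian doubleword, independent of the host. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* One doubleword of lxvh8x: common load, byte swap only in LE mode. */
static uint64_t lxvh8x_dword_model(const uint8_t *p, int le_mode)
{
    uint64_t v = load_be64(p);

    return le_mode ? bswap16x4(v) : v;
}
```

With the LE bytes 01 00 11 10 21 20 31 30 and le_mode set, and with the
BE bytes 00 01 10 11 20 21 30 31 and le_mode clear, the model yields the
same register value 0x0001101120213031.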

* Re: [Qemu-devel] [PATCH v4 7/9] target-ppc: add stxvh8x instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 7/9] target-ppc: add stxvh8x instruction Nikunj A Dadhania
@ 2016-09-28 16:13   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:13 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> +    if (ctx->le_mode) {
> +        gen_helper_bswap16x4(xsh, xsh);
> +        tcg_gen_qemu_st_i64(xsh, EA, ctx->mem_idx, MO_BEQ);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        gen_helper_bswap16x4(xsl, xsl);
> +        tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);

You cannot clobber xsh and xsl like this.  You need temporaries.


r~
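
A plain-C sketch of the point about temporaries, under the assumption
that a store must leave its source register architecturally unchanged.
struct vsr_model and stxvh8x_le_model are hypothetical names; the const
pointer plays the role of the guest VSR that must not be clobbered:

```c
#include <stdint.h>

/* Swap the two bytes of each 16-bit lane in a 64-bit value. */
static uint64_t bswap16x4(uint64_t x)
{
    const uint64_t m = UINT64_C(0x00ff00ff00ff00ff);

    return ((x & m) << 8) | ((x >> 8) & m);
}

/* Write a doubleword to memory in big-endian byte order. */
static void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/* The two doublewords of a VSR. */
struct vsr_model {
    uint64_t hi;
    uint64_t lo;
};

/* LE-mode stxvh8x: the swapped values live in temporaries, so the
 * source register keeps its contents after the store. */
static void stxvh8x_le_model(uint8_t mem[16], const struct vsr_model *xs)
{
    uint64_t t_hi = bswap16x4(xs->hi);
    uint64_t t_lo = bswap16x4(xs->lo);

    store_be64(mem, t_hi);
    store_be64(mem + 8, t_lo);
}
```

Writing the swapped value back into xs, as the quoted hunk does with
xsh and xsl, would change what a later instruction reads from the same
VSR.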

* Re: [Qemu-devel] [PATCH v4 8/9] target-ppc: add lxvb16x instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 8/9] target-ppc: add lxvb16x instruction Nikunj A Dadhania
@ 2016-09-28 16:13   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:13 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> lxvb16x: Load VSX Vector Byte*16
> 
> Little/Big-endian Storage
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> |F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> 
> Vector load results in:
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> |F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)


Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

* Re: [Qemu-devel] [PATCH v4 9/9] target-ppc: add stxvb16x instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 9/9] target-ppc: add stxvb16x instruction Nikunj A Dadhania
@ 2016-09-28 16:13   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 16:13 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh

On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
> stxvb16x: Store VSX Vector Byte*16
> 
> Vector:
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> |F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> 
> Store results in following:
> 
> Little/Big-endian Storage
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> |F0|F1|F2|F3|F4|F5|F6|F7|E0|E1|E2|E3|E4|E5|E6|E7|
> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
> 
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 19 +++++++++++++++++++
>  target-ppc/translate/vsx-ops.inc.c  |  2 ++
>  2 files changed, 21 insertions(+)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~
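
Why the byte-wise store needs no le_mode branch can be sketched in plain
C: two big-endian doubleword stores already place the vector's bytes in
memory in element order. store_be64 stands in for the MO_BEQ store and
stxvb16x_model is a hypothetical name, not QEMU code:

```c
#include <stdint.h>

/* Write a doubleword to memory in big-endian byte order. */
static void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (uint8_t)(v & 0xff);
        v >>= 8;
    }
}

/* stxvb16x: high doubleword at EA, low doubleword at EA + 8, both
 * big-endian, regardless of the guest's endian mode. */
static void stxvb16x_model(uint8_t mem[16], uint64_t xsh, uint64_t xsl)
{
    store_be64(mem, xsh);
    store_be64(mem + 8, xsl);
}
```

With xsh = 0xF0F1F2F3F4F5F6F7 and xsl = 0xE0E1E2E3E4E5E6E7 the memory
image is F0 F1 ... F7 E0 E1 ... E7, matching the diagram in the commit
message.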

* Re: [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction
  2016-09-28 16:01   ` Richard Henderson
@ 2016-09-28 17:06     ` Nikunj A Dadhania
  0 siblings, 0 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28 17:06 UTC (permalink / raw)
  To: Richard Henderson, qemu-ppc, david; +Cc: qemu-devel, benh, Ravi Bangoria

Richard Henderson <rth@twiddle.net> writes:

> On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
>> +    if (!rA(ctx->opcode)) {
>> +        tcg_gen_movi_i64(cpu_vsrh(xT(ctx->opcode)), 0);
>> +    } else {
>> +       tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)]);
>> +    }
>
> Indentation.  Otherwise,

Sure. Not sure how it escaped checkpatch.

Regards
Nikunj

* Re: [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction
  2016-09-28 16:12   ` Richard Henderson
@ 2016-09-28 17:11     ` Nikunj A Dadhania
  2016-09-28 17:22       ` Richard Henderson
  0 siblings, 1 reply; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-28 17:11 UTC (permalink / raw)
  To: Richard Henderson, qemu-ppc, david; +Cc: qemu-devel, benh

Richard Henderson <rth@twiddle.net> writes:

> On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
>> +DEF_HELPER_1(bswap16x4, i64, i64)
>
> DEF_HELPER_FLAGS_1(bswap16x4, TCG_CALL_NO_RWG_SE, i64, i64)
>
>> +    uint64_t m = 0x00ff00ff00ff00ffull;
>> +    return ((x & m) << 8) | ((x >> 8) & m);
>
> ... although I suppose this is only 5 instructions, and could reasonably be
> done inline too.  Especially if you shared the one 64-bit constant across the
> two bswaps.

Something like this:

static void gen_bswap16x4(TCGv_i64 val)
{
    TCGv_i64 mask = tcg_const_i64(0x00FF00FF00FF00FF);
    TCGv_i64 t0 = tcg_temp_new_i64();
    TCGv_i64 t1 = tcg_temp_new_i64();

    /* val = ((val & mask) << 8) | ((val >> 8) & mask) */
    tcg_gen_and_i64(t0, val, mask);
    tcg_gen_shli_i64(t0, t0, 8);
    tcg_gen_shri_i64(t1, val, 8);
    tcg_gen_and_i64(t1, t1, mask);
    tcg_gen_or_i64(val, t0, t1);

    tcg_temp_free_i64(t0);
    tcg_temp_free_i64(t1);
    tcg_temp_free_i64(mask);
}

>
>
>> +    if (ctx->le_mode) {
>> +        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
>> +        gen_helper_bswap16x4(xth, xth);
>> +        tcg_gen_addi_tl(EA, EA, 8);
>> +        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
>> +        gen_helper_bswap16x4(xtl, xtl);
>> +    } else {
>> +        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
>> +        tcg_gen_addi_tl(EA, EA, 8);
>> +        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
>> +    }
>
> Better to not duplicate this.
>
>   tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
>   tcg_gen_addi_tl(EA, EA, 8);
>   tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
>   if (ctx->le_mode) {
>     gen_helper_bswap16x4(xth, xth);
>     gen_helper_bswap16x4(xtl, xtl);
>   }

Sure, much better, thanks.

Regards
Nikunj

* Re: [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction
  2016-09-28 17:11     ` Nikunj A Dadhania
@ 2016-09-28 17:22       ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2016-09-28 17:22 UTC (permalink / raw)
  To: Nikunj A Dadhania, qemu-ppc, david; +Cc: qemu-devel, benh

On 09/28/2016 10:11 AM, Nikunj A Dadhania wrote:
> Richard Henderson <rth@twiddle.net> writes:
> 
>> On 09/27/2016 10:31 PM, Nikunj A Dadhania wrote:
>>> +DEF_HELPER_1(bswap16x4, i64, i64)
>>
>> DEF_HELPER_FLAGS_1(bswap16x4, TCG_CALL_NO_RWG_SE, i64, i64)
>>
>>> +    uint64_t m = 0x00ff00ff00ff00ffull;
>>> +    return ((x & m) << 8) | ((x >> 8) & m);
>>
>> ... although I suppose this is only 5 instructions, and could reasonably be
>> done inline too.  Especially if you shared the one 64-bit constant across the
>> two bswaps.
> 
> Something like this:
> 
> static void gen_bswap16x4(TCGv_i64 val)
> {
>     TCGv_i64 mask = tcg_const_i64(0x00FF00FF00FF00FF);
>     TCGv_i64 t0 = tcg_temp_new_i64();
>     TCGv_i64 t1 = tcg_temp_new_i64();
> 
>     /* val = ((val & mask) << 8) | ((val >> 8) & mask) */
>     tcg_gen_and_i64(t0, val, mask);
>     tcg_gen_shli_i64(t0, t0, 8);
>     tcg_gen_shri_i64(t1, val, 8);
>     tcg_gen_and_i64(t1, t1, mask);
>     tcg_gen_or_i64(val, t0, t1);
> 
>     tcg_temp_free_i64(t0);
>     tcg_temp_free_i64(t1);
>     tcg_temp_free_i64(mask);
> }

Like that, except that since you always perform this twice, you should share
the expensive constant load.  Recall also that you need temporaries for the
store, so

static void gen_bswap16x8(TCGv_i64 outh, TCGv_i64 outl,
                          TCGv_i64 inh, TCGv_i64 inl)


r~

* Re: [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction Nikunj A Dadhania
  2016-09-28 16:01   ` Richard Henderson
@ 2016-09-29  1:29   ` David Gibson
  2016-09-29  3:20     ` Nikunj A Dadhania
  1 sibling, 1 reply; 33+ messages in thread
From: David Gibson @ 2016-09-29  1:29 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: qemu-ppc, rth, qemu-devel, benh, Ravi Bangoria

[-- Attachment #1: Type: text/plain, Size: 2940 bytes --]

On Wed, Sep 28, 2016 at 11:01:20AM +0530, Nikunj A Dadhania wrote:
> From: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
> 
> mtvsrdd: Move To VSR Double Doubleword
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 23 +++++++++++++++++++++++
>  target-ppc/translate/vsx-ops.inc.c  |  1 +
>  2 files changed, 24 insertions(+)
> 
> diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
> index b669e8c..f9db1d4 100644
> --- a/target-ppc/translate/vsx-impl.inc.c
> +++ b/target-ppc/translate/vsx-impl.inc.c
> @@ -234,6 +234,29 @@ static void gen_mfvsrld(DisasContext *ctx)
>      tcg_gen_mov_i64(cpu_gpr[rA(ctx->opcode)], cpu_vsrl(xS(ctx->opcode)));
>  }
>  
> +static void gen_mtvsrdd(DisasContext *ctx)
> +{
> +    if (xT(ctx->opcode) < 32) {
> +        if (unlikely(!ctx->vsx_enabled)) {
> +            gen_exception(ctx, POWERPC_EXCP_VSXU);
> +            return;
> +        }
> +    } else {
> +        if (unlikely(!ctx->altivec_enabled)) {
> +            gen_exception(ctx, POWERPC_EXCP_VPU);
> +            return;
> +        }
> +    }

Huh.. so in the ISA doc version I have at least (p114), the
pseudo-code for the instruction states either vector or VSX
exceptions.  The text however says either FP or vector exceptions.

The pseudo-code version seems more sensible which is what you've
implemented, so I'm guessing this is just an error in the descriptive
text.

It'd be nice to confirm that against real hardware behaviour if
possible though.

> +
> +    if (!rA(ctx->opcode)) {
> +        tcg_gen_movi_i64(cpu_vsrh(xT(ctx->opcode)), 0);
> +    } else {
> +       tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)]);
> +    }
> +
> +    tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_gpr[rB(ctx->opcode)]);
> +}
> +
>  #endif
>  
>  static void gen_xxpermdi(DisasContext *ctx)
> diff --git a/target-ppc/translate/vsx-ops.inc.c b/target-ppc/translate/vsx-ops.inc.c
> index 3b296f8..1287973 100644
> --- a/target-ppc/translate/vsx-ops.inc.c
> +++ b/target-ppc/translate/vsx-ops.inc.c
> @@ -23,6 +23,7 @@ GEN_HANDLER_E(mtvsrwz, 0x1F, 0x13, 0x07, 0x0000F800, PPC_NONE, PPC2_VSX207),
>  GEN_HANDLER_E(mfvsrd, 0x1F, 0x13, 0x01, 0x0000F800, PPC_NONE, PPC2_VSX207),
>  GEN_HANDLER_E(mtvsrd, 0x1F, 0x13, 0x05, 0x0000F800, PPC_NONE, PPC2_VSX207),
>  GEN_HANDLER_E(mfvsrld, 0X1F, 0x13, 0x09, 0x0000F800, PPC_NONE, PPC2_ISA300),
> +GEN_HANDLER_E(mtvsrdd, 0X1F, 0x13, 0x0D, 0x0, PPC_NONE, PPC2_ISA300),
>  #endif
>  
>  #define GEN_XX1FORM(name, opc2, opc3, fl2)                              \

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
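
The behaviour of the quoted gen_mtvsrdd can be summarised in a small
plain-C model, a sketch that mirrors the patch as posted and makes no
claim about real hardware. The enum values and function names are
hypothetical; they stand in for POWERPC_EXCP_VSXU, POWERPC_EXCP_VPU and
the TCG moves:

```c
#include <stdint.h>

/* Which facility-unavailable fault the model would raise. */
enum facility_fault {
    FAULT_NONE,
    FAULT_VSX_UNAVAILABLE,
    FAULT_VECTOR_UNAVAILABLE
};

/* VSRs 0-31 are gated by the VSX facility, VSRs 32-63 (the vector
 * registers) by the Altivec facility, as in the patch. */
static enum facility_fault mtvsrdd_check(unsigned xt, int vsx_enabled,
                                         int altivec_enabled)
{
    if (xt < 32) {
        return vsx_enabled ? FAULT_NONE : FAULT_VSX_UNAVAILABLE;
    }
    return altivec_enabled ? FAULT_NONE : FAULT_VECTOR_UNAVAILABLE;
}

/* The two doublewords of a VSR. */
struct vsr_model {
    uint64_t hi;
    uint64_t lo;
};

/* RA = 0 selects the constant zero, not GPR 0; RB is always a GPR. */
static struct vsr_model mtvsrdd_model(const uint64_t gpr[32],
                                      unsigned ra, unsigned rb)
{
    struct vsr_model xt = { ra ? gpr[ra] : 0, gpr[rb] };

    return xt;
}
```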

* Re: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation Nikunj A Dadhania
  2016-09-28 16:07   ` Richard Henderson
@ 2016-09-29  1:38   ` David Gibson
  2016-09-29  2:34     ` Nikunj A Dadhania
  2016-09-29  3:41     ` Nikunj A Dadhania
  1 sibling, 2 replies; 33+ messages in thread
From: David Gibson @ 2016-09-29  1:38 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: qemu-ppc, rth, qemu-devel, benh

[-- Attachment #1: Type: text/plain, Size: 3488 bytes --]

On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> Load 8byte at a time and manipulate.
> 
> Big-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+
> 
> Little-Endian Storage
> +-------------+-------------+-------------+-------------+
> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> +-------------+-------------+-------------+-------------+
> 
> Vector load results in:
> +-------------+-------------+-------------+-------------+
> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> +-------------+-------------+-------------+-------------+

Ok.  I'm guessing from this that implementing those GPR<->VSR
instructions showed that the earlier versions were endian-incorrect as
I suspected.

Have you verified that this new implementation is actually faster (or
at least no slower) on LE than the original implementation with
individual 32-bit loads?

> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
> ---
>  target-ppc/translate/vsx-impl.inc.c | 33 +++++++++++++++++++--------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
> index 74d0533..1eca042 100644
> --- a/target-ppc/translate/vsx-impl.inc.c
> +++ b/target-ppc/translate/vsx-impl.inc.c
> @@ -75,7 +75,6 @@ static void gen_lxvdsx(DisasContext *ctx)
>  static void gen_lxvw4x(DisasContext *ctx)
>  {
>      TCGv EA;
> -    TCGv_i64 tmp;
>      TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
>      TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
>      if (unlikely(!ctx->vsx_enabled)) {
> @@ -84,22 +83,28 @@ static void gen_lxvw4x(DisasContext *ctx)
>      }
>      gen_set_access_type(ctx, ACCESS_INT);
>      EA = tcg_temp_new();
> -    tmp = tcg_temp_new_i64();
>  
>      gen_addr_reg_index(ctx, EA);
> -    gen_qemu_ld32u_i64(ctx, tmp, EA);
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, xth, EA);
> -    tcg_gen_deposit_i64(xth, xth, tmp, 32, 32);
> -
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, tmp, EA);
> -    tcg_gen_addi_tl(EA, EA, 4);
> -    gen_qemu_ld32u_i64(ctx, xtl, EA);
> -    tcg_gen_deposit_i64(xtl, xtl, tmp, 32, 32);
> -
> +    if (ctx->le_mode) {
> +        TCGv_i64 t0, t1;
> +
> +        t0 = tcg_temp_new_i64();
> +        t1 = tcg_temp_new_i64();
> +        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
> +        tcg_gen_shri_i64(t1, t0, 32);
> +        tcg_gen_deposit_i64(xth, t1, t0, 32, 32);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(t0, EA, ctx->mem_idx, MO_LEQ);
> +        tcg_gen_shri_i64(t1, t0, 32);
> +        tcg_gen_deposit_i64(xtl, t1, t0, 32, 32);
> +        tcg_temp_free_i64(t0);
> +        tcg_temp_free_i64(t1);
> +    } else {
> +        tcg_gen_qemu_ld_i64(xth, EA, ctx->mem_idx, MO_BEQ);
> +        tcg_gen_addi_tl(EA, EA, 8);
> +        tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
> +    }
>      tcg_temp_free(EA);
> -    tcg_temp_free_i64(tmp);
>  }
>  
>  #define VSX_STORE_SCALAR(name, operation)                     \

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
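
The LE branch of the quoted gen_lxvw4x can be traced with a plain-C
model, a sketch assuming the usual deposit semantics. load_le64 stands
in for the MO_LEQ load, and lxvw4x_le_dword_model is a hypothetical
name for one doubleword of the instruction:

```c
#include <stdint.h>

/* Read 8 bytes as a little-endian doubleword, independent of the host. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* One doubleword of lxvw4x in LE mode: an 8-byte LE load leaves the two
 * words in swapped positions, so exchange the upper and lower halves. */
static uint64_t lxvw4x_le_dword_model(const uint8_t *p)
{
    uint64_t t0 = load_le64(p);
    uint64_t t1 = t0 >> 32;

    /* deposit(t1, t0, 32, 32): keep the low word of t1 and place the
     * low word of t0 in bits 32..63. */
    return (t1 & 0xFFFFFFFFu) | (t0 << 32);
}
```

For the LE storage bytes 33 22 11 00 77 66 55 44 from the commit
message, the load gives 0x4455667700112233 and the swap produces
0x0011223344556677, the same value a BE load of 00 11 22 33 44 55 66 77
would give.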

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-29  1:38   ` David Gibson
@ 2016-09-29  2:34     ` Nikunj A Dadhania
  2016-09-29  3:41     ` Nikunj A Dadhania
  1 sibling, 0 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-29  2:34 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, rth, qemu-devel, benh

David Gibson <david@gibson.dropbear.id.au> writes:

> [ Unknown signature status ]
> On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
>> Load 8byte at a time and manipulate.
>> 
>> Big-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>> 
>> Little-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
>> +-------------+-------------+-------------+-------------+
>> 
>> Vector load results in:
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>
> Ok.  I'm guessing from this that implementing those GPR<->VSR
> instructions showed that the earlier versions were endian-incorrect as
> I suspected.
>
> Have you verified that this new implementation is actually faster (or
> at least no slower) on LE than the original implementation with
> individual 32-bit stores?

I haven't; I will check and get back to you.

Regards
Nikunj


* Re: [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction
  2016-09-29  1:29   ` David Gibson
@ 2016-09-29  3:20     ` Nikunj A Dadhania
  0 siblings, 0 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-29  3:20 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, rth, qemu-devel, benh, Ravi Bangoria

David Gibson <david@gibson.dropbear.id.au> writes:

> [ Unknown signature status ]
> On Wed, Sep 28, 2016 at 11:01:20AM +0530, Nikunj A Dadhania wrote:
>> From: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
>> 
>> mtvsrdd: Move To VSR Double Doubleword
>> 
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
>> Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
>> ---
>>  target-ppc/translate/vsx-impl.inc.c | 23 +++++++++++++++++++++++
>>  target-ppc/translate/vsx-ops.inc.c  |  1 +
>>  2 files changed, 24 insertions(+)
>> 
>> diff --git a/target-ppc/translate/vsx-impl.inc.c b/target-ppc/translate/vsx-impl.inc.c
>> index b669e8c..f9db1d4 100644
>> --- a/target-ppc/translate/vsx-impl.inc.c
>> +++ b/target-ppc/translate/vsx-impl.inc.c
>> @@ -234,6 +234,29 @@ static void gen_mfvsrld(DisasContext *ctx)
>>      tcg_gen_mov_i64(cpu_gpr[rA(ctx->opcode)], cpu_vsrl(xS(ctx->opcode)));
>>  }
>>  
>> +static void gen_mtvsrdd(DisasContext *ctx)
>> +{
>> +    if (xT(ctx->opcode) < 32) {
>> +        if (unlikely(!ctx->vsx_enabled)) {
>> +            gen_exception(ctx, POWERPC_EXCP_VSXU);
>> +            return;
>> +        }
>> +    } else {
>> +        if (unlikely(!ctx->altivec_enabled)) {
>> +            gen_exception(ctx, POWERPC_EXCP_VPU);
>> +            return;
>> +        }
>> +    }
>
> Huh.. so in the ISA doc version I have at least (p114), the
> pseudo-code for the instruction states either vector or VSX
> exceptions.  The text however says either FP or vector exceptions.
>
> The pseudo-code version seems more sensible which is what you've
> implemented, so I'm guessing this is just an error in the descriptive
> text.

Right.
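
To spell out the rule the patch follows (the pseudo-code reading), here is a standalone sketch of the facility check. The enum and function names are made up for illustration and are not QEMU identifiers; the split at 32 is taken from the hunk quoted above.

```c
#include <assert.h>

/* Hypothetical result codes, standing in for POWERPC_EXCP_VSXU / _VPU. */
enum facility_exc {
    EXC_NONE,
    EXC_VSX_UNAVAILABLE,
    EXC_VECTOR_UNAVAILABLE
};

/* Which exception (if any) mtvsrdd raises for target register xt,
 * following the quoted patch: VSRs 0-31 need VSX, VSRs 32-63 need Altivec. */
static enum facility_exc mtvsrdd_facility_check(unsigned xt,
                                                int vsx_enabled,
                                                int altivec_enabled)
{
    assert(xt < 64); /* there are 64 VSRs */
    if (xt < 32) {
        return vsx_enabled ? EXC_NONE : EXC_VSX_UNAVAILABLE;
    }
    return altivec_enabled ? EXC_NONE : EXC_VECTOR_UNAVAILABLE;
}
```

So with VSX disabled and Altivec enabled, a write to VSR 5 would fault while a write to VSR 40 would go through.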

> It'd be nice to confirm that against real hardware behaviour if
> possible though.

Sure, will check it.

Regards,
Nikunj


* Re: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-29  1:38   ` David Gibson
  2016-09-29  2:34     ` Nikunj A Dadhania
@ 2016-09-29  3:41     ` Nikunj A Dadhania
  2016-09-29  3:48       ` Richard Henderson
  2016-09-29  3:55       ` David Gibson
  1 sibling, 2 replies; 33+ messages in thread
From: Nikunj A Dadhania @ 2016-09-29  3:41 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, rth, qemu-devel, benh

David Gibson <david@gibson.dropbear.id.au> writes:

> [ Unknown signature status ]
> On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
>> Load 8byte at a time and manipulate.
>> 
>> Big-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>> 
>> Little-Endian Storage
>> +-------------+-------------+-------------+-------------+
>> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
>> +-------------+-------------+-------------+-------------+
>> 
>> Vector load results in:
>> +-------------+-------------+-------------+-------------+
>> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
>> +-------------+-------------+-------------+-------------+
>
> Ok.  I'm guessing from this that implementing those GPR<->VSR
> instructions showed that the earlier versions were endian-incorrect as
> I suspected.
>
> Have you verified that this new implementation is actually faster (or
> at least no slower) on LE than the original implementation with
> individual 32-bit stores?

Results of a million iterations of lxvw4x, mfvsrd/mfvsrld and print:

Without patch:
==============
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
real	0m2.812s
user	0m2.792s
sys	0m0.020s
[tcg_test]$

With patch:
===========
[tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
real	0m2.801s
user	0m2.783s
sys	0m0.018s
[tcg_test]$

Not much perceivable difference. Is there a better way to benchmark?

Regards
Nikunj


* Re: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-29  3:41     ` Nikunj A Dadhania
@ 2016-09-29  3:48       ` Richard Henderson
  2016-09-29  3:57         ` David Gibson
  2016-09-29  3:55       ` David Gibson
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2016-09-29  3:48 UTC (permalink / raw)
  To: Nikunj A Dadhania, David Gibson; +Cc: qemu-ppc, qemu-devel, benh

On 09/28/2016 08:41 PM, Nikunj A Dadhania wrote:
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.812s
> user	0m2.792s
> sys	0m0.020s
> [tcg_test]$
>
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.801s
> user	0m2.783s
> sys	0m0.018s
> [tcg_test]$
>
> Not much perceivable difference, is there a better way to benchmark?

There should be more of a difference for softmmu, since the tlb lookup for the 
memory is more expensive.


r~


* Re: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-29  3:41     ` Nikunj A Dadhania
  2016-09-29  3:48       ` Richard Henderson
@ 2016-09-29  3:55       ` David Gibson
  1 sibling, 0 replies; 33+ messages in thread
From: David Gibson @ 2016-09-29  3:55 UTC (permalink / raw)
  To: Nikunj A Dadhania; +Cc: qemu-ppc, rth, qemu-devel, benh

[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]

On Thu, Sep 29, 2016 at 09:11:10AM +0530, Nikunj A Dadhania wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > [ Unknown signature status ]
> > On Wed, Sep 28, 2016 at 11:01:22AM +0530, Nikunj A Dadhania wrote:
> >> Load 8byte at a time and manipulate.
> >> 
> >> Big-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >> 
> >> Little-Endian Storage
> >> +-------------+-------------+-------------+-------------+
> >> | 33 22 11 00 | 77 66 55 44 | BB AA 99 88 | FF EE DD CC |
> >> +-------------+-------------+-------------+-------------+
> >> 
> >> Vector load results in:
> >> +-------------+-------------+-------------+-------------+
> >> | 00 11 22 33 | 44 55 66 77 | 88 99 AA BB | CC DD EE FF |
> >> +-------------+-------------+-------------+-------------+
> >
> > Ok.  I'm guessing from this that implementing those GPR<->VSR
> > instructions showed that the earlier versions were endian-incorrect as
> > I suspected.
> >
> > Have you verified that this new implementation is actually faster (or
> > at least no slower) on LE than the original implementation with
> > individual 32-bit stores?
> 
> Result of million lxvw4x, mfvsrd/mfvsrld and print
> 
> Without patch:
> ==============
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.812s
> user	0m2.792s
> sys	0m0.020s
> [tcg_test]$
> 
> With patch:
> ===========
> [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> real	0m2.801s
> user	0m2.783s
> sys	0m0.018s
> [tcg_test]$
> 
> Not much perceivable difference, is there a better way to benchmark?

Not dramatically, that I can think of.  A few tweaks you can make:
    * Increase the loop counter so the test simply runs for longer
    * Also run the test multiple times, so you can get an idea of how
      much the results vary from one run to another
    * Run the test on a system that's as idle of other activity as you
      can make it (at both host and guest level).

For our purposes the user time is probably the meaningful thing here,
and should show less variance than the system and real time.
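
The tweaks above could be wrapped into a small harness along these lines. It is a rough sketch, assuming a POSIX host with getrusage(); workload() is a hypothetical stand-in for the real instruction loop, and bench() and user_seconds() are illustrative names only.

```c
#include <assert.h>
#include <sys/resource.h>

/* User CPU time consumed by this process so far, in seconds. */
static double user_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
}

/* Hypothetical stand-in for the code under test; the volatile sink
 * keeps the compiler from deleting the loop. */
static volatile unsigned long sink;
static void workload(unsigned long iters)
{
    for (unsigned long i = 0; i < iters; i++) {
        sink += i;
    }
}

/* Run fn several times and report the smallest and largest user time,
 * which gives a feel for the run-to-run variation. */
static void bench(void (*fn)(unsigned long), unsigned long iters, int runs,
                  double *min_s, double *max_s)
{
    assert(runs > 0);
    *min_s = 0.0;
    *max_s = 0.0;
    for (int r = 0; r < runs; r++) {
        double start = user_seconds();
        fn(iters);
        double elapsed = user_seconds() - start;
        if (r == 0 || elapsed < *min_s) {
            *min_s = elapsed;
        }
        if (r == 0 || elapsed > *max_s) {
            *max_s = elapsed;
        }
    }
}
```

Calling bench(workload, iters, 10, &lo, &hi) with a large iters and comparing the lo..hi ranges with and without the patch should say more than a single run of time(1).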

Note that it would be interesting to get these results for both a
power and x86 host.

In any case the results above are enough to convince me that the
change isn't likely to be a significant regression.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation
  2016-09-29  3:48       ` Richard Henderson
@ 2016-09-29  3:57         ` David Gibson
  0 siblings, 0 replies; 33+ messages in thread
From: David Gibson @ 2016-09-29  3:57 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Nikunj A Dadhania, qemu-ppc, qemu-devel, benh

[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]

On Wed, Sep 28, 2016 at 08:48:54PM -0700, Richard Henderson wrote:
> On 09/28/2016 08:41 PM, Nikunj A Dadhania wrote:
> > Without patch:
> > ==============
> > [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> > real	0m2.812s
> > user	0m2.792s
> > sys	0m0.020s
> > [tcg_test]$
> > 
> > With patch:
> > ===========
> > [tcg_test]$ time ../qemu/ppc64le-linux-user/qemu-ppc64le  -cpu POWER9 le_lxvw4x  >/dev/null
> > real	0m2.801s
> > user	0m2.783s
> > sys	0m0.018s
> > [tcg_test]$
> > 
> > Not much perceivable difference, is there a better way to benchmark?
> 
> There should be more of a difference for softmmu, since the tlb lookup for
> the memory is more expensive.

Good point.  Oh.. also, I'd remove the prints from the benchmark for
this purpose.  The time involved in the syscalls and whatnot for the
print will just add noise to the measurement (sending to /dev/null
reduces the impact, but it's probably still significant compared to a
simple math operation).
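
One way to drop the per-iteration print while still consuming every loaded value is to fold the results into a checksum and print it once after the loop. A minimal sketch, where load_pair() is a hypothetical placeholder for the lxvw4x + mfvsrd/mfvsrld sequence (the real test would use inline asm there):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for "lxvw4x, then mfvsrd/mfvsrld":
 * hands back the two doublewords of a 16-byte vector. */
static void load_pair(const uint64_t src[2], uint64_t *hi, uint64_t *lo)
{
    *hi = src[0];
    *lo = src[1];
}

/* Run the load `iters` times and fold every result into one value, so
 * nothing needs printing inside the loop. Mixing in `i` keeps the
 * iterations from being identical. */
static uint64_t run_quiet(const uint64_t src[2], unsigned long iters)
{
    assert(src != NULL);
    uint64_t checksum = 0;
    for (unsigned long i = 0; i < iters; i++) {
        uint64_t hi, lo;
        load_pair(src, &hi, &lo);
        checksum += hi ^ lo ^ i;
    }
    return checksum;
}
```

Printing the single return value at the end keeps a correctness check without paying for a syscall on every iteration.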

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



end of thread, other threads:[~2016-09-29  4:23 UTC | newest]

Thread overview: 33+ messages
2016-09-28  5:31 [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 1/9] target-ppc: Implement mfvsrld instruction Nikunj A Dadhania
2016-09-28 16:03   ` Richard Henderson
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 2/9] target-ppc: Implement mtvsrdd instruction Nikunj A Dadhania
2016-09-28 16:01   ` Richard Henderson
2016-09-28 17:06     ` Nikunj A Dadhania
2016-09-29  1:29   ` David Gibson
2016-09-29  3:20     ` Nikunj A Dadhania
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 3/9] target-ppc: Implement mtvsrws instruction Nikunj A Dadhania
2016-09-28 16:04   ` Richard Henderson
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 4/9] target-ppc: improve lxvw4x implementation Nikunj A Dadhania
2016-09-28 16:07   ` Richard Henderson
2016-09-29  1:38   ` David Gibson
2016-09-29  2:34     ` Nikunj A Dadhania
2016-09-29  3:41     ` Nikunj A Dadhania
2016-09-29  3:48       ` Richard Henderson
2016-09-29  3:57         ` David Gibson
2016-09-29  3:55       ` David Gibson
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 5/9] target-ppc: improve stxvw4x implementation Nikunj A Dadhania
2016-09-28 16:08   ` Richard Henderson
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 6/9] target-ppc: add lxvh8x instruction Nikunj A Dadhania
2016-09-28 16:12   ` Richard Henderson
2016-09-28 17:11     ` Nikunj A Dadhania
2016-09-28 17:22       ` Richard Henderson
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 7/9] target-ppc: add stxvh8x instruction Nikunj A Dadhania
2016-09-28 16:13   ` Richard Henderson
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 8/9] target-ppc: add lxvb16x instruction Nikunj A Dadhania
2016-09-28 16:13   ` Richard Henderson
2016-09-28  5:31 ` [Qemu-devel] [PATCH v4 9/9] target-ppc: add stxvb16x instruction Nikunj A Dadhania
2016-09-28 16:13   ` Richard Henderson
2016-09-28  5:38 ` [Qemu-devel] [PATCH v4 0/9] POWER9 TCG enablements - part4 Nikunj A Dadhania
2016-09-28  9:28 ` [Qemu-devel] [Qemu-ppc] " Thomas Huth
2016-09-28 11:34   ` Nikunj A Dadhania
