* [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c
@ 2019-01-27  9:02 Mark Cave-Ayland
  2019-01-27  9:02 ` [Qemu-devel] [PATCH v3 1/8] target/ppc: implement complete set of Vsr* macros Mark Cave-Ayland
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:02 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

While working on the TCG vector operations patchset, it became apparent that there
are a large number of endian-based hacks in int_helper.c which can be removed by
making use of the various Vsr* macros.

Patch 1 is simple enough, and implements the complete set of Vsr* macros for
both big endian and little endian hosts.
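
For reference, the idea behind the Vsr* accessors is sketched below (a
simplified illustration only; the actual definitions live in
target/ppc/internal.h and are extended by patch 1 to cover the signed types):

  #if defined(HOST_WORDS_BIGENDIAN)
  #define VsrD(i) u64[i]          /* element 0 is the high doubleword */
  #else
  #define VsrD(i) u64[1 - (i)]    /* flip the index on LE hosts */
  #endif

so that VsrD(0) always refers to the architecturally most significant
doubleword regardless of the host byte order.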

Patches 2 and 3 rework the vector merge and multiply algorithms so that they
no longer require the endian-dependent HI_IDX and LO_IDX macros.

Patches 4 and 5 then completely remove the HI_IDX/LO_IDX and EL_IDX macros by
replacing the element accesses with their equivalent Vsr* macro instead.

Patches 6 and 7 tidy up the VEXT_SIGNED macro and fix a potential shift bug
in the ROTRu32 and ROTRu64 macros pointed out by Richard during the review of
v1.

Finally, patch 8 is an inspection-based removal of other HOST_WORDS_BIGENDIAN
hacks in int_helper.c, again replacing accesses with the relevant Vsr* macro
instead.

Note that there are still some endian hacks left in int_helper.c after this
patchset: in particular the delightfully evil VECTOR_FOR_INORDER_I macro still
remains in places where the index variable is used for purposes other than
accessing elements within the vector.

There were also some parts where additional conversion could be done, but I
wasn't confident enough to make the change without access to PPC64 test images
or real big-endian host hardware.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>


v3:
- Rebase onto master
- Add R-B tags from Richard

v2:
- Add R-B tags from Richard
- Add patches 6 and 7 to simplify the VEXT_SIGNED macro and fix a potential
  shift bug in the ROTRu32 and ROTRu64 macros
- Don't require ordered access for VNEG and VGENERIC_DO macros as pointed out
  by Richard


Mark Cave-Ayland (8):
  target/ppc: implement complete set of Vsr* macros
  target/ppc: rework vmrg{l,h}{b,h,w} instructions to use Vsr* macros
  target/ppc: rework vmul{e,o}{s,u}{b,h,w} instructions to use Vsr*
    macros
  target/ppc: eliminate use of HI_IDX and LO_IDX macros from
    int_helper.c
  target/ppc: eliminate use of EL_IDX macros from int_helper.c
  target/ppc: simplify VEXT_SIGNED macro in int_helper.c
  target/ppc: remove ROTRu32 and ROTRu64 macros from int_helper.c
  target/ppc: remove various HOST_WORDS_BIGENDIAN hacks in int_helper.c

 target/ppc/int_helper.c | 521 ++++++++++++++++++++----------------------------
 target/ppc/internal.h   |   9 +-
 2 files changed, 223 insertions(+), 307 deletions(-)

-- 
2.11.0


* [Qemu-devel] [PATCH v3 1/8] target/ppc: implement complete set of Vsr* macros
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
@ 2019-01-27  9:02 ` Mark Cave-Ayland
  2019-01-28  9:19   ` David Gibson
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use " Mark Cave-Ayland
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:02 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

This prepares us for eliminating the use of direct array access within the VMX
instruction implementations.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/ppc/internal.h | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/target/ppc/internal.h b/target/ppc/internal.h
index c7c0f77dd6..f26a71ffcf 100644
--- a/target/ppc/internal.h
+++ b/target/ppc/internal.h
@@ -206,16 +206,23 @@ EXTRACT_HELPER_SPLIT_3(DCMX_XV, 5, 16, 0, 1, 2, 5, 1, 6, 6);
 
 #if defined(HOST_WORDS_BIGENDIAN)
 #define VsrB(i) u8[i]
+#define VsrSB(i) s8[i]
 #define VsrH(i) u16[i]
+#define VsrSH(i) s16[i]
 #define VsrW(i) u32[i]
+#define VsrSW(i) s32[i]
 #define VsrD(i) u64[i]
+#define VsrSD(i) s64[i]
 #else
 #define VsrB(i) u8[15 - (i)]
+#define VsrSB(i) s8[15 - (i)]
 #define VsrH(i) u16[7 - (i)]
+#define VsrSH(i) s16[7 - (i)]
 #define VsrW(i) u32[3 - (i)]
+#define VsrSW(i) s32[3 - (i)]
 #define VsrD(i) u64[1 - (i)]
+#define VsrSD(i) s64[1 - (i)]
 #endif
-
 static inline void getVSR(int n, ppc_vsr_t *vsr, CPUPPCState *env)
 {
     vsr->VsrD(0) = env->vsr[n].u64[0];
-- 
2.11.0


* [Qemu-devel] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
  2019-01-27  9:02 ` [Qemu-devel] [PATCH v3 1/8] target/ppc: implement complete set of Vsr* macros Mark Cave-Ayland
@ 2019-01-27  9:03 ` Mark Cave-Ayland
  2019-01-27 12:07   ` [Qemu-devel] [Qemu-ppc] " BALATON Zoltan
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 3/8] target/ppc: rework vmul{e, o}{s, u}{b, " Mark Cave-Ayland
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:03 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

The current implementations make use of the endian-specific macros MRGLO/MRGHI
and also reference HI_IDX and LO_IDX directly to calculate array offsets.

Rework the implementation to use the Vsr* macros so that these per-endian
references can be removed.
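
To make the new indexing easier to follow, the VMRG_DOHI variant below expands
for vmrghb roughly as follows (logical VsrB element numbers, with 0 being the
most significant byte):

    result.VsrB(0)  = a->VsrB(0);   /* i = 0  */
    result.VsrB(1)  = b->VsrB(0);   /* i = 1  */
    result.VsrB(2)  = a->VsrB(1);   /* i = 2  */
    ...
    result.VsrB(15) = b->VsrB(7);   /* i = 15 */

i.e. the high halves of a and b are interleaved without any reference to the
host byte order.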

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/int_helper.c | 52 ++++++++++++++++++++++++-------------------------
 1 file changed, 25 insertions(+), 27 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 598731d47a..f084a706ee 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -976,43 +976,41 @@ void helper_vmladduhm(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
     }
 }
 
-#define VMRG_DO(name, element, highp)                                   \
+#define VMRG_DOLO(name, element, access)                                \
     void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)       \
     {                                                                   \
         ppc_avr_t result;                                               \
         int i;                                                          \
-        size_t n_elems = ARRAY_SIZE(r->element);                        \
+        int m = ARRAY_SIZE(r->element) - 1;                             \
                                                                         \
-        for (i = 0; i < n_elems / 2; i++) {                             \
-            if (highp) {                                                \
-                result.element[i*2+HI_IDX] = a->element[i];             \
-                result.element[i*2+LO_IDX] = b->element[i];             \
-            } else {                                                    \
-                result.element[n_elems - i * 2 - (1 + HI_IDX)] =        \
-                    b->element[n_elems - i - 1];                        \
-                result.element[n_elems - i * 2 - (1 + LO_IDX)] =        \
-                    a->element[n_elems - i - 1];                        \
-            }                                                           \
+        for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
+            result.access(m - i) = (i & 1) ? a->access(m - (i >> 1))    \
+                                           : b->access(m - (i >> 1));   \
         }                                                               \
         *r = result;                                                    \
     }
-#if defined(HOST_WORDS_BIGENDIAN)
-#define MRGHI 0
-#define MRGLO 1
-#else
-#define MRGHI 1
-#define MRGLO 0
-#endif
-#define VMRG(suffix, element)                   \
-    VMRG_DO(mrgl##suffix, element, MRGHI)       \
-    VMRG_DO(mrgh##suffix, element, MRGLO)
-VMRG(b, u8)
-VMRG(h, u16)
-VMRG(w, u32)
+
+#define VMRG_DOHI(name, element, access)                                \
+    void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)       \
+    {                                                                   \
+        ppc_avr_t result;                                               \
+        int i;                                                          \
+                                                                        \
+        for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
+            result.access(i) = (i & 1) ? b->access(i >> 1)              \
+                                       : a->access(i >> 1);             \
+        }                                                               \
+        *r = result;                                                    \
+    }
+
+#define VMRG(suffix, element, access)                  \
+    VMRG_DOLO(mrgl##suffix, element, access)           \
+    VMRG_DOHI(mrgh##suffix, element, access)
+VMRG(b, u8, VsrB)
+VMRG(h, u16, VsrH)
+VMRG(w, u32, VsrW)
 #undef VMRG_DO
 #undef VMRG
-#undef MRGHI
-#undef MRGLO
 
 void helper_vmsummbm(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a,
                      ppc_avr_t *b, ppc_avr_t *c)
-- 
2.11.0


* [Qemu-devel] [PATCH v3 3/8] target/ppc: rework vmul{e, o}{s, u}{b, h, w} instructions to use Vsr* macros
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
  2019-01-27  9:02 ` [Qemu-devel] [PATCH v3 1/8] target/ppc: implement complete set of Vsr* macros Mark Cave-Ayland
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use " Mark Cave-Ayland
@ 2019-01-27  9:03 ` Mark Cave-Ayland
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 4/8] target/ppc: eliminate use of HI_IDX and LO_IDX macros from int_helper.c Mark Cave-Ayland
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:03 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

The current implementations make use of the endian-specific macros HI_IDX and
LO_IDX directly to calculate array offsets.

Rework the implementation to use the Vsr* macros so that these per-endian
references can be removed.
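
As an illustration, with the VMUL_DO_EVN variant below vmuleub ends up
computing (logical element numbers):

    r->VsrH(0) = (uint16_t)a->VsrB(0)  * (uint16_t)b->VsrB(0);   /* i = 0  */
    r->VsrH(1) = (uint16_t)a->VsrB(2)  * (uint16_t)b->VsrB(2);   /* i = 2  */
    ...
    r->VsrH(7) = (uint16_t)a->VsrB(14) * (uint16_t)b->VsrB(14);  /* i = 14 */

with the odd variant taking the i + 1 elements instead.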

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/ppc/int_helper.c | 48 +++++++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index f084a706ee..dc5c9fb8d8 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -1118,33 +1118,39 @@ void helper_vmsumuhs(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a,
     }
 }
 
-#define VMUL_DO(name, mul_element, prod_element, cast, evenp)           \
+#define VMUL_DO_EVN(name, mul_element, mul_access, prod_access, cast)   \
     void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)       \
     {                                                                   \
         int i;                                                          \
                                                                         \
-        VECTOR_FOR_INORDER_I(i, prod_element) {                         \
-            if (evenp) {                                                \
-                r->prod_element[i] =                                    \
-                    (cast)a->mul_element[i * 2 + HI_IDX] *              \
-                    (cast)b->mul_element[i * 2 + HI_IDX];               \
-            } else {                                                    \
-                r->prod_element[i] =                                    \
-                    (cast)a->mul_element[i * 2 + LO_IDX] *              \
-                    (cast)b->mul_element[i * 2 + LO_IDX];               \
-            }                                                           \
+        for (i = 0; i < ARRAY_SIZE(r->mul_element); i += 2) {           \
+            r->prod_access(i >> 1) = (cast)a->mul_access(i) *           \
+                                     (cast)b->mul_access(i);            \
+        }                                                               \
+    }
+
+#define VMUL_DO_ODD(name, mul_element, mul_access, prod_access, cast)   \
+    void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)       \
+    {                                                                   \
+        int i;                                                          \
+                                                                        \
+        for (i = 0; i < ARRAY_SIZE(r->mul_element); i += 2) {           \
+            r->prod_access(i >> 1) = (cast)a->mul_access(i + 1) *       \
+                                     (cast)b->mul_access(i + 1);        \
         }                                                               \
     }
-#define VMUL(suffix, mul_element, prod_element, cast)            \
-    VMUL_DO(mule##suffix, mul_element, prod_element, cast, 1)    \
-    VMUL_DO(mulo##suffix, mul_element, prod_element, cast, 0)
-VMUL(sb, s8, s16, int16_t)
-VMUL(sh, s16, s32, int32_t)
-VMUL(sw, s32, s64, int64_t)
-VMUL(ub, u8, u16, uint16_t)
-VMUL(uh, u16, u32, uint32_t)
-VMUL(uw, u32, u64, uint64_t)
-#undef VMUL_DO
+
+#define VMUL(suffix, mul_element, mul_access, prod_access, cast)       \
+    VMUL_DO_EVN(mule##suffix, mul_element, mul_access, prod_access, cast)  \
+    VMUL_DO_ODD(mulo##suffix, mul_element, mul_access, prod_access, cast)
+VMUL(sb, s8, VsrSB, VsrSH, int16_t)
+VMUL(sh, s16, VsrSH, VsrSW, int32_t)
+VMUL(sw, s32, VsrSW, VsrSD, int64_t)
+VMUL(ub, u8, VsrB, VsrH, uint16_t)
+VMUL(uh, u16, VsrH, VsrW, uint32_t)
+VMUL(uw, u32, VsrW, VsrD, uint64_t)
+#undef VMUL_DO_EVN
+#undef VMUL_DO_ODD
 #undef VMUL
 
 void helper_vperm(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b,
-- 
2.11.0


* [Qemu-devel] [PATCH v3 4/8] target/ppc: eliminate use of HI_IDX and LO_IDX macros from int_helper.c
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
                   ` (2 preceding siblings ...)
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 3/8] target/ppc: rework vmul{e, o}{s, u}{b, " Mark Cave-Ayland
@ 2019-01-27  9:03 ` Mark Cave-Ayland
  2019-01-27 19:13   ` Richard Henderson
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 5/8] target/ppc: eliminate use of EL_IDX " Mark Cave-Ayland
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:03 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

The original purpose of these macros was to correctly reference the high and low
parts of the VSRs regardless of the host endianness.

Replace these direct references to high and low parts with the relevant VsrD
macro instead, and completely remove the now-unused HI_IDX and LO_IDX macros.
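
In other words, on both host endiannesses the following pairs refer to the
same storage, so the conversion is purely mechanical:

    r->u64[HI_IDX]   <->   r->VsrD(0)   /* architecturally high doubleword */
    r->u64[LO_IDX]   <->   r->VsrD(1)   /* architecturally low doubleword  */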

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/int_helper.c | 180 +++++++++++++++++++++++-------------------------
 1 file changed, 85 insertions(+), 95 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index dc5c9fb8d8..addbe54c21 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -389,14 +389,6 @@ target_ulong helper_602_mfrom(target_ulong arg)
 /*****************************************************************************/
 /* Altivec extension helpers */
 #if defined(HOST_WORDS_BIGENDIAN)
-#define HI_IDX 0
-#define LO_IDX 1
-#else
-#define HI_IDX 1
-#define LO_IDX 0
-#endif
-
-#if defined(HOST_WORDS_BIGENDIAN)
 #define VECTOR_FOR_INORDER_I(index, element)                    \
     for (index = 0; index < ARRAY_SIZE(r->element); index++)
 #else
@@ -514,8 +506,8 @@ void helper_vprtybq(ppc_avr_t *r, ppc_avr_t *b)
     res ^= res >> 32;
     res ^= res >> 16;
     res ^= res >> 8;
-    r->u64[LO_IDX] = res & 1;
-    r->u64[HI_IDX] = 0;
+    r->VsrD(1) = res & 1;
+    r->VsrD(0) = 0;
 }
 
 #define VARITH_DO(name, op, element)                                    \
@@ -1243,8 +1235,8 @@ void helper_vbpermq(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
         }
     }
 
-    r->u64[HI_IDX] = perm;
-    r->u64[LO_IDX] = 0;
+    r->VsrD(0) = perm;
+    r->VsrD(1) = 0;
 }
 
 #undef VBPERMQ_INDEX
@@ -1573,25 +1565,25 @@ void helper_vpmsumd(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
     ppc_avr_t prod[2];
 
     VECTOR_FOR_INORDER_I(i, u64) {
-        prod[i].u64[LO_IDX] = prod[i].u64[HI_IDX] = 0;
+        prod[i].VsrD(1) = prod[i].VsrD(0) = 0;
         for (j = 0; j < 64; j++) {
             if (a->u64[i] & (1ull<<j)) {
                 ppc_avr_t bshift;
                 if (j == 0) {
-                    bshift.u64[HI_IDX] = 0;
-                    bshift.u64[LO_IDX] = b->u64[i];
+                    bshift.VsrD(0) = 0;
+                    bshift.VsrD(1) = b->u64[i];
                 } else {
-                    bshift.u64[HI_IDX] = b->u64[i] >> (64-j);
-                    bshift.u64[LO_IDX] = b->u64[i] << j;
+                    bshift.VsrD(0) = b->u64[i] >> (64 - j);
+                    bshift.VsrD(1) = b->u64[i] << j;
                 }
-                prod[i].u64[LO_IDX] ^= bshift.u64[LO_IDX];
-                prod[i].u64[HI_IDX] ^= bshift.u64[HI_IDX];
+                prod[i].VsrD(1) ^= bshift.VsrD(1);
+                prod[i].VsrD(0) ^= bshift.VsrD(0);
             }
         }
     }
 
-    r->u64[LO_IDX] = prod[0].u64[LO_IDX] ^ prod[1].u64[LO_IDX];
-    r->u64[HI_IDX] = prod[0].u64[HI_IDX] ^ prod[1].u64[HI_IDX];
+    r->VsrD(1) = prod[0].VsrD(1) ^ prod[1].VsrD(1);
+    r->VsrD(0) = prod[0].VsrD(0) ^ prod[1].VsrD(0);
 #endif
 }
 
@@ -1809,7 +1801,7 @@ VEXTU_X_DO(vextuwrx, 32, 0)
 #define VSHIFT(suffix, leftp)                                           \
     void helper_vs##suffix(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)    \
     {                                                                   \
-        int shift = b->u8[LO_IDX*15] & 0x7;                             \
+        int shift = b->VsrB(15) & 0x7;                                  \
         int doit = 1;                                                   \
         int i;                                                          \
                                                                         \
@@ -1820,15 +1812,15 @@ VEXTU_X_DO(vextuwrx, 32, 0)
             if (shift == 0) {                                           \
                 *r = *a;                                                \
             } else if (leftp) {                                         \
-                uint64_t carry = a->u64[LO_IDX] >> (64 - shift);        \
+                uint64_t carry = a->VsrD(1) >> (64 - shift);            \
                                                                         \
-                r->u64[HI_IDX] = (a->u64[HI_IDX] << shift) | carry;     \
-                r->u64[LO_IDX] = a->u64[LO_IDX] << shift;               \
+                r->VsrD(0) = (a->VsrD(0) << shift) | carry;             \
+                r->VsrD(1) = a->VsrD(1) << shift;                       \
             } else {                                                    \
-                uint64_t carry = a->u64[HI_IDX] << (64 - shift);        \
+                uint64_t carry = a->VsrD(0) << (64 - shift);            \
                                                                         \
-                r->u64[LO_IDX] = (a->u64[LO_IDX] >> shift) | carry;     \
-                r->u64[HI_IDX] = a->u64[HI_IDX] >> shift;               \
+                r->VsrD(1) = (a->VsrD(1) >> shift) | carry;             \
+                r->VsrD(0) = a->VsrD(0) >> shift;                       \
             }                                                           \
         }                                                               \
     }
@@ -1914,7 +1906,7 @@ void helper_vsldoi(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t shift)
 
 void helper_vslo(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 {
-    int sh = (b->u8[LO_IDX*0xf] >> 3) & 0xf;
+    int sh = (b->VsrB(0xf) >> 3) & 0xf;
 
 #if defined(HOST_WORDS_BIGENDIAN)
     memmove(&r->u8[0], &a->u8[sh], 16 - sh);
@@ -2110,7 +2102,7 @@ VSR(d, u64, 0x3F)
 
 void helper_vsro(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 {
-    int sh = (b->u8[LO_IDX * 0xf] >> 3) & 0xf;
+    int sh = (b->VsrB(0xf) >> 3) & 0xf;
 
 #if defined(HOST_WORDS_BIGENDIAN)
     memmove(&r->u8[sh], &a->u8[0], 16 - sh);
@@ -2366,13 +2358,13 @@ static inline void avr_qw_not(ppc_avr_t *t, ppc_avr_t a)
 
 static int avr_qw_cmpu(ppc_avr_t a, ppc_avr_t b)
 {
-    if (a.u64[HI_IDX] < b.u64[HI_IDX]) {
+    if (a.VsrD(0) < b.VsrD(0)) {
         return -1;
-    } else if (a.u64[HI_IDX] > b.u64[HI_IDX]) {
+    } else if (a.VsrD(0) > b.VsrD(0)) {
         return 1;
-    } else if (a.u64[LO_IDX] < b.u64[LO_IDX]) {
+    } else if (a.VsrD(1) < b.VsrD(1)) {
         return -1;
-    } else if (a.u64[LO_IDX] > b.u64[LO_IDX]) {
+    } else if (a.VsrD(1) > b.VsrD(1)) {
         return 1;
     } else {
         return 0;
@@ -2381,17 +2373,17 @@ static int avr_qw_cmpu(ppc_avr_t a, ppc_avr_t b)
 
 static void avr_qw_add(ppc_avr_t *t, ppc_avr_t a, ppc_avr_t b)
 {
-    t->u64[LO_IDX] = a.u64[LO_IDX] + b.u64[LO_IDX];
-    t->u64[HI_IDX] = a.u64[HI_IDX] + b.u64[HI_IDX] +
-                     (~a.u64[LO_IDX] < b.u64[LO_IDX]);
+    t->VsrD(1) = a.VsrD(1) + b.VsrD(1);
+    t->VsrD(0) = a.VsrD(0) + b.VsrD(0) +
+                     (~a.VsrD(1) < b.VsrD(1));
 }
 
 static int avr_qw_addc(ppc_avr_t *t, ppc_avr_t a, ppc_avr_t b)
 {
     ppc_avr_t not_a;
-    t->u64[LO_IDX] = a.u64[LO_IDX] + b.u64[LO_IDX];
-    t->u64[HI_IDX] = a.u64[HI_IDX] + b.u64[HI_IDX] +
-                     (~a.u64[LO_IDX] < b.u64[LO_IDX]);
+    t->VsrD(1) = a.VsrD(1) + b.VsrD(1);
+    t->VsrD(0) = a.VsrD(0) + b.VsrD(0) +
+                     (~a.VsrD(1) < b.VsrD(1));
     avr_qw_not(&not_a, a);
     return avr_qw_cmpu(not_a, b) < 0;
 }
@@ -2413,11 +2405,11 @@ void helper_vaddeuqm(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
     r->u128 = a->u128 + b->u128 + (c->u128 & 1);
 #else
 
-    if (c->u64[LO_IDX] & 1) {
+    if (c->VsrD(1) & 1) {
         ppc_avr_t tmp;
 
-        tmp.u64[HI_IDX] = 0;
-        tmp.u64[LO_IDX] = c->u64[LO_IDX] & 1;
+        tmp.VsrD(0) = 0;
+        tmp.VsrD(1) = c->VsrD(1) & 1;
         avr_qw_add(&tmp, *a, tmp);
         avr_qw_add(r, tmp, *b);
     } else {
@@ -2435,8 +2427,8 @@ void helper_vaddcuq(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 
     avr_qw_not(&not_a, *a);
 
-    r->u64[HI_IDX] = 0;
-    r->u64[LO_IDX] = (avr_qw_cmpu(not_a, *b) < 0);
+    r->VsrD(0) = 0;
+    r->VsrD(1) = (avr_qw_cmpu(not_a, *b) < 0);
 #endif
 }
 
@@ -2451,7 +2443,7 @@ void helper_vaddecuq(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
     r->u128 = carry_out;
 #else
 
-    int carry_in = c->u64[LO_IDX] & 1;
+    int carry_in = c->VsrD(1) & 1;
     int carry_out = 0;
     ppc_avr_t tmp;
 
@@ -2461,8 +2453,8 @@ void helper_vaddecuq(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
         ppc_avr_t one = QW_ONE;
         carry_out = avr_qw_addc(&tmp, tmp, one);
     }
-    r->u64[HI_IDX] = 0;
-    r->u64[LO_IDX] = carry_out;
+    r->VsrD(0) = 0;
+    r->VsrD(1) = carry_out;
 #endif
 }
 
@@ -2490,8 +2482,8 @@ void helper_vsubeuqm(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
     avr_qw_not(&tmp, *b);
     avr_qw_add(&sum, *a, tmp);
 
-    tmp.u64[HI_IDX] = 0;
-    tmp.u64[LO_IDX] = c->u64[LO_IDX] & 1;
+    tmp.VsrD(0) = 0;
+    tmp.VsrD(1) = c->VsrD(1) & 1;
     avr_qw_add(r, sum, tmp);
 #endif
 }
@@ -2507,10 +2499,10 @@ void helper_vsubcuq(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
         ppc_avr_t tmp;
         avr_qw_not(&tmp, *b);
         avr_qw_add(&tmp, *a, tmp);
-        carry = ((tmp.s64[HI_IDX] == -1ull) && (tmp.s64[LO_IDX] == -1ull));
+        carry = ((tmp.VsrSD(0) == -1ull) && (tmp.VsrSD(1) == -1ull));
     }
-    r->u64[HI_IDX] = 0;
-    r->u64[LO_IDX] = carry;
+    r->VsrD(0) = 0;
+    r->VsrD(1) = carry;
 #endif
 }
 
@@ -2521,17 +2513,17 @@ void helper_vsubecuq(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
         (~a->u128 < ~b->u128) ||
         ((c->u128 & 1) && (a->u128 + ~b->u128 == (__uint128_t)-1));
 #else
-    int carry_in = c->u64[LO_IDX] & 1;
+    int carry_in = c->VsrD(1) & 1;
     int carry_out = (avr_qw_cmpu(*a, *b) > 0);
     if (!carry_out && carry_in) {
         ppc_avr_t tmp;
         avr_qw_not(&tmp, *b);
         avr_qw_add(&tmp, *a, tmp);
-        carry_out = ((tmp.u64[HI_IDX] == -1ull) && (tmp.u64[LO_IDX] == -1ull));
+        carry_out = ((tmp.VsrD(0) == -1ull) && (tmp.VsrD(1) == -1ull));
     }
 
-    r->u64[HI_IDX] = 0;
-    r->u64[LO_IDX] = carry_out;
+    r->VsrD(0) = 0;
+    r->VsrD(1) = carry_out;
 #endif
 }
 
@@ -2629,7 +2621,7 @@ static bool bcd_is_valid(ppc_avr_t *bcd)
 
 static int bcd_cmp_zero(ppc_avr_t *bcd)
 {
-    if (bcd->u64[HI_IDX] == 0 && (bcd->u64[LO_IDX] >> 4) == 0) {
+    if (bcd->VsrD(0) == 0 && (bcd->VsrD(1) >> 4) == 0) {
         return CRF_EQ;
     } else {
         return (bcd_get_sgn(bcd) == 1) ? CRF_GT : CRF_LT;
@@ -2749,7 +2741,7 @@ uint32_t helper_bcdadd(ppc_avr_t *r,  ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
     }
 
     if (unlikely(invalid)) {
-        result.u64[HI_IDX] = result.u64[LO_IDX] = -1;
+        result.VsrD(0) = result.VsrD(1) = -1;
         cr = CRF_SO;
     } else if (overflow) {
         cr |= CRF_SO;
@@ -2818,7 +2810,7 @@ uint32_t helper_bcdctn(ppc_avr_t *r, ppc_avr_t *b, uint32_t ps)
     int invalid = (sgnb == 0);
     ppc_avr_t ret = { .u64 = { 0, 0 } };
 
-    int ox_flag = (b->u64[HI_IDX] != 0) || ((b->u64[LO_IDX] >> 32) != 0);
+    int ox_flag = (b->VsrD(0) != 0) || ((b->VsrD(1) >> 32) != 0);
 
     for (i = 1; i < 8; i++) {
         set_national_digit(&ret, 0x30 + bcd_get_digit(b, i, &invalid), i);
@@ -2898,7 +2890,7 @@ uint32_t helper_bcdctz(ppc_avr_t *r, ppc_avr_t *b, uint32_t ps)
     int invalid = (sgnb == 0);
     ppc_avr_t ret = { .u64 = { 0, 0 } };
 
-    int ox_flag = ((b->u64[HI_IDX] >> 4) != 0);
+    int ox_flag = ((b->VsrD(0) >> 4) != 0);
 
     for (i = 0; i < 16; i++) {
         digit = bcd_get_digit(b, i + 1, &invalid);
@@ -2939,13 +2931,13 @@ uint32_t helper_bcdcfsq(ppc_avr_t *r, ppc_avr_t *b, uint32_t ps)
     uint64_t hi_value;
     ppc_avr_t ret = { .u64 = { 0, 0 } };
 
-    if (b->s64[HI_IDX] < 0) {
-        lo_value = -b->s64[LO_IDX];
-        hi_value = ~b->u64[HI_IDX] + !lo_value;
+    if (b->VsrSD(0) < 0) {
+        lo_value = -b->VsrSD(1);
+        hi_value = ~b->VsrD(0) + !lo_value;
         bcd_put_digit(&ret, 0xD, 0);
     } else {
-        lo_value = b->u64[LO_IDX];
-        hi_value = b->u64[HI_IDX];
+        lo_value = b->VsrD(1);
+        hi_value = b->VsrD(0);
         bcd_put_digit(&ret, bcd_preferred_sgn(0, ps), 0);
     }
 
@@ -2993,11 +2985,11 @@ uint32_t helper_bcdctsq(ppc_avr_t *r, ppc_avr_t *b, uint32_t ps)
     }
 
     if (sgnb == -1) {
-        r->s64[LO_IDX] = -lo_value;
-        r->s64[HI_IDX] = ~hi_value + !r->s64[LO_IDX];
+        r->VsrSD(1) = -lo_value;
+        r->VsrSD(0) = ~hi_value + !r->VsrSD(1);
     } else {
-        r->s64[LO_IDX] = lo_value;
-        r->s64[HI_IDX] = hi_value;
+        r->VsrSD(1) = lo_value;
+        r->VsrSD(0) = hi_value;
     }
 
     cr = bcd_cmp_zero(b);
@@ -3057,7 +3049,7 @@ uint32_t helper_bcds(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
     bool ox_flag = false;
     int sgnb = bcd_get_sgn(b);
     ppc_avr_t ret = *b;
-    ret.u64[LO_IDX] &= ~0xf;
+    ret.VsrD(1) &= ~0xf;
 
     if (bcd_is_valid(b) == false) {
         return CRF_SO;
@@ -3070,9 +3062,9 @@ uint32_t helper_bcds(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
     }
 
     if (i > 0) {
-        ulshift(&ret.u64[LO_IDX], &ret.u64[HI_IDX], i * 4, &ox_flag);
+        ulshift(&ret.VsrD(1), &ret.VsrD(0), i * 4, &ox_flag);
     } else {
-        urshift(&ret.u64[LO_IDX], &ret.u64[HI_IDX], -i * 4);
+        urshift(&ret.VsrD(1), &ret.VsrD(0), -i * 4);
     }
     bcd_put_digit(&ret, bcd_preferred_sgn(sgnb, ps), 0);
 
@@ -3109,13 +3101,13 @@ uint32_t helper_bcdus(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
 #endif
     if (i >= 32) {
         ox_flag = true;
-        ret.u64[LO_IDX] = ret.u64[HI_IDX] = 0;
+        ret.VsrD(1) = ret.VsrD(0) = 0;
     } else if (i <= -32) {
-        ret.u64[LO_IDX] = ret.u64[HI_IDX] = 0;
+        ret.VsrD(1) = ret.VsrD(0) = 0;
     } else if (i > 0) {
-        ulshift(&ret.u64[LO_IDX], &ret.u64[HI_IDX], i * 4, &ox_flag);
+        ulshift(&ret.VsrD(1), &ret.VsrD(0), i * 4, &ox_flag);
     } else {
-        urshift(&ret.u64[LO_IDX], &ret.u64[HI_IDX], -i * 4);
+        urshift(&ret.VsrD(1), &ret.VsrD(0), -i * 4);
     }
     *r = ret;
 
@@ -3135,7 +3127,7 @@ uint32_t helper_bcdsr(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
     bool ox_flag = false;
     int sgnb = bcd_get_sgn(b);
     ppc_avr_t ret = *b;
-    ret.u64[LO_IDX] &= ~0xf;
+    ret.VsrD(1) &= ~0xf;
 
 #if defined(HOST_WORDS_BIGENDIAN)
     int i = a->s8[7];
@@ -3156,9 +3148,9 @@ uint32_t helper_bcdsr(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
     }
 
     if (i > 0) {
-        ulshift(&ret.u64[LO_IDX], &ret.u64[HI_IDX], i * 4, &ox_flag);
+        ulshift(&ret.VsrD(1), &ret.VsrD(0), i * 4, &ox_flag);
     } else {
-        urshift(&ret.u64[LO_IDX], &ret.u64[HI_IDX], -i * 4);
+        urshift(&ret.VsrD(1), &ret.VsrD(0), -i * 4);
 
         if (bcd_get_digit(&ret, 0, &invalid) >= 5) {
             bcd_add_mag(&ret, &ret, &bcd_one, &invalid, &unused);
@@ -3192,19 +3184,19 @@ uint32_t helper_bcdtrunc(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
 
     if (i > 16 && i < 32) {
         mask = (uint64_t)-1 >> (128 - i * 4);
-        if (ret.u64[HI_IDX] & ~mask) {
+        if (ret.VsrD(0) & ~mask) {
             ox_flag = CRF_SO;
         }
 
-        ret.u64[HI_IDX] &= mask;
+        ret.VsrD(0) &= mask;
     } else if (i >= 0 && i <= 16) {
         mask = (uint64_t)-1 >> (64 - i * 4);
-        if (ret.u64[HI_IDX] || (ret.u64[LO_IDX] & ~mask)) {
+        if (ret.VsrD(0) || (ret.VsrD(1) & ~mask)) {
             ox_flag = CRF_SO;
         }
 
-        ret.u64[LO_IDX] &= mask;
-        ret.u64[HI_IDX] = 0;
+        ret.VsrD(1) &= mask;
+        ret.VsrD(0) = 0;
     }
     bcd_put_digit(&ret, bcd_preferred_sgn(bcd_get_sgn(b), ps), 0);
     *r = ret;
@@ -3235,28 +3227,28 @@ uint32_t helper_bcdutrunc(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t ps)
 #endif
     if (i > 16 && i < 33) {
         mask = (uint64_t)-1 >> (128 - i * 4);
-        if (ret.u64[HI_IDX] & ~mask) {
+        if (ret.VsrD(0) & ~mask) {
             ox_flag = CRF_SO;
         }
 
-        ret.u64[HI_IDX] &= mask;
+        ret.VsrD(0) &= mask;
     } else if (i > 0 && i <= 16) {
         mask = (uint64_t)-1 >> (64 - i * 4);
-        if (ret.u64[HI_IDX] || (ret.u64[LO_IDX] & ~mask)) {
+        if (ret.VsrD(0) || (ret.VsrD(1) & ~mask)) {
             ox_flag = CRF_SO;
         }
 
-        ret.u64[LO_IDX] &= mask;
-        ret.u64[HI_IDX] = 0;
+        ret.VsrD(1) &= mask;
+        ret.VsrD(0) = 0;
     } else if (i == 0) {
-        if (ret.u64[HI_IDX] || ret.u64[LO_IDX]) {
+        if (ret.VsrD(0) || ret.VsrD(1)) {
             ox_flag = CRF_SO;
         }
-        ret.u64[HI_IDX] = ret.u64[LO_IDX] = 0;
+        ret.VsrD(0) = ret.VsrD(1) = 0;
     }
 
     *r = ret;
-    if (r->u64[HI_IDX] == 0 && r->u64[LO_IDX] == 0) {
+    if (r->VsrD(0) == 0 && r->VsrD(1) == 0) {
         return ox_flag | CRF_EQ;
     }
 
@@ -3428,8 +3420,6 @@ void helper_vpermxor(ppc_avr_t *r,  ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
 }
 
 #undef VECTOR_FOR_INORDER_I
-#undef HI_IDX
-#undef LO_IDX
 
 /*****************************************************************************/
 /* SPE extension helpers */
-- 
2.11.0


* [Qemu-devel] [PATCH v3 5/8] target/ppc: eliminate use of EL_IDX macros from int_helper.c
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
                   ` (3 preceding siblings ...)
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 4/8] target/ppc: eliminate use of HI_IDX and LO_IDX macros from int_helper.c Mark Cave-Ayland
@ 2019-01-27  9:03 ` Mark Cave-Ayland
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 6/8] target/ppc: simplify VEXT_SIGNED macro in int_helper.c Mark Cave-Ayland
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:03 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

These macros can be eliminated by instead using the relevant Vsr* macros in
the few locations where they appear.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/ppc/int_helper.c | 66 ++++++++++++++++++++-----------------------------
 1 file changed, 27 insertions(+), 39 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index addbe54c21..4cc7fdfd25 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -3320,12 +3320,7 @@ void helper_vncipherlast(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
     *r = result;
 }
 
-#define ROTRu32(v, n) (((v) >> (n)) | ((v) << (32-n)))
-#if defined(HOST_WORDS_BIGENDIAN)
-#define EL_IDX(i) (i)
-#else
-#define EL_IDX(i) (3 - (i))
-#endif
+#define ROTRu32(v, n) (((v) >> (n)) | ((v) << (32 - n)))
 
 void helper_vshasigmaw(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
 {
@@ -3333,40 +3328,34 @@ void helper_vshasigmaw(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
     int six = st_six & 0xF;
     int i;
 
-    VECTOR_FOR_INORDER_I(i, u32) {
+    for (i = 0; i < ARRAY_SIZE(r->u32); i++) {
         if (st == 0) {
             if ((six & (0x8 >> i)) == 0) {
-                r->u32[EL_IDX(i)] = ROTRu32(a->u32[EL_IDX(i)], 7) ^
-                                    ROTRu32(a->u32[EL_IDX(i)], 18) ^
-                                    (a->u32[EL_IDX(i)] >> 3);
+                r->VsrW(i) = ROTRu32(a->VsrW(i), 7) ^
+                             ROTRu32(a->VsrW(i), 18) ^
+                             (a->VsrW(i) >> 3);
             } else { /* six.bit[i] == 1 */
-                r->u32[EL_IDX(i)] = ROTRu32(a->u32[EL_IDX(i)], 17) ^
-                                    ROTRu32(a->u32[EL_IDX(i)], 19) ^
-                                    (a->u32[EL_IDX(i)] >> 10);
+                r->VsrW(i) = ROTRu32(a->VsrW(i), 17) ^
+                             ROTRu32(a->VsrW(i), 19) ^
+                             (a->VsrW(i) >> 10);
             }
         } else { /* st == 1 */
             if ((six & (0x8 >> i)) == 0) {
-                r->u32[EL_IDX(i)] = ROTRu32(a->u32[EL_IDX(i)], 2) ^
-                                    ROTRu32(a->u32[EL_IDX(i)], 13) ^
-                                    ROTRu32(a->u32[EL_IDX(i)], 22);
+                r->VsrW(i) = ROTRu32(a->VsrW(i), 2) ^
+                             ROTRu32(a->VsrW(i), 13) ^
+                             ROTRu32(a->VsrW(i), 22);
             } else { /* six.bit[i] == 1 */
-                r->u32[EL_IDX(i)] = ROTRu32(a->u32[EL_IDX(i)], 6) ^
-                                    ROTRu32(a->u32[EL_IDX(i)], 11) ^
-                                    ROTRu32(a->u32[EL_IDX(i)], 25);
+                r->VsrW(i) = ROTRu32(a->VsrW(i), 6) ^
+                             ROTRu32(a->VsrW(i), 11) ^
+                             ROTRu32(a->VsrW(i), 25);
             }
         }
     }
 }
 
 #undef ROTRu32
-#undef EL_IDX
 
 #define ROTRu64(v, n) (((v) >> (n)) | ((v) << (64-n)))
-#if defined(HOST_WORDS_BIGENDIAN)
-#define EL_IDX(i) (i)
-#else
-#define EL_IDX(i) (1 - (i))
-#endif
 
 void helper_vshasigmad(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
 {
@@ -3374,33 +3363,32 @@ void helper_vshasigmad(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
     int six = st_six & 0xF;
     int i;
 
-    VECTOR_FOR_INORDER_I(i, u64) {
+    for (i = 0; i < ARRAY_SIZE(r->u64); i++) {
         if (st == 0) {
             if ((six & (0x8 >> (2*i))) == 0) {
-                r->u64[EL_IDX(i)] = ROTRu64(a->u64[EL_IDX(i)], 1) ^
-                                    ROTRu64(a->u64[EL_IDX(i)], 8) ^
-                                    (a->u64[EL_IDX(i)] >> 7);
+                r->VsrD(i) = ROTRu64(a->VsrD(i), 1) ^
+                             ROTRu64(a->VsrD(i), 8) ^
+                             (a->VsrD(i) >> 7);
             } else { /* six.bit[2*i] == 1 */
-                r->u64[EL_IDX(i)] = ROTRu64(a->u64[EL_IDX(i)], 19) ^
-                                    ROTRu64(a->u64[EL_IDX(i)], 61) ^
-                                    (a->u64[EL_IDX(i)] >> 6);
+                r->VsrD(i) = ROTRu64(a->VsrD(i), 19) ^
+                             ROTRu64(a->VsrD(i), 61) ^
+                             (a->VsrD(i) >> 6);
             }
         } else { /* st == 1 */
             if ((six & (0x8 >> (2*i))) == 0) {
-                r->u64[EL_IDX(i)] = ROTRu64(a->u64[EL_IDX(i)], 28) ^
-                                    ROTRu64(a->u64[EL_IDX(i)], 34) ^
-                                    ROTRu64(a->u64[EL_IDX(i)], 39);
+                r->VsrD(i) = ROTRu64(a->VsrD(i), 28) ^
+                             ROTRu64(a->VsrD(i), 34) ^
+                             ROTRu64(a->VsrD(i), 39);
             } else { /* six.bit[2*i] == 1 */
-                r->u64[EL_IDX(i)] = ROTRu64(a->u64[EL_IDX(i)], 14) ^
-                                    ROTRu64(a->u64[EL_IDX(i)], 18) ^
-                                    ROTRu64(a->u64[EL_IDX(i)], 41);
+                r->VsrD(i) = ROTRu64(a->VsrD(i), 14) ^
+                             ROTRu64(a->VsrD(i), 18) ^
+                             ROTRu64(a->VsrD(i), 41);
             }
         }
     }
 }
 
 #undef ROTRu64
-#undef EL_IDX
 
 void helper_vpermxor(ppc_avr_t *r,  ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
 {
-- 
2.11.0


* [Qemu-devel] [PATCH v3 6/8] target/ppc: simplify VEXT_SIGNED macro in int_helper.c
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
                   ` (4 preceding siblings ...)
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 5/8] target/ppc: eliminate use of EL_IDX " Mark Cave-Ayland
@ 2019-01-27  9:03 ` Mark Cave-Ayland
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 7/8] target/ppc: remove ROTRu32 and ROTRu64 macros from int_helper.c Mark Cave-Ayland
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 8/8] target/ppc: remove various HOST_WORDS_BIGENDIAN hacks in int_helper.c Mark Cave-Ayland
  7 siblings, 0 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:03 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

As pointed out by Richard, the macro needs neither the mask argument nor the
recast argument: the masking is implied by the cast argument, and the recast
is implied by the assignment.
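
For example, expanding the macro for vextsb2w, the old and new bodies reduce
to the same thing:

    /* old: mask, cast, then recast */
    r->s32[i] = (int32_t)((int8_t)(b->s32[i] & UINT8_MAX));

    /* new: the int8_t cast already selects the low byte, and assigning
     * back to s32 performs the sign extension */
    r->s32[i] = (int8_t)b->s32[i];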

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/ppc/int_helper.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 4cc7fdfd25..39f6c96543 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -2038,19 +2038,19 @@ void helper_xxinsertw(CPUPPCState *env, target_ulong xtn,
     putVSR(xtn, &xt, env);
 }
 
-#define VEXT_SIGNED(name, element, mask, cast, recast)              \
+#define VEXT_SIGNED(name, element, cast)                            \
 void helper_##name(ppc_avr_t *r, ppc_avr_t *b)                      \
 {                                                                   \
     int i;                                                          \
     VECTOR_FOR_INORDER_I(i, element) {                              \
-        r->element[i] = (recast)((cast)(b->element[i] & mask));     \
+        r->element[i] = (cast)b->element[i];                        \
     }                                                               \
 }
-VEXT_SIGNED(vextsb2w, s32, UINT8_MAX, int8_t, int32_t)
-VEXT_SIGNED(vextsb2d, s64, UINT8_MAX, int8_t, int64_t)
-VEXT_SIGNED(vextsh2w, s32, UINT16_MAX, int16_t, int32_t)
-VEXT_SIGNED(vextsh2d, s64, UINT16_MAX, int16_t, int64_t)
-VEXT_SIGNED(vextsw2d, s64, UINT32_MAX, int32_t, int64_t)
+VEXT_SIGNED(vextsb2w, s32, int8_t)
+VEXT_SIGNED(vextsb2d, s64, int8_t)
+VEXT_SIGNED(vextsh2w, s32, int16_t)
+VEXT_SIGNED(vextsh2d, s64, int16_t)
+VEXT_SIGNED(vextsw2d, s64, int32_t)
 #undef VEXT_SIGNED
 
 #define VNEG(name, element)                                         \
-- 
2.11.0


* [Qemu-devel] [PATCH v3 7/8] target/ppc: remove ROTRu32 and ROTRu64 macros from int_helper.c
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
                   ` (5 preceding siblings ...)
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 6/8] target/ppc: simplify VEXT_SIGNED macro in int_helper.c Mark Cave-Ayland
@ 2019-01-27  9:03 ` Mark Cave-Ayland
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 8/8] target/ppc: remove various HOST_WORDS_BIGENDIAN hacks in int_helper.c Mark Cave-Ayland
  7 siblings, 0 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:03 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

Richard points out that these macros suffer from a -fsanitize=shift bug in that
they improperly handle n == 0, turning it into a shift by 32/64 respectively.
Replace them with QEMU's existing ror32() and ror64() functions instead.
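
A minimal illustration of the problem (the values are only for the sake of
example; ror32() is the existing helper from include/qemu/bitops.h):

    uint32_t v = 0x80000001;

    /* ROTRu32(v, 0) expands to (v >> 0) | (v << (32 - 0)): shifting a
     * 32-bit value by 32 is undefined behaviour, which is exactly what
     * -fsanitize=shift reports. */

    uint32_t r = ror32(v, 0);   /* well defined; a rotate by 0 returns v */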

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/ppc/int_helper.c | 48 ++++++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 28 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 39f6c96543..c3eba9ed41 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -3320,8 +3320,6 @@ void helper_vncipherlast(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
     *r = result;
 }
 
-#define ROTRu32(v, n) (((v) >> (n)) | ((v) << (32 - n)))
-
 void helper_vshasigmaw(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
 {
     int st = (st_six & 0x10) != 0;
@@ -3331,32 +3329,28 @@ void helper_vshasigmaw(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
     for (i = 0; i < ARRAY_SIZE(r->u32); i++) {
         if (st == 0) {
             if ((six & (0x8 >> i)) == 0) {
-                r->VsrW(i) = ROTRu32(a->VsrW(i), 7) ^
-                             ROTRu32(a->VsrW(i), 18) ^
+                r->VsrW(i) = ror32(a->VsrW(i), 7) ^
+                             ror32(a->VsrW(i), 18) ^
                              (a->VsrW(i) >> 3);
             } else { /* six.bit[i] == 1 */
-                r->VsrW(i) = ROTRu32(a->VsrW(i), 17) ^
-                             ROTRu32(a->VsrW(i), 19) ^
+                r->VsrW(i) = ror32(a->VsrW(i), 17) ^
+                             ror32(a->VsrW(i), 19) ^
                              (a->VsrW(i) >> 10);
             }
         } else { /* st == 1 */
             if ((six & (0x8 >> i)) == 0) {
-                r->VsrW(i) = ROTRu32(a->VsrW(i), 2) ^
-                             ROTRu32(a->VsrW(i), 13) ^
-                             ROTRu32(a->VsrW(i), 22);
+                r->VsrW(i) = ror32(a->VsrW(i), 2) ^
+                             ror32(a->VsrW(i), 13) ^
+                             ror32(a->VsrW(i), 22);
             } else { /* six.bit[i] == 1 */
-                r->VsrW(i) = ROTRu32(a->VsrW(i), 6) ^
-                             ROTRu32(a->VsrW(i), 11) ^
-                             ROTRu32(a->VsrW(i), 25);
+                r->VsrW(i) = ror32(a->VsrW(i), 6) ^
+                             ror32(a->VsrW(i), 11) ^
+                             ror32(a->VsrW(i), 25);
             }
         }
     }
 }
 
-#undef ROTRu32
-
-#define ROTRu64(v, n) (((v) >> (n)) | ((v) << (64-n)))
-
 void helper_vshasigmad(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
 {
     int st = (st_six & 0x10) != 0;
@@ -3366,30 +3360,28 @@ void helper_vshasigmad(ppc_avr_t *r,  ppc_avr_t *a, uint32_t st_six)
     for (i = 0; i < ARRAY_SIZE(r->u64); i++) {
         if (st == 0) {
             if ((six & (0x8 >> (2*i))) == 0) {
-                r->VsrD(i) = ROTRu64(a->VsrD(i), 1) ^
-                             ROTRu64(a->VsrD(i), 8) ^
+                r->VsrD(i) = ror64(a->VsrD(i), 1) ^
+                             ror64(a->VsrD(i), 8) ^
                              (a->VsrD(i) >> 7);
             } else { /* six.bit[2*i] == 1 */
-                r->VsrD(i) = ROTRu64(a->VsrD(i), 19) ^
-                             ROTRu64(a->VsrD(i), 61) ^
+                r->VsrD(i) = ror64(a->VsrD(i), 19) ^
+                             ror64(a->VsrD(i), 61) ^
                              (a->VsrD(i) >> 6);
             }
         } else { /* st == 1 */
             if ((six & (0x8 >> (2*i))) == 0) {
-                r->VsrD(i) = ROTRu64(a->VsrD(i), 28) ^
-                             ROTRu64(a->VsrD(i), 34) ^
-                             ROTRu64(a->VsrD(i), 39);
+                r->VsrD(i) = ror64(a->VsrD(i), 28) ^
+                             ror64(a->VsrD(i), 34) ^
+                             ror64(a->VsrD(i), 39);
             } else { /* six.bit[2*i] == 1 */
-                r->VsrD(i) = ROTRu64(a->VsrD(i), 14) ^
-                             ROTRu64(a->VsrD(i), 18) ^
-                             ROTRu64(a->VsrD(i), 41);
+                r->VsrD(i) = ror64(a->VsrD(i), 14) ^
+                             ror64(a->VsrD(i), 18) ^
+                             ror64(a->VsrD(i), 41);
             }
         }
     }
 }
 
-#undef ROTRu64
-
 void helper_vpermxor(ppc_avr_t *r,  ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
 {
     ppc_avr_t result;
-- 
2.11.0


* [Qemu-devel] [PATCH v3 8/8] target/ppc: remove various HOST_WORDS_BIGENDIAN hacks in int_helper.c
  2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
                   ` (6 preceding siblings ...)
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 7/8] target/ppc: remove ROTRu32 and ROTRu64 macros from int_helper.c Mark Cave-Ayland
@ 2019-01-27  9:03 ` Mark Cave-Ayland
  7 siblings, 0 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27  9:03 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, richard.henderson, david

Following on from the previous work, there are numerous endian-related hacks
in int_helper.c that can now be replaced with Vsr* macros.

There are also a few places where the VECTOR_FOR_INORDER_I macro can be
replaced with a normal iterator since the processing order is irrelevant.
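
For reference, VECTOR_FOR_INORDER_I only differs between hosts in the
direction it walks the element array (the little-endian expansion below is
quoted from memory and is only illustrative):

    #if defined(HOST_WORDS_BIGENDIAN)
    #define VECTOR_FOR_INORDER_I(index, element)                    \
        for (index = 0; index < ARRAY_SIZE(r->element); index++)
    #else
    #define VECTOR_FOR_INORDER_I(index, element)                    \
        for (index = ARRAY_SIZE(r->element) - 1; index >= 0; index--)
    #endif

so wherever each loop iteration only touches its own element, a plain forward
loop is equivalent.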

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/ppc/int_helper.c | 155 ++++++++++++++----------------------------------
 1 file changed, 45 insertions(+), 110 deletions(-)

diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index c3eba9ed41..3b4cf248e9 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -443,8 +443,8 @@ void helper_lvsl(ppc_avr_t *r, target_ulong sh)
 {
     int i, j = (sh & 0xf);
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        r->u8[i] = j++;
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        r->VsrB(i) = j++;
     }
 }
 
@@ -452,18 +452,14 @@ void helper_lvsr(ppc_avr_t *r, target_ulong sh)
 {
     int i, j = 0x10 - (sh & 0xf);
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        r->u8[i] = j++;
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        r->VsrB(i) = j++;
     }
 }
 
 void helper_mtvscr(CPUPPCState *env, ppc_avr_t *r)
 {
-#if defined(HOST_WORDS_BIGENDIAN)
-    env->vscr = r->u32[3];
-#else
-    env->vscr = r->u32[0];
-#endif
+    env->vscr = r->VsrW(3);
     set_flush_to_zero(vscr_nj, &env->vec_status);
 }
 
@@ -870,8 +866,8 @@ target_ulong helper_vclzlsbb(ppc_avr_t *r)
 {
     target_ulong count = 0;
     int i;
-    VECTOR_FOR_INORDER_I(i, u8) {
-        if (r->u8[i] & 0x01) {
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        if (r->VsrB(i) & 0x01) {
             break;
         }
         count++;
@@ -883,12 +879,8 @@ target_ulong helper_vctzlsbb(ppc_avr_t *r)
 {
     target_ulong count = 0;
     int i;
-#if defined(HOST_WORDS_BIGENDIAN)
     for (i = ARRAY_SIZE(r->u8) - 1; i >= 0; i--) {
-#else
-    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
-#endif
-        if (r->u8[i] & 0x01) {
+        if (r->VsrB(i) & 0x01) {
             break;
         }
         count++;
@@ -1151,18 +1143,14 @@ void helper_vperm(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b,
     ppc_avr_t result;
     int i;
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        int s = c->u8[i] & 0x1f;
-#if defined(HOST_WORDS_BIGENDIAN)
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        int s = c->VsrB(i) & 0x1f;
         int index = s & 0xf;
-#else
-        int index = 15 - (s & 0xf);
-#endif
 
         if (s & 0x10) {
-            result.u8[i] = b->u8[index];
+            result.VsrB(i) = b->VsrB(index);
         } else {
-            result.u8[i] = a->u8[index];
+            result.VsrB(i) = a->VsrB(index);
         }
     }
     *r = result;
@@ -1174,18 +1162,14 @@ void helper_vpermr(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b,
     ppc_avr_t result;
     int i;
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        int s = c->u8[i] & 0x1f;
-#if defined(HOST_WORDS_BIGENDIAN)
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        int s = c->VsrB(i) & 0x1f;
         int index = 15 - (s & 0xf);
-#else
-        int index = s & 0xf;
-#endif
 
         if (s & 0x10) {
-            result.u8[i] = a->u8[index];
+            result.VsrB(i) = a->VsrB(index);
         } else {
-            result.u8[i] = b->u8[index];
+            result.VsrB(i) = b->VsrB(index);
         }
     }
     *r = result;
@@ -1882,25 +1866,14 @@ void helper_vsldoi(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, uint32_t shift)
     int i;
     ppc_avr_t result;
 
-#if defined(HOST_WORDS_BIGENDIAN)
     for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
         int index = sh + i;
         if (index > 0xf) {
-            result.u8[i] = b->u8[index - 0x10];
-        } else {
-            result.u8[i] = a->u8[index];
-        }
-    }
-#else
-    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
-        int index = (16 - sh) + i;
-        if (index > 0xf) {
-            result.u8[i] = a->u8[index - 0x10];
+            result.VsrB(i) = b->VsrB(index - 0x10);
         } else {
-            result.u8[i] = b->u8[index];
+            result.VsrB(i) = a->VsrB(index);
         }
     }
-#endif
     *r = result;
 }
 
@@ -1919,25 +1892,20 @@ void helper_vslo(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
 
 /* Experimental testing shows that hardware masks the immediate.  */
 #define _SPLAT_MASKED(element) (splat & (ARRAY_SIZE(r->element) - 1))
-#if defined(HOST_WORDS_BIGENDIAN)
 #define SPLAT_ELEMENT(element) _SPLAT_MASKED(element)
-#else
-#define SPLAT_ELEMENT(element)                                  \
-    (ARRAY_SIZE(r->element) - 1 - _SPLAT_MASKED(element))
-#endif
-#define VSPLT(suffix, element)                                          \
+#define VSPLT(suffix, element, access)                                  \
     void helper_vsplt##suffix(ppc_avr_t *r, ppc_avr_t *b, uint32_t splat) \
     {                                                                   \
-        uint32_t s = b->element[SPLAT_ELEMENT(element)];                \
+        uint32_t s = b->access(SPLAT_ELEMENT(element));                 \
         int i;                                                          \
                                                                         \
         for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
-            r->element[i] = s;                                          \
+            r->access(i) = s;                                           \
         }                                                               \
     }
-VSPLT(b, u8)
-VSPLT(h, u16)
-VSPLT(w, u32)
+VSPLT(b, u8, VsrB)
+VSPLT(h, u16, VsrH)
+VSPLT(w, u32, VsrW)
 #undef VSPLT
 #undef SPLAT_ELEMENT
 #undef _SPLAT_MASKED
@@ -1998,17 +1966,10 @@ void helper_xxextractuw(CPUPPCState *env, target_ulong xtn,
     getVSR(xbn, &xb, env);
     memset(&xt, 0, sizeof(xt));
 
-#if defined(HOST_WORDS_BIGENDIAN)
     ext_index = index;
     for (i = 0; i < es; i++, ext_index++) {
-        xt.u8[8 - es + i] = xb.u8[ext_index % 16];
-    }
-#else
-    ext_index = 15 - index;
-    for (i = es - 1; i >= 0; i--, ext_index--) {
-        xt.u8[8 + i] = xb.u8[ext_index % 16];
+        xt.VsrB(8 - es + i) = xb.VsrB(ext_index % 16);
     }
-#endif
 
     putVSR(xtn, &xt, env);
 }
@@ -2023,17 +1984,10 @@ void helper_xxinsertw(CPUPPCState *env, target_ulong xtn,
     getVSR(xbn, &xb, env);
     getVSR(xtn, &xt, env);
 
-#if defined(HOST_WORDS_BIGENDIAN)
     ins_index = index;
     for (i = 0; i < es && ins_index < 16; i++, ins_index++) {
-        xt.u8[ins_index] = xb.u8[8 - es + i];
-    }
-#else
-    ins_index = 15 - index;
-    for (i = es - 1; i >= 0 && ins_index >= 0; i--, ins_index--) {
-        xt.u8[ins_index] = xb.u8[8 + i];
+        xt.VsrB(ins_index) = xb.VsrB(8 - es + i);
     }
-#endif
 
     putVSR(xtn, &xt, env);
 }
@@ -2042,7 +1996,7 @@ void helper_xxinsertw(CPUPPCState *env, target_ulong xtn,
 void helper_##name(ppc_avr_t *r, ppc_avr_t *b)                      \
 {                                                                   \
     int i;                                                          \
-    VECTOR_FOR_INORDER_I(i, element) {                              \
+    for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
         r->element[i] = (cast)b->element[i];                        \
     }                                                               \
 }
@@ -2057,7 +2011,7 @@ VEXT_SIGNED(vextsw2d, s64, int32_t)
 void helper_##name(ppc_avr_t *r, ppc_avr_t *b)                      \
 {                                                                   \
     int i;                                                          \
-    VECTOR_FOR_INORDER_I(i, element) {                              \
+    for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
         r->element[i] = -b->element[i];                             \
     }                                                               \
 }
@@ -2129,17 +2083,13 @@ void helper_vsumsws(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
     ppc_avr_t result;
     int sat = 0;
 
-#if defined(HOST_WORDS_BIGENDIAN)
-    upper = ARRAY_SIZE(r->s32)-1;
-#else
-    upper = 0;
-#endif
-    t = (int64_t)b->s32[upper];
+    upper = ARRAY_SIZE(r->s32) - 1;
+    t = (int64_t)b->VsrSW(upper);
     for (i = 0; i < ARRAY_SIZE(r->s32); i++) {
-        t += a->s32[i];
-        result.s32[i] = 0;
+        t += a->VsrSW(i);
+        result.VsrSW(i) = 0;
     }
-    result.s32[upper] = cvtsdsw(t, &sat);
+    result.VsrSW(upper) = cvtsdsw(t, &sat);
     *r = result;
 
     if (sat) {
@@ -2153,19 +2103,15 @@ void helper_vsum2sws(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
     ppc_avr_t result;
     int sat = 0;
 
-#if defined(HOST_WORDS_BIGENDIAN)
     upper = 1;
-#else
-    upper = 0;
-#endif
     for (i = 0; i < ARRAY_SIZE(r->u64); i++) {
-        int64_t t = (int64_t)b->s32[upper + i * 2];
+        int64_t t = (int64_t)b->VsrSW(upper + i * 2);
 
-        result.u64[i] = 0;
+        result.VsrW(i) = 0;
         for (j = 0; j < ARRAY_SIZE(r->u64); j++) {
-            t += a->s32[2 * i + j];
+            t += a->VsrSW(2 * i + j);
         }
-        result.s32[upper + i * 2] = cvtsdsw(t, &sat);
+        result.VsrSW(upper + i * 2) = cvtsdsw(t, &sat);
     }
 
     *r = result;
@@ -2290,7 +2236,7 @@ VUPK(lsw, s64, s32, UPKLO)
     {                                                                   \
         int i;                                                          \
                                                                         \
-        VECTOR_FOR_INORDER_I(i, element) {                              \
+        for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
             r->element[i] = name(b->element[i]);                        \
         }                                                               \
     }
@@ -2630,20 +2576,12 @@ static int bcd_cmp_zero(ppc_avr_t *bcd)
 
 static uint16_t get_national_digit(ppc_avr_t *reg, int n)
 {
-#if defined(HOST_WORDS_BIGENDIAN)
-    return reg->u16[7 - n];
-#else
-    return reg->u16[n];
-#endif
+    return reg->VsrH(7 - n);
 }
 
 static void set_national_digit(ppc_avr_t *reg, uint8_t val, int n)
 {
-#if defined(HOST_WORDS_BIGENDIAN)
-    reg->u16[7 - n] = val;
-#else
-    reg->u16[n] = val;
-#endif
+    reg->VsrH(7 - n) = val;
 }
 
 static int bcd_cmp_mag(ppc_avr_t *a, ppc_avr_t *b)
@@ -3387,14 +3325,11 @@ void helper_vpermxor(ppc_avr_t *r,  ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
     ppc_avr_t result;
     int i;
 
-    VECTOR_FOR_INORDER_I(i, u8) {
-        int indexA = c->u8[i] >> 4;
-        int indexB = c->u8[i] & 0xF;
-#if defined(HOST_WORDS_BIGENDIAN)
-        result.u8[i] = a->u8[indexA] ^ b->u8[indexB];
-#else
-        result.u8[i] = a->u8[15-indexA] ^ b->u8[15-indexB];
-#endif
+    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
+        int indexA = c->VsrB(i) >> 4;
+        int indexB = c->VsrB(i) & 0xF;
+
+        result.VsrB(i) = a->VsrB(indexA) ^ b->VsrB(indexB);
     }
     *r = result;
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use " Mark Cave-Ayland
@ 2019-01-27 12:07   ` BALATON Zoltan
  2019-01-27 15:19     ` Mark Cave-Ayland
  0 siblings, 1 reply; 21+ messages in thread
From: BALATON Zoltan @ 2019-01-27 12:07 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: qemu-devel, qemu-ppc, richard.henderson, david

On Sun, 27 Jan 2019, Mark Cave-Ayland wrote:
> The current implementations make use of the endian-specific macros MRGLO/MRGHI
> and also reference HI_IDX and LO_IDX directly to calculate array offsets.
>
> Rework the implementation to use the Vsr* macros so that these per-endian
> references can be removed.
>
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
> ---
> target/ppc/int_helper.c | 52 ++++++++++++++++++++++++-------------------------
> 1 file changed, 25 insertions(+), 27 deletions(-)
>
> diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
> index 598731d47a..f084a706ee 100644
> --- a/target/ppc/int_helper.c
> +++ b/target/ppc/int_helper.c
> @@ -976,43 +976,41 @@ void helper_vmladduhm(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b, ppc_avr_t *c)
>     }
> }
>
> -#define VMRG_DO(name, element, highp)                                   \
> +#define VMRG_DOLO(name, element, access)                                \
>     void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)       \
>     {                                                                   \
>         ppc_avr_t result;                                               \
>         int i;                                                          \
> -        size_t n_elems = ARRAY_SIZE(r->element);                        \
> +        int m = ARRAY_SIZE(r->element) - 1;                             \
>                                                                         \
> -        for (i = 0; i < n_elems / 2; i++) {                             \
> -            if (highp) {                                                \
> -                result.element[i*2+HI_IDX] = a->element[i];             \
> -                result.element[i*2+LO_IDX] = b->element[i];             \
> -            } else {                                                    \
> -                result.element[n_elems - i * 2 - (1 + HI_IDX)] =        \
> -                    b->element[n_elems - i - 1];                        \
> -                result.element[n_elems - i * 2 - (1 + LO_IDX)] =        \
> -                    a->element[n_elems - i - 1];                        \
> -            }                                                           \
> +        for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \

Isn't this a performance hit? You seem to do twice as many iterations now:

- before, the loop was to ARRAY_SIZE/2 and was called twice so it 
executed ARRAY_SIZE times

- after you have a loop to ARRAY_SIZE but still called twice so it 
executes 2*ARRAY_SIZE times

Or do I miss something? If these are replaced with hardware vector 
instructions at the end then it may not matter to those who have a CPU with 
the needed vector instructions, but for others this may be slower than the 
previous hand-optimised version. But I haven't tested it, just guessing.

Regards,
BALATON Zoltan

> +            result.access(m - i) = (i & 1) ? a->access(m - (i >> 1))    \
> +                                           : b->access(m - (i >> 1));   \
>         }                                                               \
>         *r = result;                                                    \
>     }
> -#if defined(HOST_WORDS_BIGENDIAN)
> -#define MRGHI 0
> -#define MRGLO 1
> -#else
> -#define MRGHI 1
> -#define MRGLO 0
> -#endif
> -#define VMRG(suffix, element)                   \
> -    VMRG_DO(mrgl##suffix, element, MRGHI)       \
> -    VMRG_DO(mrgh##suffix, element, MRGLO)
> -VMRG(b, u8)
> -VMRG(h, u16)
> -VMRG(w, u32)
> +
> +#define VMRG_DOHI(name, element, access)                                \
> +    void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)       \
> +    {                                                                   \
> +        ppc_avr_t result;                                               \
> +        int i;                                                          \
> +                                                                        \
> +        for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
> +            result.access(i) = (i & 1) ? b->access(i >> 1)              \
> +                                       : a->access(i >> 1);             \
> +        }                                                               \
> +        *r = result;                                                    \
> +    }
> +
> +#define VMRG(suffix, element, access)                  \
> +    VMRG_DOLO(mrgl##suffix, element, access)           \
> +    VMRG_DOHI(mrgh##suffix, element, access)
> +VMRG(b, u8, VsrB)
> +VMRG(h, u16, VsrH)
> +VMRG(w, u32, VsrW)
> #undef VMRG_DO
> #undef VMRG
> -#undef MRGHI
> -#undef MRGLO
>
> void helper_vmsummbm(CPUPPCState *env, ppc_avr_t *r, ppc_avr_t *a,
>                      ppc_avr_t *b, ppc_avr_t *c)
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 12:07   ` [Qemu-devel] [Qemu-ppc] " BALATON Zoltan
@ 2019-01-27 15:19     ` Mark Cave-Ayland
  2019-01-27 17:26       ` Richard Henderson
  0 siblings, 1 reply; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27 15:19 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: richard.henderson, qemu-ppc, qemu-devel, david

On 27/01/2019 12:07, BALATON Zoltan wrote:

> On Sun, 27 Jan 2019, Mark Cave-Ayland wrote:
>> The current implementations make use of the endian-specific macros MRGLO/MRGHI
>> and also reference HI_IDX and LO_IDX directly to calculate array offsets.
>>
>> Rework the implementation to use the Vsr* macros so that these per-endian
>> references can be removed.
>>
>> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
>> ---
>> target/ppc/int_helper.c | 52 ++++++++++++++++++++++++-------------------------
>> 1 file changed, 25 insertions(+), 27 deletions(-)
>>
>> diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
>> index 598731d47a..f084a706ee 100644
>> --- a/target/ppc/int_helper.c
>> +++ b/target/ppc/int_helper.c
>> @@ -976,43 +976,41 @@ void helper_vmladduhm(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t
>> *b, ppc_avr_t *c)
>>     }
>> }
>>
>> -#define VMRG_DO(name, element, highp)                                   \
>> +#define VMRG_DOLO(name, element, access)                                \
>>     void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)       \
>>     {                                                                   \
>>         ppc_avr_t result;                                               \
>>         int i;                                                          \
>> -        size_t n_elems = ARRAY_SIZE(r->element);                        \
>> +        int m = ARRAY_SIZE(r->element) - 1;                             \
>>                                                                         \
>> -        for (i = 0; i < n_elems / 2; i++) {                             \
>> -            if (highp) {                                                \
>> -                result.element[i*2+HI_IDX] = a->element[i];             \
>> -                result.element[i*2+LO_IDX] = b->element[i];             \
>> -            } else {                                                    \
>> -                result.element[n_elems - i * 2 - (1 + HI_IDX)] =        \
>> -                    b->element[n_elems - i - 1];                        \
>> -                result.element[n_elems - i * 2 - (1 + LO_IDX)] =        \
>> -                    a->element[n_elems - i - 1];                        \
>> -            }                                                           \
>> +        for (i = 0; i < ARRAY_SIZE(r->element); i++) {                  \
> 
> Isn't this a performance hit? You seem to do twice as many iterations now:
> 
> - before, the loop was to ARRAY_SIZE/2 and was called twice so it executed ARRAY_SIZE
> times
> 
> - after you have a loop to ARRAY_SIZE but still called twice so it executes
> 2*ARRAY_SIZE times
> 
> Or do I miss something? If these are replaced with hardware vector instructions at
> the end then it may not matter to those who have a CPU with the needed vector
> instructions, but for others this may be slower than the previous hand-optimised
> version. But I haven't tested it, just guessing.

One point to clarify here is that the HI and LO variants of vmrg{l,h}{b,h,w} are
separate instructions, so the input elements are being iterated over once, both
before and after the patch.

Simplifying the patch down then effectively what happens is that the patch has
changed the merge implementation from:

    for (i = 0; i < ARRAY_SIZE / 2; i++) {
        result[f(2 * i)] = a->element[g(i)];
        result[f(2 * i + 1)] = b->element[g(i)];
    }

to:

    for (i = 0; i < ARRAY_SIZE; i++) {
        result[f(i)] = (i & 1) ? a->element[g(i)] : b->element[g(i)];
    }

So you're still calculating the same number of result elements, even though the loop
executes twice as many times.

Could this make the loop slower? I certainly haven't noticed any obvious performance
difference during testing (OS X uses merge quite a bit for display rendering), and
I'd hope that with a good compiler and modern branch prediction then any effect here
would be negligible.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 15:19     ` Mark Cave-Ayland
@ 2019-01-27 17:26       ` Richard Henderson
  2019-01-27 17:45         ` Mark Cave-Ayland
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2019-01-27 17:26 UTC (permalink / raw)
  To: Mark Cave-Ayland, BALATON Zoltan; +Cc: qemu-ppc, qemu-devel, david

On 1/27/19 7:19 AM, Mark Cave-Ayland wrote:
> Could this make the loop slower? I certainly haven't noticed any obvious
> performance difference during testing (OS X uses merge quite a bit for
> display rendering), and I'd hope that with a good compiler and modern branch
> prediction then any effect here would be negligible.
I would expect the i < n/2 loop to be faster, because the assignments are
unconditional.  FWIW.


r~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 17:26       ` Richard Henderson
@ 2019-01-27 17:45         ` Mark Cave-Ayland
  2019-01-27 18:07           ` Richard Henderson
  2019-01-27 20:31           ` BALATON Zoltan
  0 siblings, 2 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-27 17:45 UTC (permalink / raw)
  To: Richard Henderson, BALATON Zoltan; +Cc: qemu-ppc, qemu-devel, david

On 27/01/2019 17:26, Richard Henderson wrote:

> On 1/27/19 7:19 AM, Mark Cave-Ayland wrote:
>> Could this make the loop slower? I certainly haven't noticed any obvious
>> performance difference during testing (OS X uses merge quite a bit for
>> display rendering), and I'd hope that with a good compiler and modern branch
>> prediction then any effect here would be negligible.
>
> I would expect the i < n/2 loop to be faster, because the assignments are
> unconditional.  FWIW.

Do you have any idea as to how much faster? Is it something that would show up as
significant within the context of QEMU?

As well as eliminating the HI_IDX/LO_IDX constants I do find the updated version much
easier to read, so I would prefer to keep it if possible. What about unrolling the
loop into 2 separate ones e.g.

    for (i = 0; i < ARRAY_SIZE(r->element); i+=2) {
        result.access(i) = a->access(i >> 1);
    }

    for (i = 1; i < ARRAY_SIZE(r->element); i+=2) {
        result.access(i) = b->access(i >> 1);
    }

Would you expect this to perform better than the version proposed in the patchset?


ATB,

Mark.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 17:45         ` Mark Cave-Ayland
@ 2019-01-27 18:07           ` Richard Henderson
  2019-01-29  2:28             ` David Gibson
  2019-01-29 18:49             ` Mark Cave-Ayland
  2019-01-27 20:31           ` BALATON Zoltan
  1 sibling, 2 replies; 21+ messages in thread
From: Richard Henderson @ 2019-01-27 18:07 UTC (permalink / raw)
  To: Mark Cave-Ayland, BALATON Zoltan; +Cc: qemu-ppc, qemu-devel, david

On 1/27/19 9:45 AM, Mark Cave-Ayland wrote:
>> I would expect the i < n/2 loop to be faster, because the assignments are
>> unconditional.  FWIW.
> 
> Do you have any idea as to how much faster? Is it something that would show
> up as significant within the context of QEMU?

I don't have any numbers on that, no.

> As well as eliminating the HI_IDX/LO_IDX constants I do find the updated
> version much easier to read, so I would prefer to keep it if possible.
> What about unrolling the loop into 2 separate ones...

I doubt that would be helpful.

I would think that

#define VMRG_DO(name, access, ofs)
...
    int i, half = ARRAY_SIZE(r->access(0)) / 2;
...
    for (i = 0; i < half; i++) {
        result.access(2 * i + 0) = a->access(i + ofs);
        result.access(2 * i + 1) = b->access(i + ofs);
    }

where OFS = 0 for HI and half for LO is best.  I find it quite readable, and it
avoids duplicating code between LO and HI as you're currently doing.


r~
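
A minimal, compilable sketch of how the ofs-based merge above could be written
out in full; the macro shape (name, element, access, ofs) and the two
instantiations are illustrative assumptions rather than the eventual v4 patch,
and rely on the ppc_avr_t union plus the VsrB accessor from patch 1:

    #define VMRG_DO(name, element, access, ofs)                         \
        void helper_v##name(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)   \
        {                                                               \
            ppc_avr_t result;                                           \
            int i, half = ARRAY_SIZE(r->element) / 2;                   \
                                                                        \
            /* interleave one element of a and one of b per step */     \
            for (i = 0; i < half; i++) {                                \
                result.access(i * 2 + 0) = a->access(i + ofs);          \
                result.access(i * 2 + 1) = b->access(i + ofs);          \
            }                                                           \
            *r = result;                                                \
        }

    /* merge-high starts at element 0, merge-low at the halfway point */
    VMRG_DO(mrghb, u8, VsrB, 0)
    VMRG_DO(mrglb, u8, VsrB, 8)

Because the Vsr* accessors give big-endian element numbering on either host,
the two assignments per iteration stay unconditional, which is the property
Richard points out above.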

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v3 4/8] target/ppc: eliminate use of HI_IDX and LO_IDX macros from int_helper.c
  2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 4/8] target/ppc: eliminate use of HI_IDX and LO_IDX macros from int_helper.c Mark Cave-Ayland
@ 2019-01-27 19:13   ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2019-01-27 19:13 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 1/27/19 1:03 AM, Mark Cave-Ayland wrote:
> The original purpose of these macros was to correctly reference the high and low
> parts of the VSRs regardless of the host endianness.
> 
> Replace these direct references to high and low parts with the relevant VsrD
> macro instead, and completely remove the now-unused HI_IDX and LO_IDX macros.
> 
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
> ---
>  target/ppc/int_helper.c | 180 +++++++++++++++++++++++-------------------------
>  1 file changed, 85 insertions(+), 95 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 17:45         ` Mark Cave-Ayland
  2019-01-27 18:07           ` Richard Henderson
@ 2019-01-27 20:31           ` BALATON Zoltan
  2019-01-27 20:47             ` BALATON Zoltan
  1 sibling, 1 reply; 21+ messages in thread
From: BALATON Zoltan @ 2019-01-27 20:31 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: Richard Henderson, qemu-ppc, qemu-devel, david

On Sun, 27 Jan 2019, Mark Cave-Ayland wrote:
> On 27/01/2019 17:26, Richard Henderson wrote:
>
>> On 1/27/19 7:19 AM, Mark Cave-Ayland wrote:
>>> Could this make the loop slower? I certainly haven't noticed any obvious
>>> performance difference during testing (OS X uses merge quite a bit for
>>> display rendering), and I'd hope that with a good compiler and modern branch
>>> prediction then any effect here would be negligible.
>>
>> I would expect the i < n/2 loop to be faster, because the assignments are
>> unconditional.  FWIW.
>
> Do you have any idea as to how much faster? Is it something that would show up as
> significant within the context of QEMU?

I don't have numbers either but since these vector ops are meant to and 
are used for speeding up repetitive calculations I'd expect it to be run 
many times which means that even a small difference would add up. So I 
think it's worth trying to make these optimal also when host vector ops 
cannot be used.

I don't know about a good benchmark to measure this. Maybe you could try 
converting some video in Mac OS X or something similar that's known to use 
AltiVec/VMX. There are also these under MorphOS on mac99:
http://www.amiga-news.de/en/news/AN-2012-02-00011-EN.html
where the mplayer one is mostly VMX bound I think and lame is more 
dependent on floating point ops but that also has a VMX version (still 
mainly float I think). I'd copy the input file to the RAM: disk first to avoid 
overhead from IDE emulation. But these are probably too short to measure 
this.

I can't test this now but maybe someone reading this on the list who can 
try it with and without this series could help.

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 20:31           ` BALATON Zoltan
@ 2019-01-27 20:47             ` BALATON Zoltan
  0 siblings, 0 replies; 21+ messages in thread
From: BALATON Zoltan @ 2019-01-27 20:47 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: Richard Henderson, qemu-ppc, qemu-devel, david

On Sun, 27 Jan 2019, BALATON Zoltan wrote:
> On Sun, 27 Jan 2019, Mark Cave-Ayland wrote:
>> On 27/01/2019 17:26, Richard Henderson wrote:
>> 
>>> On 1/27/19 7:19 AM, Mark Cave-Ayland wrote:
>>>> Could this make the loop slower? I certainly haven't noticed any obvious
>>>> performance difference during testing (OS X uses merge quite a bit for
>>>> display rendering), and I'd hope that with a good compiler and modern 
>>>> branch
>>>> prediction then any effect here would be negligible.
>>> 
>>> I would expect the i < n/2 loop to be faster, because the assignments are
>>> unconditional.  FWIW.
>> 
>> Do you have any idea as to how much faster? Is it something that would show 
>> up as
>> significant within the context of QEMU?
>
> I don't have numbers either but since these vector ops are meant to and are 
> used for speeding up repetitive calculations I'd expect it to be run many 
> times which means that even a small difference would add up. So I think it's 
> worth trying to make these optimal also when host vector ops cannot be used.
>
> I don't know about a good benchmark to measure this. Maybe you could try 
> converting some video in Mac OS X or something similar that's known to use 
> AltiVec/VMX. There are also these under MorphOS on mac99:
> http://www.amiga-news.de/en/news/AN-2012-02-00011-EN.html
> where the mplayer one is mostly VMX bound I think and lame is more dependent 
> on floating point ops but that also has a VMX version (still mainly float I 
> think). I'd copy the input file to the RAM: disk first to avoid overhead from IDE 
> emulation. But these are probably too short to measure this.
>
> I can't test this now but maybe someone reading this on the list who can try 
> it with and without this series could help.

I've found these (untested and quite old but may work) so you don't need 
MorphOS, only OS X:

https://tmkk.undo.jp/lame/index_e.html

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH v3 1/8] target/ppc: implement complete set of Vsr* macros
  2019-01-27  9:02 ` [Qemu-devel] [PATCH v3 1/8] target/ppc: implement complete set of Vsr* macros Mark Cave-Ayland
@ 2019-01-28  9:19   ` David Gibson
  0 siblings, 0 replies; 21+ messages in thread
From: David Gibson @ 2019-01-28  9:19 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: qemu-devel, qemu-ppc, richard.henderson

[-- Attachment #1: Type: text/plain, Size: 1568 bytes --]

On Sun, Jan 27, 2019 at 09:02:59AM +0000, Mark Cave-Ayland wrote:
> This prepares us for eliminating the use of direct array access within the VMX
> instruction implementations.
> 
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

Applied to ppc-for-4.0.

> ---
>  target/ppc/internal.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/target/ppc/internal.h b/target/ppc/internal.h
> index c7c0f77dd6..f26a71ffcf 100644
> --- a/target/ppc/internal.h
> +++ b/target/ppc/internal.h
> @@ -206,16 +206,23 @@ EXTRACT_HELPER_SPLIT_3(DCMX_XV, 5, 16, 0, 1, 2, 5, 1, 6, 6);
>  
>  #if defined(HOST_WORDS_BIGENDIAN)
>  #define VsrB(i) u8[i]
> +#define VsrSB(i) s8[i]
>  #define VsrH(i) u16[i]
> +#define VsrSH(i) s16[i]
>  #define VsrW(i) u32[i]
> +#define VsrSW(i) s32[i]
>  #define VsrD(i) u64[i]
> +#define VsrSD(i) s64[i]
>  #else
>  #define VsrB(i) u8[15 - (i)]
> +#define VsrSB(i) s8[15 - (i)]
>  #define VsrH(i) u16[7 - (i)]
> +#define VsrSH(i) s16[7 - (i)]
>  #define VsrW(i) u32[3 - (i)]
> +#define VsrSW(i) s32[3 - (i)]
>  #define VsrD(i) u64[1 - (i)]
> +#define VsrSD(i) s64[1 - (i)]
>  #endif
> -
>  static inline void getVSR(int n, ppc_vsr_t *vsr, CPUPPCState *env)
>  {
>      vsr->VsrD(0) = env->vsr[n].u64[0];

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
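
As a quick illustration of what these accessors buy (a sketch only, assuming
the ppc_avr_t union these helpers operate on; the value is made up for the
example), VsrB(0) always names the most significant byte of the 128-bit
register on either host:

    ppc_avr_t v = { .u64 = { 0, 0 } };

    v.VsrB(0) = 0xff;    /* most significant byte of the vector */

    /* expands to v.u8[0] = 0xff on a big-endian host and to
     * v.u8[15] = 0xff on a little-endian one, so helpers can index
     * elements in PowerPC order without per-endian #ifdefs */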

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 18:07           ` Richard Henderson
@ 2019-01-29  2:28             ` David Gibson
  2019-01-29  6:36               ` Mark Cave-Ayland
  2019-01-29 18:49             ` Mark Cave-Ayland
  1 sibling, 1 reply; 21+ messages in thread
From: David Gibson @ 2019-01-29  2:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Mark Cave-Ayland, BALATON Zoltan, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1436 bytes --]

On Sun, Jan 27, 2019 at 10:07:12AM -0800, Richard Henderson wrote:
> On 1/27/19 9:45 AM, Mark Cave-Ayland wrote:
> >> I would expect the i < n/2 loop to be faster, because the assignments are
> >> unconditional.  FWIW.
> > 
> > Do you have any idea as to how much faster? Is it something that would show
> > up as significant within the context of QEMU?
> 
> I don't have any numbers on that, no.
> 
> > As well as eliminating the HI_IDX/LO_IDX constants I do find the updated
> > version much easier to read, so I would prefer to keep it if possible.
> > What about unrolling the loop into 2 separate ones...
> 
> I doubt that would be helpful.
> 
> I would think that
> 
> #define VMRG_DO(name, access, ofs)
> ...
>     int i, half = ARRAY_SIZE(r->access(0)) / 2;
> ...
>     for (i = 0; i < half; i++) {
>         result.access(2 * i + 0) = a->access(i + ofs);
>         result.access(2 * i + 1) = b->access(i + ofs);
>     }
> 
> where OFS = 0 for HI and half for LO is best.  I find it quite readable, and it
> avoids duplicating code between LO and HI as you're currently doing.

Mark, Richard, where are we at with this?

Should I wait on a revised version of this patch before applying the
series?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-29  2:28             ` David Gibson
@ 2019-01-29  6:36               ` Mark Cave-Ayland
  0 siblings, 0 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-29  6:36 UTC (permalink / raw)
  To: David Gibson, Richard Henderson; +Cc: qemu-ppc, qemu-devel

On 29/01/2019 02:28, David Gibson wrote:

> On Sun, Jan 27, 2019 at 10:07:12AM -0800, Richard Henderson wrote:
>> On 1/27/19 9:45 AM, Mark Cave-Ayland wrote:
>>>> I would expect the i < n/2 loop to be faster, because the assignments are
>>>> unconditional.  FWIW.
>>>
>>> Do you have any idea as to how much faster? Is it something that would show
>>> up as significant within the context of QEMU?
>>
>> I don't have any numbers on that, no.
>>
>>> As well as eliminating the HI_IDX/LO_IDX constants I do find the updated
>>> version much easier to read, so I would prefer to keep it if possible.
>>> What about unrolling the loop into 2 separate ones...
>>
>> I doubt that would be helpful.
>>
>> I would think that
>>
>> #define VMRG_DO(name, access, ofs)
>> ...
>>     int i, half = ARRAY_SIZE(r->access(0)) / 2;
>> ...
>>     for (i = 0; i < half; i++) {
>>         result.access(2 * i + 0) = a->access(i + ofs);
>>         result.access(2 * i + 1) = b->access(i + ofs);
>>     }
>>
>> where OFS = 0 for HI and half for LO is best.  I find it quite readable, and it
>> avoids duplicating code between LO and HI as you're currently doing.
> 
> Mark, Richard, where are we at with this?
> 
> Should I wait on a revised version of this patch before applying the
> series?

Certainly the v3 as posted is correct (I've tested this particular patch on both big
and little endian machines), so I believe the only question is whether this introduces
any noticeable performance penalty.

Let me try and run a few simple tests and report back.

BTW are you able to take my qemu-macppc queue posted yesterday at
https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg07263.html? There's no
functional change except for PPC MacOS users who explicitly enable the new QEMU EDID
support on the command line.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use Vsr* macros
  2019-01-27 18:07           ` Richard Henderson
  2019-01-29  2:28             ` David Gibson
@ 2019-01-29 18:49             ` Mark Cave-Ayland
  1 sibling, 0 replies; 21+ messages in thread
From: Mark Cave-Ayland @ 2019-01-29 18:49 UTC (permalink / raw)
  To: Richard Henderson, BALATON Zoltan; +Cc: qemu-ppc, qemu-devel, david

[-- Attachment #1: Type: text/plain, Size: 2088 bytes --]

On 27/01/2019 18:07, Richard Henderson wrote:

> On 1/27/19 9:45 AM, Mark Cave-Ayland wrote:
>>> I would expect the i < n/2 loop to be faster, because the assignments are
>>> unconditional.  FWIW.
>>
>> Do you have any idea as to how much faster? Is it something that would show
>> up as significant within the context of QEMU?
> 
> I don't have any numbers on that, no.
> 
>> As well as eliminating the HI_IDX/LO_IDX constants I do find the updated
>> version much easier to read, so I would prefer to keep it if possible.
>> What about unrolling the loop into 2 separate ones...
> 
> I doubt that would be helpful.
> 
> I would think that
> 
> #define VMRG_DO(name, access, ofs)
> ...
>     int i, half = ARRAY_SIZE(r->access(0)) / 2;
> ...
>     for (i = 0; i < half; i++) {
>         result.access(2 * i + 0) = a->access(i + ofs);
>         result.access(2 * i + 1) = b->access(i + ofs);
>     }
> 
> where OFS = 0 for HI and half for LO is best.  I find it quite readable, and it
> avoids duplicating code between LO and HI as you're currently doing.

Attached is my test program which benchmarks the different approaches across
0x8000000 iterations and gives the following sample output here on an i7 when
compiled with gcc -O2:

$ ./mcatest
Benchmark 1 - existing merge high
Elapsed time: 1434735 us

Benchmark 2 - v3 merge high
Elapsed time: 2603553 us

Benchmark 3 - 2 loops merge high
Elapsed time: 2395434 us

Benchmark 4 - Richard's merge high
Elapsed time: 1318369 us


These indicate that the proposed v3 merge algorithm is roughly 80% slower than the
existing implementation - it wasn't something noticeable during emulation, but in a
benchmark situation the additional overhead is clearly visible.

TLDR: after playing around with the different approaches, Richard's proposed
algorithm is the fastest, and is actually slightly quicker than the current
implementation. Please go to the foyer after class where you can collect your prize :)

On this basis I'll redo v3 using Richard's algorithm and post a v4 later assuming
that it passes my local tests again.


ATB,

Mark.

[-- Attachment #2: mcatest.c --]
[-- Type: text/x-csrc, Size: 3400 bytes --]

#include <stdio.h>
#include <stdint.h>
#include <float.h>
#include <sys/time.h>

/* VSX/Altivec registers (128 bits) */
typedef union _ppc_vsr_t {
    uint8_t u8[16];
    uint16_t u16[8];
    uint32_t u32[4];
    uint64_t u64[2];
    int8_t s8[16];
    int16_t s16[8];
    int32_t s32[4];
    int64_t s64[2];
    float f32[4];
    double f64[2];
} ppc_vsr_t;

typedef ppc_vsr_t ppc_avr_t;

#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
/* Select host-endian element accessors at preprocessing time;
 * __BYTE_ORDER__ is predefined by GCC and Clang, and the fallback
 * assumes a little-endian host. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define HI_IDX 0
#define LO_IDX 1
#define VsrB(i) u8[i]
#else
#define HI_IDX 1
#define LO_IDX 0
#define VsrB(i) u8[15 - (i)]
#endif

#define ITERATIONS 0x8000000

void benchmark1(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    /* Existing algorithm */
    ppc_avr_t result;
    int i;
    size_t n_elems = ARRAY_SIZE(r->u8);

    for (i = 0; i < n_elems / 2; i++) {
        result.u8[i*2+HI_IDX] = a->u8[i];
        result.u8[i*2+LO_IDX] = b->u8[i];
    }
    *r = result;    
}

void benchmark2(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    /* v3 patchset algorithm */
    ppc_avr_t result;
    int i;

    for (i = 0; i < ARRAY_SIZE(r->u8); i++) {
        result.VsrB(i) = (i & 1) ? b->VsrB(i >> 1)
                                 : a->VsrB(i >> 1);
    }
    *r = result;
}

void benchmark3(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    /* Split loop into 2 halves */
    ppc_avr_t result;
    int i;

    for (i = 0; i < ARRAY_SIZE(r->u8); i+=2) {
        result.VsrB(i) = a->VsrB(i >> 1);
    }

    for (i = 1; i < ARRAY_SIZE(r->u8); i+=2) {
        result.VsrB(i) = b->VsrB(i >> 1);
    }
    *r = result;
}

void benchmark4(ppc_avr_t *r, ppc_avr_t *a, ppc_avr_t *b)
{
    /* Richard's merge high */
    ppc_avr_t result;    
    int i;
    int half = ARRAY_SIZE(r->u8) / 2;
    int ofs = 0;

    for (i = 0; i < half; i++) {
        result.VsrB(i * 2 + 0) = a->VsrB(i + ofs);
        result.VsrB(i * 2 + 1) = b->VsrB(i + ofs);
    }
    *r = result;
}

int elapsed_time(struct timeval *st, struct timeval *et)
{
    return ((et->tv_sec - st->tv_sec) * 1000000) + (et->tv_usec - st->tv_usec);    
}

int main(int argc, char *argv[]) 
{
    ppc_avr_t a, b, r;
    int i;
    struct timeval start_time, end_time;

    printf("Benchmark 1 - existing merge high\n");
    gettimeofday(&start_time, NULL);

    for (i = 0; i < ITERATIONS; i++) {
        benchmark1(&r, &a, &b);
    }
    
    gettimeofday(&end_time, NULL);
    printf("Elapsed time: %d us\n\n", elapsed_time(&start_time, &end_time));
    

    printf("Benchmark 2 - v3 merge high\n");
    gettimeofday(&start_time, NULL);

    for (i = 0; i < ITERATIONS; i++) {
        benchmark2(&r, &a, &b);
    }
    
    gettimeofday(&end_time, NULL);
    printf("Elapsed time: %d us\n\n", elapsed_time(&start_time, &end_time));
    

    printf("Benchmark 3 - 2 loops merge high\n");
    gettimeofday(&start_time, NULL);

    for (i = 0; i < ITERATIONS; i++) {
        benchmark3(&r, &a, &b);
    }

    gettimeofday(&end_time, NULL);
    printf("Elapsed time: %d us\n\n", elapsed_time(&start_time, &end_time));


    printf("Benchmark 4 - Richard's merge high\n");
    gettimeofday(&start_time, NULL);

    for (i = 0; i < ITERATIONS; i++) {
        benchmark4(&r, &a, &b);
    }
    
    gettimeofday(&end_time, NULL);
    printf("Elapsed time: %d us\n\n", elapsed_time(&start_time, &end_time));
}
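
The program above is self-contained; to reproduce the comparison, a build
along these lines should work (the command is an example matching the -O2
setting mentioned in the mail):

    gcc -O2 -o mcatest mcatest.c
    ./mcatest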

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-01-29 18:49 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-27  9:02 [Qemu-devel] [PATCH v3 0/8] target/ppc: remove various endian hacks from int_helper.c Mark Cave-Ayland
2019-01-27  9:02 ` [Qemu-devel] [PATCH v3 1/8] target/ppc: implement complete set of Vsr* macros Mark Cave-Ayland
2019-01-28  9:19   ` David Gibson
2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 2/8] target/ppc: rework vmrg{l, h}{b, h, w} instructions to use " Mark Cave-Ayland
2019-01-27 12:07   ` [Qemu-devel] [Qemu-ppc] " BALATON Zoltan
2019-01-27 15:19     ` Mark Cave-Ayland
2019-01-27 17:26       ` Richard Henderson
2019-01-27 17:45         ` Mark Cave-Ayland
2019-01-27 18:07           ` Richard Henderson
2019-01-29  2:28             ` David Gibson
2019-01-29  6:36               ` Mark Cave-Ayland
2019-01-29 18:49             ` Mark Cave-Ayland
2019-01-27 20:31           ` BALATON Zoltan
2019-01-27 20:47             ` BALATON Zoltan
2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 3/8] target/ppc: rework vmul{e, o}{s, u}{b, " Mark Cave-Ayland
2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 4/8] target/ppc: eliminate use of HI_IDX and LO_IDX macros from int_helper.c Mark Cave-Ayland
2019-01-27 19:13   ` Richard Henderson
2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 5/8] target/ppc: eliminate use of EL_IDX " Mark Cave-Ayland
2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 6/8] target/ppc: simplify VEXT_SIGNED macro in int_helper.c Mark Cave-Ayland
2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 7/8] target/ppc: remove ROTRu32 and ROTRu64 macros from int_helper.c Mark Cave-Ayland
2019-01-27  9:03 ` [Qemu-devel] [PATCH v3 8/8] target/ppc: remove various HOST_WORDS_BIGENDIAN hacks in int_helper.c Mark Cave-Ayland
