All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
@ 2018-12-07  8:56 Mark Cave-Ayland
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access Mark Cave-Ayland
                   ` (7 more replies)
  0 siblings, 8 replies; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-07  8:56 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, david, richard.henderson

This patchset is an attempt at trying to improve the VMX (Altivec) instruction
performance by making use of the new TCG vector operations where possible.

In order to use TCG vector operations, the registers must be accessible from cpu_env
whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
registers using the supplied TCGv_i64 parameter.

Once this is done, patch 4 enables us to remove the static TCG global arrays and updates
the access helpers to read/write to the relevant fields in cpu_env directly.

The final patches 5 and 6 convert the VMX logical instructions and addition/subtraction
instructions respectively over to the TCG vector operations.

NOTE: there are a lot of instructions that cannot (yet) be optimised to use TCG vector
operations, however it struck me that there may be some potential for converting
saturating add/sub and cmp instructions if there were a mechanism to return a set of
flags indicating the result of the saturation/comparison.

Finally thanks to Richard for taking the time to answer some of my (mostly beginner)
questions related to TCG.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>


Mark Cave-Ayland (6):
  target/ppc: introduce get_fpr() and set_fpr() helpers for FP register
    access
  target/ppc: introduce get_avr64() and set_avr64() helpers for VMX
    register access
  target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}()
    helpers for VSR register access
  target/ppc: switch FPR, VMX and VSX helpers to access data directly
    from cpu_env
  target/ppc: convert VMX logical instructions to use vector operations
  target/ppc: convert vaddu[b,h,w,d] and vsubu[b,h,w,d] over to use
    vector operations

 target/ppc/helper.h                 |   8 -
 target/ppc/int_helper.c             |   7 -
 target/ppc/translate.c              |  72 ++--
 target/ppc/translate/fp-impl.inc.c  | 492 ++++++++++++++++++-----
 target/ppc/translate/vmx-impl.inc.c | 182 ++++++---
 target/ppc/translate/vsx-impl.inc.c | 782 ++++++++++++++++++++++++++----------
 6 files changed, 1110 insertions(+), 433 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
@ 2018-12-07  8:56 ` Mark Cave-Ayland
  2018-12-10  5:17   ` David Gibson
  2018-12-10 18:43   ` Richard Henderson
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 2/6] target/ppc: introduce get_avr64() and set_avr64() helpers for VMX " Mark Cave-Ayland
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-07  8:56 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, david, richard.henderson

These helpers allow us to move FP register values to/from the specified TCGv_i64
argument.

To prevent FP helpers accessing the cpu_fpr array directly, add extra TCG
temporaries as required.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/translate.c             |  10 +
 target/ppc/translate/fp-impl.inc.c | 492 ++++++++++++++++++++++++++++---------
 2 files changed, 392 insertions(+), 110 deletions(-)

diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 2b37910248..1d4bf624a3 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -6694,6 +6694,16 @@ static inline void gen_##name(DisasContext *ctx)               \
 GEN_TM_PRIV_NOOP(treclaim);
 GEN_TM_PRIV_NOOP(trechkpt);
 
+static inline void get_fpr(TCGv_i64 dst, int regno)
+{
+    tcg_gen_mov_i64(dst, cpu_fpr[regno]);
+}
+
+static inline void set_fpr(int regno, TCGv_i64 src)
+{
+    tcg_gen_mov_i64(cpu_fpr[regno], src);
+}
+
 #include "translate/fp-impl.inc.c"
 
 #include "translate/vmx-impl.inc.c"
diff --git a/target/ppc/translate/fp-impl.inc.c b/target/ppc/translate/fp-impl.inc.c
index 08770ba9f5..923fb7550f 100644
--- a/target/ppc/translate/fp-impl.inc.c
+++ b/target/ppc/translate/fp-impl.inc.c
@@ -34,24 +34,39 @@ static void gen_set_cr1_from_fpscr(DisasContext *ctx)
 #define _GEN_FLOAT_ACB(name, op, op1, op2, isfloat, set_fprf, type)           \
 static void gen_f##name(DisasContext *ctx)                                    \
 {                                                                             \
+    TCGv_i64 t0;                                                              \
+    TCGv_i64 t1;                                                              \
+    TCGv_i64 t2;                                                              \
+    TCGv_i64 t3;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
+    t0 = tcg_temp_new_i64();                                                  \
+    t1 = tcg_temp_new_i64();                                                  \
+    t2 = tcg_temp_new_i64();                                                  \
+    t3 = tcg_temp_new_i64();                                                  \
     gen_reset_fpstatus();                                                     \
-    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
-                     cpu_fpr[rA(ctx->opcode)],                                \
-                     cpu_fpr[rC(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);     \
+    get_fpr(t0, rA(ctx->opcode));                                             \
+    get_fpr(t1, rC(ctx->opcode));                                             \
+    get_fpr(t2, rB(ctx->opcode));                                             \
+    gen_helper_f##op(t3, cpu_env, t0, t1, t2);                                \
+    set_fpr(rD(ctx->opcode), t3);                                             \
     if (isfloat) {                                                            \
-        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
-                        cpu_fpr[rD(ctx->opcode)]);                            \
+        get_fpr(t0, rD(ctx->opcode));                                         \
+        gen_helper_frsp(t3, cpu_env, t0);                                     \
+        set_fpr(rD(ctx->opcode), t3);                                         \
     }                                                                         \
     if (set_fprf) {                                                           \
-        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
+        gen_compute_fprf_float64(t3);                                         \
     }                                                                         \
     if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
         gen_set_cr1_from_fpscr(ctx);                                          \
     }                                                                         \
+    tcg_temp_free_i64(t0);                                                    \
+    tcg_temp_free_i64(t1);                                                    \
+    tcg_temp_free_i64(t2);                                                    \
+    tcg_temp_free_i64(t3);                                                    \
 }
 
 #define GEN_FLOAT_ACB(name, op2, set_fprf, type)                              \
@@ -61,24 +76,35 @@ _GEN_FLOAT_ACB(name##s, name, 0x3B, op2, 1, set_fprf, type);
 #define _GEN_FLOAT_AB(name, op, op1, op2, inval, isfloat, set_fprf, type)     \
 static void gen_f##name(DisasContext *ctx)                                    \
 {                                                                             \
+    TCGv_i64 t0;                                                              \
+    TCGv_i64 t1;                                                              \
+    TCGv_i64 t2;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
+    t0 = tcg_temp_new_i64();                                                  \
+    t1 = tcg_temp_new_i64();                                                  \
+    t2 = tcg_temp_new_i64();                                                  \
     gen_reset_fpstatus();                                                     \
-    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
-                     cpu_fpr[rA(ctx->opcode)],                                \
-                     cpu_fpr[rB(ctx->opcode)]);                               \
+    get_fpr(t0, rA(ctx->opcode));                                             \
+    get_fpr(t1, rB(ctx->opcode));                                             \
+    gen_helper_f##op(t2, cpu_env, t0, t1);                                    \
+    set_fpr(rD(ctx->opcode), t2);                                             \
     if (isfloat) {                                                            \
-        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
-                        cpu_fpr[rD(ctx->opcode)]);                            \
+        get_fpr(t0, rD(ctx->opcode));                                         \
+        gen_helper_frsp(t2, cpu_env, t0);                                     \
+        set_fpr(rD(ctx->opcode), t2);                                         \
     }                                                                         \
     if (set_fprf) {                                                           \
-        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
+        gen_compute_fprf_float64(t2);                                         \
     }                                                                         \
     if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
         gen_set_cr1_from_fpscr(ctx);                                          \
     }                                                                         \
+    tcg_temp_free_i64(t0);                                                    \
+    tcg_temp_free_i64(t1);                                                    \
+    tcg_temp_free_i64(t2);                                                    \
 }
 #define GEN_FLOAT_AB(name, op2, inval, set_fprf, type)                        \
 _GEN_FLOAT_AB(name, name, 0x3F, op2, inval, 0, set_fprf, type);               \
@@ -87,24 +113,35 @@ _GEN_FLOAT_AB(name##s, name, 0x3B, op2, inval, 1, set_fprf, type);
 #define _GEN_FLOAT_AC(name, op, op1, op2, inval, isfloat, set_fprf, type)     \
 static void gen_f##name(DisasContext *ctx)                                    \
 {                                                                             \
+    TCGv_i64 t0;                                                              \
+    TCGv_i64 t1;                                                              \
+    TCGv_i64 t2;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
+    t0 = tcg_temp_new_i64();                                                  \
+    t1 = tcg_temp_new_i64();                                                  \
+    t2 = tcg_temp_new_i64();                                                  \
     gen_reset_fpstatus();                                                     \
-    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
-                     cpu_fpr[rA(ctx->opcode)],                                \
-                     cpu_fpr[rC(ctx->opcode)]);                               \
+    get_fpr(t0, rA(ctx->opcode));                                             \
+    get_fpr(t1, rC(ctx->opcode));                                             \
+    gen_helper_f##op(t2, cpu_env, t0, t1);                                    \
+    set_fpr(rD(ctx->opcode), t2);                                             \
     if (isfloat) {                                                            \
-        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
-                        cpu_fpr[rD(ctx->opcode)]);                            \
+        get_fpr(t0, rD(ctx->opcode));                                         \
+        gen_helper_frsp(t2, cpu_env, t0);                                     \
+        set_fpr(rD(ctx->opcode), t2);                                         \
     }                                                                         \
     if (set_fprf) {                                                           \
-        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
+        gen_compute_fprf_float64(t2);                                         \
     }                                                                         \
     if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
         gen_set_cr1_from_fpscr(ctx);                                          \
     }                                                                         \
+    tcg_temp_free_i64(t0);                                                    \
+    tcg_temp_free_i64(t1);                                                    \
+    tcg_temp_free_i64(t2);                                                    \
 }
 #define GEN_FLOAT_AC(name, op2, inval, set_fprf, type)                        \
 _GEN_FLOAT_AC(name, name, 0x3F, op2, inval, 0, set_fprf, type);               \
@@ -113,37 +150,51 @@ _GEN_FLOAT_AC(name##s, name, 0x3B, op2, inval, 1, set_fprf, type);
 #define GEN_FLOAT_B(name, op2, op3, set_fprf, type)                           \
 static void gen_f##name(DisasContext *ctx)                                    \
 {                                                                             \
+    TCGv_i64 t0;                                                              \
+    TCGv_i64 t1;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
+    t0 = tcg_temp_new_i64();                                                  \
+    t1 = tcg_temp_new_i64();                                                  \
     gen_reset_fpstatus();                                                     \
-    gen_helper_f##name(cpu_fpr[rD(ctx->opcode)], cpu_env,                     \
-                       cpu_fpr[rB(ctx->opcode)]);                             \
+    get_fpr(t0, rB(ctx->opcode));                                             \
+    gen_helper_f##name(t1, cpu_env, t0);                                      \
+    set_fpr(rD(ctx->opcode), t1);                                             \
     if (set_fprf) {                                                           \
-        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
+        gen_compute_fprf_float64(t1);                                         \
     }                                                                         \
     if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
         gen_set_cr1_from_fpscr(ctx);                                          \
     }                                                                         \
+    tcg_temp_free_i64(t0);                                                    \
+    tcg_temp_free_i64(t1);                                                    \
 }
 
 #define GEN_FLOAT_BS(name, op1, op2, set_fprf, type)                          \
 static void gen_f##name(DisasContext *ctx)                                    \
 {                                                                             \
+    TCGv_i64 t0;                                                              \
+    TCGv_i64 t1;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
+    t0 = tcg_temp_new_i64();                                                  \
+    t1 = tcg_temp_new_i64();                                                  \
     gen_reset_fpstatus();                                                     \
-    gen_helper_f##name(cpu_fpr[rD(ctx->opcode)], cpu_env,                     \
-                       cpu_fpr[rB(ctx->opcode)]);                             \
+    get_fpr(t0, rB(ctx->opcode));                                             \
+    gen_helper_f##name(t1, cpu_env, t0);                                      \
+    set_fpr(rD(ctx->opcode), t1);                                             \
     if (set_fprf) {                                                           \
-        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
+        gen_compute_fprf_float64(t1);                                         \
     }                                                                         \
     if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
         gen_set_cr1_from_fpscr(ctx);                                          \
     }                                                                         \
+    tcg_temp_free_i64(t0);                                                    \
+    tcg_temp_free_i64(t1);                                                    \
 }
 
 /* fadd - fadds */
@@ -165,19 +216,25 @@ GEN_FLOAT_BS(rsqrte, 0x3F, 0x1A, 1, PPC_FLOAT_FRSQRTE);
 /* frsqrtes */
 static void gen_frsqrtes(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
     gen_reset_fpstatus();
-    gen_helper_frsqrte(cpu_fpr[rD(ctx->opcode)], cpu_env,
-                       cpu_fpr[rB(ctx->opcode)]);
-    gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,
-                    cpu_fpr[rD(ctx->opcode)]);
-    gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);
+    get_fpr(t0, rB(ctx->opcode));
+    gen_helper_frsqrte(t1, cpu_env, t0);
+    set_fpr(rD(ctx->opcode), t1);
+    gen_helper_frsp(t1, cpu_env, t1);
+    gen_compute_fprf_float64(t1);
     if (unlikely(Rc(ctx->opcode) != 0)) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* fsel */
@@ -189,34 +246,47 @@ GEN_FLOAT_AB(sub, 0x14, 0x000007C0, 1, PPC_FLOAT);
 /* fsqrt */
 static void gen_fsqrt(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
     gen_reset_fpstatus();
-    gen_helper_fsqrt(cpu_fpr[rD(ctx->opcode)], cpu_env,
-                     cpu_fpr[rB(ctx->opcode)]);
-    gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);
+    get_fpr(t0, rB(ctx->opcode));
+    gen_helper_fsqrt(t1, cpu_env, t0);
+    set_fpr(rD(ctx->opcode), t1);
+    gen_compute_fprf_float64(t1);
     if (unlikely(Rc(ctx->opcode) != 0)) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 static void gen_fsqrts(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
     gen_reset_fpstatus();
-    gen_helper_fsqrt(cpu_fpr[rD(ctx->opcode)], cpu_env,
-                     cpu_fpr[rB(ctx->opcode)]);
-    gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,
-                    cpu_fpr[rD(ctx->opcode)]);
-    gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);
+    get_fpr(t0, rB(ctx->opcode));
+    gen_helper_fsqrt(t1, cpu_env, t0);
+    set_fpr(rD(ctx->opcode), t1);
+    gen_helper_frsp(t1, cpu_env, t1);
+    gen_compute_fprf_float64(t1);
     if (unlikely(Rc(ctx->opcode) != 0)) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /***                     Floating-Point multiply-and-add                   ***/
@@ -268,21 +338,32 @@ GEN_FLOAT_B(rim, 0x08, 0x0F, 1, PPC_FLOAT_EXT);
 
 static void gen_ftdiv(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    gen_helper_ftdiv(cpu_crf[crfD(ctx->opcode)], cpu_fpr[rA(ctx->opcode)],
-                     cpu_fpr[rB(ctx->opcode)]);
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    get_fpr(t0, rA(ctx->opcode));
+    get_fpr(t1, rB(ctx->opcode));
+    gen_helper_ftdiv(cpu_crf[crfD(ctx->opcode)], t0, t1);
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 static void gen_ftsqrt(DisasContext *ctx)
 {
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    gen_helper_ftsqrt(cpu_crf[crfD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);
+    t0 = tcg_temp_new_i64();
+    get_fpr(t0, rB(ctx->opcode));
+    gen_helper_ftsqrt(cpu_crf[crfD(ctx->opcode)], t0);
+    tcg_temp_free_i64(t0);
 }
 
 
@@ -293,32 +374,46 @@ static void gen_ftsqrt(DisasContext *ctx)
 static void gen_fcmpo(DisasContext *ctx)
 {
     TCGv_i32 crf;
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
     gen_reset_fpstatus();
     crf = tcg_const_i32(crfD(ctx->opcode));
-    gen_helper_fcmpo(cpu_env, cpu_fpr[rA(ctx->opcode)],
-                     cpu_fpr[rB(ctx->opcode)], crf);
+    get_fpr(t0, rA(ctx->opcode));
+    get_fpr(t1, rB(ctx->opcode));
+    gen_helper_fcmpo(cpu_env, t0, t1, crf);
     tcg_temp_free_i32(crf);
     gen_helper_float_check_status(cpu_env);
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* fcmpu */
 static void gen_fcmpu(DisasContext *ctx)
 {
     TCGv_i32 crf;
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
     gen_reset_fpstatus();
     crf = tcg_const_i32(crfD(ctx->opcode));
-    gen_helper_fcmpu(cpu_env, cpu_fpr[rA(ctx->opcode)],
-                     cpu_fpr[rB(ctx->opcode)], crf);
+    get_fpr(t0, rA(ctx->opcode));
+    get_fpr(t1, rB(ctx->opcode));
+    gen_helper_fcmpu(cpu_env, t0, t1, crf);
     tcg_temp_free_i32(crf);
     gen_helper_float_check_status(cpu_env);
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /***                         Floating-point move                           ***/
@@ -326,100 +421,153 @@ static void gen_fcmpu(DisasContext *ctx)
 /* XXX: beware that fabs never checks for NaNs nor update FPSCR */
 static void gen_fabs(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    tcg_gen_andi_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)],
-                     ~(1ULL << 63));
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    get_fpr(t0, rB(ctx->opcode));
+    tcg_gen_andi_i64(t1, t0, ~(1ULL << 63));
+    set_fpr(rD(ctx->opcode), t1);
     if (unlikely(Rc(ctx->opcode))) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* fmr  - fmr. */
 /* XXX: beware that fmr never checks for NaNs nor update FPSCR */
 static void gen_fmr(DisasContext *ctx)
 {
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    tcg_gen_mov_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);
+    t0 = tcg_temp_new_i64();
+    get_fpr(t0, rB(ctx->opcode));
+    set_fpr(rD(ctx->opcode), t0);
     if (unlikely(Rc(ctx->opcode))) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
 }
 
 /* fnabs */
 /* XXX: beware that fnabs never checks for NaNs nor update FPSCR */
 static void gen_fnabs(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    tcg_gen_ori_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)],
-                    1ULL << 63);
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    get_fpr(t0, rB(ctx->opcode));
+    tcg_gen_ori_i64(t1, t0, 1ULL << 63);
+    set_fpr(rD(ctx->opcode), t1);
     if (unlikely(Rc(ctx->opcode))) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* fneg */
 /* XXX: beware that fneg never checks for NaNs nor update FPSCR */
 static void gen_fneg(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    tcg_gen_xori_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)],
-                     1ULL << 63);
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    get_fpr(t0, rB(ctx->opcode));
+    tcg_gen_xori_i64(t1, t0, 1ULL << 63);
+    set_fpr(rD(ctx->opcode), t1);
     if (unlikely(Rc(ctx->opcode))) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* fcpsgn: PowerPC 2.05 specification */
 /* XXX: beware that fcpsgn never checks for NaNs nor update FPSCR */
 static void gen_fcpsgn(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
+    TCGv_i64 t2;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    tcg_gen_deposit_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rA(ctx->opcode)],
-                        cpu_fpr[rB(ctx->opcode)], 0, 63);
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    t2 = tcg_temp_new_i64();
+    get_fpr(t0, rA(ctx->opcode));
+    get_fpr(t1, rB(ctx->opcode));
+    tcg_gen_deposit_i64(t2, t0, t1, 0, 63);
+    set_fpr(rD(ctx->opcode), t2);
     if (unlikely(Rc(ctx->opcode))) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
 }
 
 static void gen_fmrgew(DisasContext *ctx)
 {
     TCGv_i64 b0;
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
     b0 = tcg_temp_new_i64();
-    tcg_gen_shri_i64(b0, cpu_fpr[rB(ctx->opcode)], 32);
-    tcg_gen_deposit_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rA(ctx->opcode)],
-                        b0, 0, 32);
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    get_fpr(t0, rB(ctx->opcode));
+    tcg_gen_shri_i64(b0, t0, 32);
+    get_fpr(t0, rA(ctx->opcode));
+    tcg_gen_deposit_i64(t1, t0, b0, 0, 32);
+    set_fpr(rD(ctx->opcode), t1);
     tcg_temp_free_i64(b0);
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 static void gen_fmrgow(DisasContext *ctx)
 {
+    TCGv_i64 t0;
+    TCGv_i64 t1;
+    TCGv_i64 t2;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
-    tcg_gen_deposit_i64(cpu_fpr[rD(ctx->opcode)],
-                        cpu_fpr[rB(ctx->opcode)],
-                        cpu_fpr[rA(ctx->opcode)],
-                        32, 32);
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    t2 = tcg_temp_new_i64();
+    get_fpr(t0, rB(ctx->opcode));
+    get_fpr(t1, rA(ctx->opcode));
+    tcg_gen_deposit_i64(t2, t0, t1, 32, 32);
+    set_fpr(rD(ctx->opcode), t2);
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(t2);
 }
 
 /***                  Floating-Point status & ctrl register                ***/
@@ -458,15 +606,19 @@ static void gen_mcrfs(DisasContext *ctx)
 /* mffs */
 static void gen_mffs(DisasContext *ctx)
 {
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
+    t0 = tcg_temp_new_i64();
     gen_reset_fpstatus();
-    tcg_gen_extu_tl_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpscr);
+    tcg_gen_extu_tl_i64(t0, cpu_fpscr);
+    set_fpr(rD(ctx->opcode), t0);
     if (unlikely(Rc(ctx->opcode))) {
         gen_set_cr1_from_fpscr(ctx);
     }
+    tcg_temp_free_i64(t0);
 }
 
 /* mtfsb0 */
@@ -522,6 +674,7 @@ static void gen_mtfsb1(DisasContext *ctx)
 static void gen_mtfsf(DisasContext *ctx)
 {
     TCGv_i32 t0;
+    TCGv_i64 t1;
     int flm, l, w;
 
     if (unlikely(!ctx->fpu_enabled)) {
@@ -541,7 +694,9 @@ static void gen_mtfsf(DisasContext *ctx)
     } else {
         t0 = tcg_const_i32(flm << (w * 8));
     }
-    gen_helper_store_fpscr(cpu_env, cpu_fpr[rB(ctx->opcode)], t0);
+    t1 = tcg_temp_new_i64();
+    get_fpr(t1, rB(ctx->opcode));
+    gen_helper_store_fpscr(cpu_env, t1, t0);
     tcg_temp_free_i32(t0);
     if (unlikely(Rc(ctx->opcode) != 0)) {
         tcg_gen_trunc_tl_i32(cpu_crf[1], cpu_fpscr);
@@ -549,6 +704,7 @@ static void gen_mtfsf(DisasContext *ctx)
     }
     /* We can raise a differed exception */
     gen_helper_float_check_status(cpu_env);
+    tcg_temp_free_i64(t1);
 }
 
 /* mtfsfi */
@@ -588,21 +744,26 @@ static void gen_mtfsfi(DisasContext *ctx)
 static void glue(gen_, name)(DisasContext *ctx)                                       \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_temp_new_i64();                                                  \
     gen_addr_imm_index(ctx, EA, 0);                                           \
-    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
+    gen_qemu_##ldop(ctx, t0, EA);                                             \
+    set_fpr(rD(ctx->opcode), t0);                                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_LDUF(name, ldop, opc, type)                                       \
 static void glue(gen_, name##u)(DisasContext *ctx)                                    \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
@@ -613,20 +774,25 @@ static void glue(gen_, name##u)(DisasContext *ctx)
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_temp_new_i64();                                                  \
     gen_addr_imm_index(ctx, EA, 0);                                           \
-    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
+    gen_qemu_##ldop(ctx, t0, EA);                                             \
+    set_fpr(rD(ctx->opcode), t0);                                             \
     tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_LDUXF(name, ldop, opc, type)                                      \
 static void glue(gen_, name##ux)(DisasContext *ctx)                                   \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
+    t0 = tcg_temp_new_i64();                                                  \
     if (unlikely(rA(ctx->opcode) == 0)) {                                     \
         gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);                   \
         return;                                                               \
@@ -634,24 +800,30 @@ static void glue(gen_, name##ux)(DisasContext *ctx)
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
     gen_addr_reg_index(ctx, EA);                                              \
-    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
+    gen_qemu_##ldop(ctx, t0, EA);                                             \
+    set_fpr(rD(ctx->opcode), t0);                                             \
     tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_LDXF(name, ldop, opc2, opc3, type)                                \
 static void glue(gen_, name##x)(DisasContext *ctx)                                    \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_temp_new_i64();                                                  \
     gen_addr_reg_index(ctx, EA);                                              \
-    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
+    gen_qemu_##ldop(ctx, t0, EA);                                             \
+    set_fpr(rD(ctx->opcode), t0);                                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_LDFS(name, ldop, op, type)                                        \
@@ -677,6 +849,7 @@ GEN_LDFS(lfs, ld32fs, 0x10, PPC_FLOAT);
 static void gen_lfdepx(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     CHK_SV;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
@@ -684,16 +857,19 @@ static void gen_lfdepx(DisasContext *ctx)
     }
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
+    t0 = tcg_temp_new_i64();
     gen_addr_reg_index(ctx, EA);
-    tcg_gen_qemu_ld_i64(cpu_fpr[rD(ctx->opcode)], EA, PPC_TLB_EPID_LOAD,
-        DEF_MEMOP(MO_Q));
+    tcg_gen_qemu_ld_i64(t0, EA, PPC_TLB_EPID_LOAD, DEF_MEMOP(MO_Q));
+    set_fpr(rD(ctx->opcode), t0);
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 /* lfdp */
 static void gen_lfdp(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
@@ -701,24 +877,31 @@ static void gen_lfdp(DisasContext *ctx)
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
     gen_addr_imm_index(ctx, EA, 0);
+    t0 = tcg_temp_new_i64();
     /* We only need to swap high and low halves. gen_qemu_ld64_i64 does
        necessary 64-bit byteswap already. */
     if (unlikely(ctx->le_mode)) {
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode) + 1, t0);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode), t0);
     } else {
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode), t0);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode) + 1, t0);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 /* lfdpx */
 static void gen_lfdpx(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
@@ -726,18 +909,24 @@ static void gen_lfdpx(DisasContext *ctx)
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
     gen_addr_reg_index(ctx, EA);
+    t0 = tcg_temp_new_i64();
     /* We only need to swap high and low halves. gen_qemu_ld64_i64 does
        necessary 64-bit byteswap already. */
     if (unlikely(ctx->le_mode)) {
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode) + 1, t0);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode), t0);
     } else {
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode), t0);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        gen_qemu_ld64_i64(ctx, t0, EA);
+        set_fpr(rD(ctx->opcode) + 1, t0);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 /* lfiwax */
@@ -745,6 +934,7 @@ static void gen_lfiwax(DisasContext *ctx)
 {
     TCGv EA;
     TCGv t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
@@ -752,47 +942,59 @@ static void gen_lfiwax(DisasContext *ctx)
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
     t0 = tcg_temp_new();
+    t1 = tcg_temp_new_i64();
     gen_addr_reg_index(ctx, EA);
     gen_qemu_ld32s(ctx, t0, EA);
-    tcg_gen_ext_tl_i64(cpu_fpr[rD(ctx->opcode)], t0);
+    tcg_gen_ext_tl_i64(t1, t0);
+    set_fpr(rD(ctx->opcode), t1);
     tcg_temp_free(EA);
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* lfiwzx */
 static void gen_lfiwzx(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
+    t0 = tcg_temp_new_i64();
     gen_addr_reg_index(ctx, EA);
-    gen_qemu_ld32u_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+    gen_qemu_ld32u_i64(ctx, t0, EA);
+    set_fpr(rD(ctx->opcode), t0);
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 /***                         Floating-point store                          ***/
 #define GEN_STF(name, stop, opc, type)                                        \
 static void glue(gen_, name)(DisasContext *ctx)                                       \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_temp_new_i64();                                                  \
     gen_addr_imm_index(ctx, EA, 0);                                           \
-    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
+    get_fpr(t0, rS(ctx->opcode));                                             \
+    gen_qemu_##stop(ctx, t0, EA);                                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_STUF(name, stop, opc, type)                                       \
 static void glue(gen_, name##u)(DisasContext *ctx)                                    \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
@@ -803,16 +1005,20 @@ static void glue(gen_, name##u)(DisasContext *ctx)
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_temp_new_i64();                                                  \
     gen_addr_imm_index(ctx, EA, 0);                                           \
-    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
+    get_fpr(t0, rS(ctx->opcode));                                             \
+    gen_qemu_##stop(ctx, t0, EA);                                             \
     tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_STUXF(name, stop, opc, type)                                      \
 static void glue(gen_, name##ux)(DisasContext *ctx)                                   \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
@@ -823,25 +1029,32 @@ static void glue(gen_, name##ux)(DisasContext *ctx)
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_temp_new_i64();                                                  \
     gen_addr_reg_index(ctx, EA);                                              \
-    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
+    get_fpr(t0, rS(ctx->opcode));                                             \
+    gen_qemu_##stop(ctx, t0, EA);                                             \
     tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_STXF(name, stop, opc2, opc3, type)                                \
 static void glue(gen_, name##x)(DisasContext *ctx)                                    \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 t0;                                                              \
     if (unlikely(!ctx->fpu_enabled)) {                                        \
         gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_temp_new_i64();                                                  \
     gen_addr_reg_index(ctx, EA);                                              \
-    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
+    get_fpr(t0, rS(ctx->opcode));                                             \
+    gen_qemu_##stop(ctx, t0, EA);                                             \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(t0);                                                    \
 }
 
 #define GEN_STFS(name, stop, op, type)                                        \
@@ -867,6 +1080,7 @@ GEN_STFS(stfs, st32fs, 0x14, PPC_FLOAT);
 static void gen_stfdepx(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     CHK_SV;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
@@ -874,60 +1088,76 @@ static void gen_stfdepx(DisasContext *ctx)
     }
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
+    t0 = tcg_temp_new_i64();
     gen_addr_reg_index(ctx, EA);
-    tcg_gen_qemu_st_i64(cpu_fpr[rD(ctx->opcode)], EA, PPC_TLB_EPID_STORE,
-                       DEF_MEMOP(MO_Q));
+    get_fpr(t0, rD(ctx->opcode));
+    tcg_gen_qemu_st_i64(t0, EA, PPC_TLB_EPID_STORE, DEF_MEMOP(MO_Q));
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 /* stfdp */
 static void gen_stfdp(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
+    t0 = tcg_temp_new_i64();
     gen_addr_imm_index(ctx, EA, 0);
     /* We only need to swap high and low halves. gen_qemu_st64_i64 does
        necessary 64-bit byteswap already. */
     if (unlikely(ctx->le_mode)) {
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        get_fpr(t0, rD(ctx->opcode) + 1);
+        gen_qemu_st64_i64(ctx, t0, EA);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        get_fpr(t0, rD(ctx->opcode));
+        gen_qemu_st64_i64(ctx, t0, EA);
     } else {
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        get_fpr(t0, rD(ctx->opcode));
+        gen_qemu_st64_i64(ctx, t0, EA);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        get_fpr(t0, rD(ctx->opcode) + 1);
+        gen_qemu_st64_i64(ctx, t0, EA);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 /* stfdpx */
 static void gen_stfdpx(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     if (unlikely(!ctx->fpu_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_FPU);
         return;
     }
     gen_set_access_type(ctx, ACCESS_FLOAT);
     EA = tcg_temp_new();
+    t0 = tcg_temp_new_i64();
     gen_addr_reg_index(ctx, EA);
     /* We only need to swap high and low halves. gen_qemu_st64_i64 does
        necessary 64-bit byteswap already. */
     if (unlikely(ctx->le_mode)) {
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        get_fpr(t0, rD(ctx->opcode) + 1);
+        gen_qemu_st64_i64(ctx, t0, EA);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        get_fpr(t0, rD(ctx->opcode));
+        gen_qemu_st64_i64(ctx, t0, EA);
     } else {
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
+        get_fpr(t0, rD(ctx->opcode));
+        gen_qemu_st64_i64(ctx, t0, EA);
         tcg_gen_addi_tl(EA, EA, 8);
-        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
+        get_fpr(t0, rD(ctx->opcode) + 1);
+        gen_qemu_st64_i64(ctx, t0, EA);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 /* Optional: */
@@ -949,13 +1179,18 @@ static void gen_lfq(DisasContext *ctx)
 {
     int rd = rD(ctx->opcode);
     TCGv t0;
+    TCGv_i64 t1;
     gen_set_access_type(ctx, ACCESS_FLOAT);
     t0 = tcg_temp_new();
+    t1 = tcg_temp_new_i64();
     gen_addr_imm_index(ctx, t0, 0);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
+    gen_qemu_ld64_i64(ctx, t1, t0);
+    set_fpr(rd, t1);
     gen_addr_add(ctx, t0, t0, 8);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
+    gen_qemu_ld64_i64(ctx, t1, t0);
+    set_fpr((rd + 1) % 32, t1);
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* lfqu */
@@ -964,17 +1199,22 @@ static void gen_lfqu(DisasContext *ctx)
     int ra = rA(ctx->opcode);
     int rd = rD(ctx->opcode);
     TCGv t0, t1;
+    TCGv_i64 t2;
     gen_set_access_type(ctx, ACCESS_FLOAT);
     t0 = tcg_temp_new();
     t1 = tcg_temp_new();
+    t2 = tcg_temp_new_i64();
     gen_addr_imm_index(ctx, t0, 0);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
+    gen_qemu_ld64_i64(ctx, t2, t0);
+    set_fpr(rd, t2);
     gen_addr_add(ctx, t1, t0, 8);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
+    gen_qemu_ld64_i64(ctx, t2, t1);
+    set_fpr((rd + 1) % 32, t2);
     if (ra != 0)
         tcg_gen_mov_tl(cpu_gpr[ra], t0);
     tcg_temp_free(t0);
     tcg_temp_free(t1);
+    tcg_temp_free_i64(t2);
 }
 
 /* lfqux */
@@ -984,16 +1224,21 @@ static void gen_lfqux(DisasContext *ctx)
     int rd = rD(ctx->opcode);
     gen_set_access_type(ctx, ACCESS_FLOAT);
     TCGv t0, t1;
+    TCGv_i64 t2;
+    t2 = tcg_temp_new_i64();
     t0 = tcg_temp_new();
     gen_addr_reg_index(ctx, t0);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
+    gen_qemu_ld64_i64(ctx, t2, t0);
+    set_fpr(rd, t2);
     t1 = tcg_temp_new();
     gen_addr_add(ctx, t1, t0, 8);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
+    gen_qemu_ld64_i64(ctx, t2, t1);
+    set_fpr((rd + 1) % 32, t2);
     tcg_temp_free(t1);
     if (ra != 0)
         tcg_gen_mov_tl(cpu_gpr[ra], t0);
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t2);
 }
 
 /* lfqx */
@@ -1001,13 +1246,18 @@ static void gen_lfqx(DisasContext *ctx)
 {
     int rd = rD(ctx->opcode);
     TCGv t0;
+    TCGv_i64 t1;
     gen_set_access_type(ctx, ACCESS_FLOAT);
     t0 = tcg_temp_new();
+    t1 = tcg_temp_new_i64();
     gen_addr_reg_index(ctx, t0);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
+    gen_qemu_ld64_i64(ctx, t1, t0);
+    set_fpr(rd, t1);
     gen_addr_add(ctx, t0, t0, 8);
-    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
+    gen_qemu_ld64_i64(ctx, t1, t0);
+    set_fpr((rd + 1) % 32, t1);
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* stfq */
@@ -1015,13 +1265,18 @@ static void gen_stfq(DisasContext *ctx)
 {
     int rd = rD(ctx->opcode);
     TCGv t0;
+    TCGv_i64 t1;
     gen_set_access_type(ctx, ACCESS_FLOAT);
     t0 = tcg_temp_new();
+    t1 = tcg_temp_new_i64();
     gen_addr_imm_index(ctx, t0, 0);
-    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
+    get_fpr(t1, rd);
+    gen_qemu_st64_i64(ctx, t1, t0);
     gen_addr_add(ctx, t0, t0, 8);
-    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
+    get_fpr(t1, (rd + 1) % 32);
+    gen_qemu_st64_i64(ctx, t1, t0);
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t1);
 }
 
 /* stfqu */
@@ -1030,17 +1285,23 @@ static void gen_stfqu(DisasContext *ctx)
     int ra = rA(ctx->opcode);
     int rd = rD(ctx->opcode);
     TCGv t0, t1;
+    TCGv_i64 t2;
     gen_set_access_type(ctx, ACCESS_FLOAT);
+    t2 = tcg_temp_new_i64();
     t0 = tcg_temp_new();
     gen_addr_imm_index(ctx, t0, 0);
-    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
+    get_fpr(t2, rd);
+    gen_qemu_st64_i64(ctx, t2, t0);
     t1 = tcg_temp_new();
     gen_addr_add(ctx, t1, t0, 8);
-    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
+    get_fpr(t2, (rd + 1) % 32);
+    gen_qemu_st64_i64(ctx, t2, t1);
     tcg_temp_free(t1);
-    if (ra != 0)
+    if (ra != 0) {
         tcg_gen_mov_tl(cpu_gpr[ra], t0);
+    }
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t2);
 }
 
 /* stfqux */
@@ -1049,17 +1310,23 @@ static void gen_stfqux(DisasContext *ctx)
     int ra = rA(ctx->opcode);
     int rd = rD(ctx->opcode);
     TCGv t0, t1;
+    TCGv_i64 t2;
     gen_set_access_type(ctx, ACCESS_FLOAT);
+    t2 = tcg_temp_new_i64();
     t0 = tcg_temp_new();
     gen_addr_reg_index(ctx, t0);
-    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
+    get_fpr(t2, rd);
+    gen_qemu_st64_i64(ctx, t2, t0);
     t1 = tcg_temp_new();
     gen_addr_add(ctx, t1, t0, 8);
-    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
+    get_fpr(t2, (rd + 1) % 32);
+    gen_qemu_st64_i64(ctx, t2, t1);
     tcg_temp_free(t1);
-    if (ra != 0)
+    if (ra != 0) {
         tcg_gen_mov_tl(cpu_gpr[ra], t0);
+    }
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t2);
 }
 
 /* stfqx */
@@ -1067,13 +1334,18 @@ static void gen_stfqx(DisasContext *ctx)
 {
     int rd = rD(ctx->opcode);
     TCGv t0;
+    TCGv_i64 t1;
     gen_set_access_type(ctx, ACCESS_FLOAT);
+    t1 = tcg_temp_new_i64();
     t0 = tcg_temp_new();
     gen_addr_reg_index(ctx, t0);
-    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
+    get_fpr(t1, rd);
+    gen_qemu_st64_i64(ctx, t1, t0);
     gen_addr_add(ctx, t0, t0, 8);
-    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
+    get_fpr(t1, (rd + 1) % 32);
+    gen_qemu_st64_i64(ctx, t1, t0);
     tcg_temp_free(t0);
+    tcg_temp_free_i64(t1);
 }
 
 #undef _GEN_FLOAT_ACB
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 2/6] target/ppc: introduce get_avr64() and set_avr64() helpers for VMX register access
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access Mark Cave-Ayland
@ 2018-12-07  8:56 ` Mark Cave-Ayland
  2018-12-10 18:49   ` Richard Henderson
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 3/6] target/ppc: introduce get_cpu_vsr{l, h}() and set_cpu_vsr{l, h}() helpers for VSR " Mark Cave-Ayland
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-07  8:56 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, david, richard.henderson

These helpers allow us to move AVR register values to/from the specified TCGv_i64
argument.

To prevent VMX helpers accessing the cpu_avr{l,h} arrays directly, add extra TCG
temporaries as required.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/translate.c              |  10 +++
 target/ppc/translate/vmx-impl.inc.c | 130 ++++++++++++++++++++++++++++--------
 2 files changed, 111 insertions(+), 29 deletions(-)

diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 1d4bf624a3..fa3e8dc114 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -6704,6 +6704,16 @@ static inline void set_fpr(int regno, TCGv_i64 src)
     tcg_gen_mov_i64(cpu_fpr[regno], src);
 }
 
+static inline void get_avr64(TCGv_i64 dst, int regno, bool high)
+{
+    tcg_gen_mov_i64(dst, (high ? cpu_avrh : cpu_avrl)[regno]);
+}
+
+static inline void set_avr64(int regno, TCGv_i64 src, bool high)
+{
+    tcg_gen_mov_i64((high ? cpu_avrh : cpu_avrl)[regno], src);
+}
+
 #include "translate/fp-impl.inc.c"
 
 #include "translate/vmx-impl.inc.c"
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 3cb6fc2926..30046c6e31 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -18,52 +18,66 @@ static inline TCGv_ptr gen_avr_ptr(int reg)
 static void glue(gen_, name)(DisasContext *ctx)                                       \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 avr;                                                             \
     if (unlikely(!ctx->altivec_enabled)) {                                    \
         gen_exception(ctx, POWERPC_EXCP_VPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_INT);                                     \
+    avr = tcg_temp_new_i64();                                                 \
     EA = tcg_temp_new();                                                      \
     gen_addr_reg_index(ctx, EA);                                              \
     tcg_gen_andi_tl(EA, EA, ~0xf);                                            \
     /* We only need to swap high and low halves. gen_qemu_ld64_i64 does       \
        necessary 64-bit byteswap already. */                                  \
     if (ctx->le_mode) {                                                       \
-        gen_qemu_ld64_i64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                \
+        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
+        set_avr64(rD(ctx->opcode), avr, false);                               \
         tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_ld64_i64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                \
+        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
+        set_avr64(rD(ctx->opcode), avr, true);                                \
     } else {                                                                  \
-        gen_qemu_ld64_i64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                \
+        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
+        set_avr64(rD(ctx->opcode), avr, true);                                \
         tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_ld64_i64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                \
+        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
+        set_avr64(rD(ctx->opcode), avr, false);                               \
     }                                                                         \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(avr);                                                   \
 }
 
 #define GEN_VR_STX(name, opc2, opc3)                                          \
 static void gen_st##name(DisasContext *ctx)                                   \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i64 avr;                                                             \
     if (unlikely(!ctx->altivec_enabled)) {                                    \
         gen_exception(ctx, POWERPC_EXCP_VPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_INT);                                     \
+    avr = tcg_temp_new_i64();                                                 \
     EA = tcg_temp_new();                                                      \
     gen_addr_reg_index(ctx, EA);                                              \
     tcg_gen_andi_tl(EA, EA, ~0xf);                                            \
     /* We only need to swap high and low halves. gen_qemu_st64_i64 does       \
        necessary 64-bit byteswap already. */                                  \
     if (ctx->le_mode) {                                                       \
-        gen_qemu_st64_i64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                \
+        get_avr64(avr, rD(ctx->opcode), false);                               \
+        gen_qemu_st64_i64(ctx, avr, EA);                                      \
         tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_st64_i64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                \
+        get_avr64(avr, rD(ctx->opcode), true);                                \
+        gen_qemu_st64_i64(ctx, avr, EA);                                      \
     } else {                                                                  \
-        gen_qemu_st64_i64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                \
+        get_avr64(avr, rD(ctx->opcode), true);                                \
+        gen_qemu_st64_i64(ctx, avr, EA);                                      \
         tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_st64_i64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                \
+        get_avr64(avr, rD(ctx->opcode), false);                               \
+        gen_qemu_st64_i64(ctx, avr, EA);                                      \
     }                                                                         \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i64(avr);                                                   \
 }
 
 #define GEN_VR_LVE(name, opc2, opc3, size)                              \
@@ -159,15 +173,20 @@ static void gen_lvsr(DisasContext *ctx)
 static void gen_mfvscr(DisasContext *ctx)
 {
     TCGv_i32 t;
+    TCGv_i64 avr;
     if (unlikely(!ctx->altivec_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VPU);
         return;
     }
-    tcg_gen_movi_i64(cpu_avrh[rD(ctx->opcode)], 0);
+    avr = tcg_temp_new_i64();
+    tcg_gen_movi_i64(avr, 0);
+    set_avr64(rD(ctx->opcode), avr, true);
     t = tcg_temp_new_i32();
     tcg_gen_ld_i32(t, cpu_env, offsetof(CPUPPCState, vscr));
-    tcg_gen_extu_i32_i64(cpu_avrl[rD(ctx->opcode)], t);
+    tcg_gen_extu_i32_i64(avr, t);
+    set_avr64(rD(ctx->opcode), avr, false);
     tcg_temp_free_i32(t);
+    tcg_temp_free_i64(avr);
 }
 
 static void gen_mtvscr(DisasContext *ctx)
@@ -188,6 +207,7 @@ static void glue(gen_, name)(DisasContext *ctx)                         \
     TCGv_i64 t0 = tcg_temp_new_i64();                                   \
     TCGv_i64 t1 = tcg_temp_new_i64();                                   \
     TCGv_i64 t2 = tcg_temp_new_i64();                                   \
+    TCGv_i64 avr = tcg_temp_new_i64();                                  \
     TCGv_i64 ten, z;                                                    \
                                                                         \
     if (unlikely(!ctx->altivec_enabled)) {                              \
@@ -199,26 +219,35 @@ static void glue(gen_, name)(DisasContext *ctx)                         \
     z = tcg_const_i64(0);                                               \
                                                                         \
     if (add_cin) {                                                      \
-        tcg_gen_mulu2_i64(t0, t1, cpu_avrl[rA(ctx->opcode)], ten);      \
-        tcg_gen_andi_i64(t2, cpu_avrl[rB(ctx->opcode)], 0xF);           \
-        tcg_gen_add2_i64(cpu_avrl[rD(ctx->opcode)], t2, t0, t1, t2, z); \
+        get_avr64(avr, rA(ctx->opcode), false);                         \
+        tcg_gen_mulu2_i64(t0, t1, avr, ten);                            \
+        get_avr64(avr, rB(ctx->opcode), false);                         \
+        tcg_gen_andi_i64(t2, avr, 0xF);                                 \
+        tcg_gen_add2_i64(avr, t2, t0, t1, t2, z);                       \
+        set_avr64(rD(ctx->opcode), avr, false);                         \
     } else {                                                            \
-        tcg_gen_mulu2_i64(cpu_avrl[rD(ctx->opcode)], t2,                \
-                          cpu_avrl[rA(ctx->opcode)], ten);              \
+        get_avr64(avr, rA(ctx->opcode), false);                         \
+        tcg_gen_mulu2_i64(avr, t2, avr, ten);                           \
+        set_avr64(rD(ctx->opcode), avr, false);                         \
     }                                                                   \
                                                                         \
     if (ret_carry) {                                                    \
-        tcg_gen_mulu2_i64(t0, t1, cpu_avrh[rA(ctx->opcode)], ten);      \
-        tcg_gen_add2_i64(t0, cpu_avrl[rD(ctx->opcode)], t0, t1, t2, z); \
-        tcg_gen_movi_i64(cpu_avrh[rD(ctx->opcode)], 0);                 \
+        get_avr64(avr, rA(ctx->opcode), true);                          \
+        tcg_gen_mulu2_i64(t0, t1, avr, ten);                            \
+        tcg_gen_add2_i64(t0, avr, t0, t1, t2, z);                       \
+        set_avr64(rD(ctx->opcode), avr, false);                         \
+        set_avr64(rD(ctx->opcode), z, true);                            \
     } else {                                                            \
-        tcg_gen_mul_i64(t0, cpu_avrh[rA(ctx->opcode)], ten);            \
-        tcg_gen_add_i64(cpu_avrh[rD(ctx->opcode)], t0, t2);             \
+        get_avr64(avr, rA(ctx->opcode), true);                          \
+        tcg_gen_mul_i64(t0, avr, ten);                                  \
+        tcg_gen_add_i64(avr, t0, t2);                                   \
+        set_avr64(rD(ctx->opcode), avr, true);                          \
     }                                                                   \
                                                                         \
     tcg_temp_free_i64(t0);                                              \
     tcg_temp_free_i64(t1);                                              \
     tcg_temp_free_i64(t2);                                              \
+    tcg_temp_free_i64(avr);                                             \
     tcg_temp_free_i64(ten);                                             \
     tcg_temp_free_i64(z);                                               \
 }                                                                       \
@@ -232,12 +261,27 @@ GEN_VX_VMUL10(vmul10ecuq, 1, 1);
 #define GEN_VX_LOGICAL(name, tcg_op, opc2, opc3)                        \
 static void glue(gen_, name)(DisasContext *ctx)                                 \
 {                                                                       \
+    TCGv_i64 t0 = tcg_temp_new_i64();                                   \
+    TCGv_i64 t1 = tcg_temp_new_i64();                                   \
+    TCGv_i64 avr = tcg_temp_new_i64();                                  \
+                                                                        \
     if (unlikely(!ctx->altivec_enabled)) {                              \
         gen_exception(ctx, POWERPC_EXCP_VPU);                           \
         return;                                                         \
     }                                                                   \
-    tcg_op(cpu_avrh[rD(ctx->opcode)], cpu_avrh[rA(ctx->opcode)], cpu_avrh[rB(ctx->opcode)]); \
-    tcg_op(cpu_avrl[rD(ctx->opcode)], cpu_avrl[rA(ctx->opcode)], cpu_avrl[rB(ctx->opcode)]); \
+    get_avr64(t0, rA(ctx->opcode), true);                               \
+    get_avr64(t1, rB(ctx->opcode), true);                               \
+    tcg_op(avr, t0, t1);                                                \
+    set_avr64(rD(ctx->opcode), avr, true);                              \
+                                                                        \
+    get_avr64(t0, rA(ctx->opcode), false);                              \
+    get_avr64(t1, rB(ctx->opcode), false);                              \
+    tcg_op(avr, t0, t1);                                                \
+    set_avr64(rD(ctx->opcode), avr, false);                             \
+                                                                        \
+    tcg_temp_free_i64(t0);                                              \
+    tcg_temp_free_i64(t1);                                              \
+    tcg_temp_free_i64(avr);                                             \
 }
 
 GEN_VX_LOGICAL(vand, tcg_gen_and_i64, 2, 16);
@@ -406,6 +450,7 @@ GEN_VXFORM(vmrglw, 6, 6);
 static void gen_vmrgew(DisasContext *ctx)
 {
     TCGv_i64 tmp;
+    TCGv_i64 avr;
     int VT, VA, VB;
     if (unlikely(!ctx->altivec_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VPU);
@@ -415,15 +460,28 @@ static void gen_vmrgew(DisasContext *ctx)
     VA = rA(ctx->opcode);
     VB = rB(ctx->opcode);
     tmp = tcg_temp_new_i64();
-    tcg_gen_shri_i64(tmp, cpu_avrh[VB], 32);
-    tcg_gen_deposit_i64(cpu_avrh[VT], cpu_avrh[VA], tmp, 0, 32);
-    tcg_gen_shri_i64(tmp, cpu_avrl[VB], 32);
-    tcg_gen_deposit_i64(cpu_avrl[VT], cpu_avrl[VA], tmp, 0, 32);
+    avr = tcg_temp_new_i64();
+
+    get_avr64(avr, VB, true);
+    tcg_gen_shri_i64(tmp, avr, 32);
+    get_avr64(avr, VA, true);
+    tcg_gen_deposit_i64(avr, avr, tmp, 0, 32);
+    set_avr64(VT, avr, true);
+
+    get_avr64(avr, VB, false);
+    tcg_gen_shri_i64(tmp, avr, 32);
+    get_avr64(avr, VA, false);
+    tcg_gen_deposit_i64(avr, avr, tmp, 0, 32);
+    set_avr64(VT, avr, false);
+
     tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(avr);
 }
 
 static void gen_vmrgow(DisasContext *ctx)
 {
+    TCGv_i64 t0, t1;
+    TCGv_i64 avr;
     int VT, VA, VB;
     if (unlikely(!ctx->altivec_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VPU);
@@ -432,9 +490,23 @@ static void gen_vmrgow(DisasContext *ctx)
     VT = rD(ctx->opcode);
     VA = rA(ctx->opcode);
     VB = rB(ctx->opcode);
-
-    tcg_gen_deposit_i64(cpu_avrh[VT], cpu_avrh[VB], cpu_avrh[VA], 32, 32);
-    tcg_gen_deposit_i64(cpu_avrl[VT], cpu_avrl[VB], cpu_avrl[VA], 32, 32);
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
+    avr = tcg_temp_new_i64();
+
+    get_avr64(t0, VB, true);
+    get_avr64(t1, VA, true);
+    tcg_gen_deposit_i64(avr, t0, t1, 32, 32);
+    set_avr64(VT, avr, true);
+
+    get_avr64(t0, VB, false);
+    get_avr64(t1, VA, false);
+    tcg_gen_deposit_i64(avr, t0, t1, 32, 32);
+    set_avr64(VT, avr, false);
+
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
+    tcg_temp_free_i64(avr);
 }
 
 GEN_VXFORM(vmuloub, 4, 0);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 3/6] target/ppc: introduce get_cpu_vsr{l, h}() and set_cpu_vsr{l, h}() helpers for VSR register access
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access Mark Cave-Ayland
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 2/6] target/ppc: introduce get_avr64() and set_avr64() helpers for VMX " Mark Cave-Ayland
@ 2018-12-07  8:56 ` Mark Cave-Ayland
  2018-12-10 19:16   ` Richard Henderson
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env Mark Cave-Ayland
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-07  8:56 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, david, richard.henderson

These helpers allow us to move VSR register values to/from the specified TCGv_i64
argument.

To prevent VSX helpers accessing the cpu_vsr array directly, add extra TCG
temporaries as required.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/translate/vsx-impl.inc.c | 782 ++++++++++++++++++++++++++----------
 1 file changed, 561 insertions(+), 221 deletions(-)

diff --git a/target/ppc/translate/vsx-impl.inc.c b/target/ppc/translate/vsx-impl.inc.c
index 85ed135d44..20e1fd9324 100644
--- a/target/ppc/translate/vsx-impl.inc.c
+++ b/target/ppc/translate/vsx-impl.inc.c
@@ -1,20 +1,48 @@
 /***                           VSX extension                               ***/
 
-static inline TCGv_i64 cpu_vsrh(int n)
+static inline void get_vsr(TCGv_i64 dst, int n)
+{
+    tcg_gen_ld_i64(dst, cpu_env, offsetof(CPUPPCState, vsr[n]));
+}
+
+static inline void set_vsr(int n, TCGv_i64 src)
+{
+    tcg_gen_st_i64(src, cpu_env, offsetof(CPUPPCState, vsr[n]));
+}
+
+static inline void get_cpu_vsrh(TCGv_i64 dst, int n)
+{
+    if (n < 32) {
+        get_fpr(dst, n);
+    } else {
+        get_avr64(dst, n - 32, true);
+    }
+}
+
+static inline void get_cpu_vsrl(TCGv_i64 dst, int n)
 {
     if (n < 32) {
-        return cpu_fpr[n];
+        get_vsr(dst, n);
     } else {
-        return cpu_avrh[n-32];
+        get_avr64(dst, n - 32, false);
     }
 }
 
-static inline TCGv_i64 cpu_vsrl(int n)
+static inline void set_cpu_vsrh(int n, TCGv_i64 src)
 {
     if (n < 32) {
-        return cpu_vsr[n];
+        set_fpr(n, src);
     } else {
-        return cpu_avrl[n-32];
+        set_avr64(n - 32, src, true);
+    }
+}
+
+static inline void set_cpu_vsrl(int n, TCGv_i64 src)
+{
+    if (n < 32) {
+        set_vsr(n, src);
+    } else {
+        set_avr64(n - 32, src, false);
     }
 }
 
@@ -22,16 +50,20 @@ static inline TCGv_i64 cpu_vsrl(int n)
 static void gen_##name(DisasContext *ctx)                     \
 {                                                             \
     TCGv EA;                                                  \
+    TCGv_i64 t0;                                              \
     if (unlikely(!ctx->vsx_enabled)) {                        \
         gen_exception(ctx, POWERPC_EXCP_VSXU);                \
         return;                                               \
     }                                                         \
+    t0 = tcg_temp_new_i64();                                  \
     gen_set_access_type(ctx, ACCESS_INT);                     \
     EA = tcg_temp_new();                                      \
     gen_addr_reg_index(ctx, EA);                              \
-    gen_qemu_##operation(ctx, cpu_vsrh(xT(ctx->opcode)), EA); \
+    gen_qemu_##operation(ctx, t0, EA);                        \
+    set_cpu_vsrh(xT(ctx->opcode), t0);                        \
     /* NOTE: cpu_vsrl is undefined */                         \
     tcg_temp_free(EA);                                        \
+    tcg_temp_free_i64(t0);                                    \
 }
 
 VSX_LOAD_SCALAR(lxsdx, ld64_i64)
@@ -44,39 +76,54 @@ VSX_LOAD_SCALAR(lxsspx, ld32fs)
 static void gen_lxvd2x(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
+    t0 = tcg_temp_new_i64();
     gen_set_access_type(ctx, ACCESS_INT);
     EA = tcg_temp_new();
     gen_addr_reg_index(ctx, EA);
-    gen_qemu_ld64_i64(ctx, cpu_vsrh(xT(ctx->opcode)), EA);
+    gen_qemu_ld64_i64(ctx, t0, EA);
+    set_cpu_vsrh(xT(ctx->opcode), t0);
     tcg_gen_addi_tl(EA, EA, 8);
-    gen_qemu_ld64_i64(ctx, cpu_vsrl(xT(ctx->opcode)), EA);
+    gen_qemu_ld64_i64(ctx, t0, EA);
+    set_cpu_vsrl(xT(ctx->opcode), t0);
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 static void gen_lxvdsx(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0;
+    TCGv_i64 t1;
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
+    t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
     gen_set_access_type(ctx, ACCESS_INT);
     EA = tcg_temp_new();
     gen_addr_reg_index(ctx, EA);
-    gen_qemu_ld64_i64(ctx, cpu_vsrh(xT(ctx->opcode)), EA);
-    tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_vsrh(xT(ctx->opcode)));
+    gen_qemu_ld64_i64(ctx, t0, EA);
+    set_cpu_vsrh(xT(ctx->opcode), t0);
+    tcg_gen_mov_i64(t1, t0);
+    set_cpu_vsrl(xT(ctx->opcode), t1);
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
 }
 
 static void gen_lxvw4x(DisasContext *ctx)
 {
     TCGv EA;
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+    get_cpu_vsrh(xth, xT(ctx->opcode));
+    get_cpu_vsrh(xtl, xT(ctx->opcode));
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
@@ -104,6 +151,8 @@ static void gen_lxvw4x(DisasContext *ctx)
         tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
 }
 
 static void gen_bswap16x8(TCGv_i64 outh, TCGv_i64 outl,
@@ -151,8 +200,10 @@ static void gen_bswap32x4(TCGv_i64 outh, TCGv_i64 outl,
 static void gen_lxvh8x(DisasContext *ctx)
 {
     TCGv EA;
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+    get_cpu_vsrh(xth, xT(ctx->opcode));
+    get_cpu_vsrh(xtl, xT(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -169,13 +220,17 @@ static void gen_lxvh8x(DisasContext *ctx)
         gen_bswap16x8(xth, xtl, xth, xtl);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
 }
 
 static void gen_lxvb16x(DisasContext *ctx)
 {
     TCGv EA;
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+    get_cpu_vsrh(xth, xT(ctx->opcode));
+    get_cpu_vsrh(xtl, xT(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -188,6 +243,8 @@ static void gen_lxvb16x(DisasContext *ctx)
     tcg_gen_addi_tl(EA, EA, 8);
     tcg_gen_qemu_ld_i64(xtl, EA, ctx->mem_idx, MO_BEQ);
     tcg_temp_free(EA);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
 }
 
 #define VSX_VECTOR_LOAD_STORE(name, op, indexed)            \
@@ -195,15 +252,16 @@ static void gen_##name(DisasContext *ctx)                   \
 {                                                           \
     int xt;                                                 \
     TCGv EA;                                                \
-    TCGv_i64 xth, xtl;                                      \
+    TCGv_i64 xth = tcg_temp_new_i64();                      \
+    TCGv_i64 xtl = tcg_temp_new_i64();                      \
                                                             \
     if (indexed) {                                          \
         xt = xT(ctx->opcode);                               \
     } else {                                                \
         xt = DQxT(ctx->opcode);                             \
     }                                                       \
-    xth = cpu_vsrh(xt);                                     \
-    xtl = cpu_vsrl(xt);                                     \
+    get_cpu_vsrh(xth, xt);                                  \
+    get_cpu_vsrl(xtl, xt);                                  \
                                                             \
     if (xt < 32) {                                          \
         if (unlikely(!ctx->vsx_enabled)) {                  \
@@ -225,14 +283,20 @@ static void gen_##name(DisasContext *ctx)                   \
     }                                                       \
     if (ctx->le_mode) {                                     \
         tcg_gen_qemu_##op(xtl, EA, ctx->mem_idx, MO_LEQ);   \
+        set_cpu_vsrl(xt, xtl);                              \
         tcg_gen_addi_tl(EA, EA, 8);                         \
         tcg_gen_qemu_##op(xth, EA, ctx->mem_idx, MO_LEQ);   \
+        set_cpu_vsrh(xt, xth);                              \
     } else {                                                \
         tcg_gen_qemu_##op(xth, EA, ctx->mem_idx, MO_BEQ);   \
+        set_cpu_vsrh(xt, xth);                              \
         tcg_gen_addi_tl(EA, EA, 8);                         \
         tcg_gen_qemu_##op(xtl, EA, ctx->mem_idx, MO_BEQ);   \
+        set_cpu_vsrl(xt, xtl);                              \
     }                                                       \
     tcg_temp_free(EA);                                      \
+    tcg_temp_free_i64(xth);                                 \
+    tcg_temp_free_i64(xtl);                                 \
 }
 
 VSX_VECTOR_LOAD_STORE(lxv, ld_i64, 0)
@@ -276,7 +340,8 @@ VSX_VECTOR_LOAD_STORE_LENGTH(stxvll)
 static void gen_##name(DisasContext *ctx)                         \
 {                                                                 \
     TCGv EA;                                                      \
-    TCGv_i64 xth = cpu_vsrh(rD(ctx->opcode) + 32);                \
+    TCGv_i64 xth = tcg_temp_new_i64();                            \
+    get_cpu_vsrh(xth, rD(ctx->opcode) + 32);                      \
                                                                   \
     if (unlikely(!ctx->altivec_enabled)) {                        \
         gen_exception(ctx, POWERPC_EXCP_VPU);                     \
@@ -286,8 +351,10 @@ static void gen_##name(DisasContext *ctx)                         \
     EA = tcg_temp_new();                                          \
     gen_addr_imm_index(ctx, EA, 0x03);                            \
     gen_qemu_##operation(ctx, xth, EA);                           \
+    set_cpu_vsrh(rD(ctx->opcode) + 32, xth);                      \
     /* NOTE: cpu_vsrl is undefined */                             \
     tcg_temp_free(EA);                                            \
+    tcg_temp_free_i64(xth);                                       \
 }
 
 VSX_LOAD_SCALAR_DS(lxsd, ld64_i64)
@@ -297,15 +364,19 @@ VSX_LOAD_SCALAR_DS(lxssp, ld32fs)
 static void gen_##name(DisasContext *ctx)                     \
 {                                                             \
     TCGv EA;                                                  \
+    TCGv_i64 t0;                                              \
     if (unlikely(!ctx->vsx_enabled)) {                        \
         gen_exception(ctx, POWERPC_EXCP_VSXU);                \
         return;                                               \
     }                                                         \
+    t0 = tcg_temp_new_i64();                                  \
     gen_set_access_type(ctx, ACCESS_INT);                     \
     EA = tcg_temp_new();                                      \
     gen_addr_reg_index(ctx, EA);                              \
-    gen_qemu_##operation(ctx, cpu_vsrh(xS(ctx->opcode)), EA); \
+    gen_qemu_##operation(ctx, t0, EA);                        \
+    set_cpu_vsrh(xS(ctx->opcode), t0);                        \
     tcg_temp_free(EA);                                        \
+    tcg_temp_free_i64(t0);                                    \
 }
 
 VSX_STORE_SCALAR(stxsdx, st64_i64)
@@ -318,6 +389,7 @@ VSX_STORE_SCALAR(stxsspx, st32fs)
 static void gen_stxvd2x(DisasContext *ctx)
 {
     TCGv EA;
+    TCGv_i64 t0 = tcg_temp_new_i64();
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
@@ -325,17 +397,23 @@ static void gen_stxvd2x(DisasContext *ctx)
     gen_set_access_type(ctx, ACCESS_INT);
     EA = tcg_temp_new();
     gen_addr_reg_index(ctx, EA);
-    gen_qemu_st64_i64(ctx, cpu_vsrh(xS(ctx->opcode)), EA);
+    get_cpu_vsrh(t0, xS(ctx->opcode));
+    gen_qemu_st64_i64(ctx, t0, EA);
     tcg_gen_addi_tl(EA, EA, 8);
-    gen_qemu_st64_i64(ctx, cpu_vsrl(xS(ctx->opcode)), EA);
+    get_cpu_vsrl(t0, xS(ctx->opcode));
+    gen_qemu_st64_i64(ctx, t0, EA);
     tcg_temp_free(EA);
+    tcg_temp_free_i64(t0);
 }
 
 static void gen_stxvw4x(DisasContext *ctx)
 {
-    TCGv_i64 xsh = cpu_vsrh(xS(ctx->opcode));
-    TCGv_i64 xsl = cpu_vsrl(xS(ctx->opcode));
     TCGv EA;
+    TCGv_i64 xsh = tcg_temp_new_i64();
+    TCGv_i64 xsl = tcg_temp_new_i64();
+    get_cpu_vsrh(xsh, xS(ctx->opcode));
+    get_cpu_vsrl(xsl, xS(ctx->opcode));
+
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
@@ -362,13 +440,17 @@ static void gen_stxvw4x(DisasContext *ctx)
         tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(xsh);
+    tcg_temp_free_i64(xsl);
 }
 
 static void gen_stxvh8x(DisasContext *ctx)
 {
-    TCGv_i64 xsh = cpu_vsrh(xS(ctx->opcode));
-    TCGv_i64 xsl = cpu_vsrl(xS(ctx->opcode));
     TCGv EA;
+    TCGv_i64 xsh = tcg_temp_new_i64();
+    TCGv_i64 xsl = tcg_temp_new_i64();
+    get_cpu_vsrh(xsh, xS(ctx->opcode));
+    get_cpu_vsrl(xsl, xS(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -393,13 +475,17 @@ static void gen_stxvh8x(DisasContext *ctx)
         tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);
     }
     tcg_temp_free(EA);
+    tcg_temp_free_i64(xsh);
+    tcg_temp_free_i64(xsl);
 }
 
 static void gen_stxvb16x(DisasContext *ctx)
 {
-    TCGv_i64 xsh = cpu_vsrh(xS(ctx->opcode));
-    TCGv_i64 xsl = cpu_vsrl(xS(ctx->opcode));
     TCGv EA;
+    TCGv_i64 xsh = tcg_temp_new_i64();
+    TCGv_i64 xsl = tcg_temp_new_i64();
+    get_cpu_vsrh(xsh, xS(ctx->opcode));
+    get_cpu_vsrl(xsl, xS(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -412,13 +498,16 @@ static void gen_stxvb16x(DisasContext *ctx)
     tcg_gen_addi_tl(EA, EA, 8);
     tcg_gen_qemu_st_i64(xsl, EA, ctx->mem_idx, MO_BEQ);
     tcg_temp_free(EA);
+    tcg_temp_free_i64(xsh);
+    tcg_temp_free_i64(xsl);
 }
 
 #define VSX_STORE_SCALAR_DS(name, operation)                      \
 static void gen_##name(DisasContext *ctx)                         \
 {                                                                 \
     TCGv EA;                                                      \
-    TCGv_i64 xth = cpu_vsrh(rD(ctx->opcode) + 32);                \
+    TCGv_i64 xth = tcg_temp_new_i64();                            \
+    get_cpu_vsrh(xth, rD(ctx->opcode) + 32);                      \
                                                                   \
     if (unlikely(!ctx->altivec_enabled)) {                        \
         gen_exception(ctx, POWERPC_EXCP_VPU);                     \
@@ -430,62 +519,119 @@ static void gen_##name(DisasContext *ctx)                         \
     gen_qemu_##operation(ctx, xth, EA);                           \
     /* NOTE: cpu_vsrl is undefined */                             \
     tcg_temp_free(EA);                                            \
+    tcg_temp_free_i64(xth);                                       \
 }
 
 VSX_LOAD_SCALAR_DS(stxsd, st64_i64)
 VSX_LOAD_SCALAR_DS(stxssp, st32fs)
 
-#define MV_VSRW(name, tcgop1, tcgop2, target, source)           \
-static void gen_##name(DisasContext *ctx)                       \
-{                                                               \
-    if (xS(ctx->opcode) < 32) {                                 \
-        if (unlikely(!ctx->fpu_enabled)) {                      \
-            gen_exception(ctx, POWERPC_EXCP_FPU);               \
-            return;                                             \
-        }                                                       \
-    } else {                                                    \
-        if (unlikely(!ctx->altivec_enabled)) {                  \
-            gen_exception(ctx, POWERPC_EXCP_VPU);               \
-            return;                                             \
-        }                                                       \
-    }                                                           \
-    TCGv_i64 tmp = tcg_temp_new_i64();                          \
-    tcg_gen_##tcgop1(tmp, source);                              \
-    tcg_gen_##tcgop2(target, tmp);                              \
-    tcg_temp_free_i64(tmp);                                     \
+static void gen_mfvsrwz(DisasContext *ctx)
+{
+    if (xS(ctx->opcode) < 32) {
+        if (unlikely(!ctx->fpu_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_FPU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 xsh = tcg_temp_new_i64();
+    get_cpu_vsrh(xsh, xS(ctx->opcode));
+    tcg_gen_ext32u_i64(tmp, xsh);
+    tcg_gen_trunc_i64_tl(cpu_gpr[rA(ctx->opcode)], tmp);
+    tcg_temp_free_i64(tmp);
 }
 
+static void gen_mtvsrwa(DisasContext *ctx)
+{
+    if (xS(ctx->opcode) < 32) {
+        if (unlikely(!ctx->fpu_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_FPU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 xsh = tcg_temp_new_i64();
+    tcg_gen_extu_tl_i64(tmp, cpu_gpr[rA(ctx->opcode)]);
+    tcg_gen_ext32s_i64(xsh, tmp);
+    set_cpu_vsrh(xT(ctx->opcode), xsh);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(xsh);
+}
 
-MV_VSRW(mfvsrwz, ext32u_i64, trunc_i64_tl, cpu_gpr[rA(ctx->opcode)], \
-        cpu_vsrh(xS(ctx->opcode)))
-MV_VSRW(mtvsrwa, extu_tl_i64, ext32s_i64, cpu_vsrh(xT(ctx->opcode)), \
-        cpu_gpr[rA(ctx->opcode)])
-MV_VSRW(mtvsrwz, extu_tl_i64, ext32u_i64, cpu_vsrh(xT(ctx->opcode)), \
-        cpu_gpr[rA(ctx->opcode)])
+static void gen_mtvsrwz(DisasContext *ctx)
+{
+    if (xS(ctx->opcode) < 32) {
+        if (unlikely(!ctx->fpu_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_FPU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+    TCGv_i64 tmp = tcg_temp_new_i64();
+    TCGv_i64 xsh = tcg_temp_new_i64();
+    tcg_gen_extu_tl_i64(tmp, cpu_gpr[rA(ctx->opcode)]);
+    tcg_gen_ext32u_i64(xsh, tmp);
+    set_cpu_vsrh(xT(ctx->opcode), xsh);
+    tcg_temp_free_i64(tmp);
+    tcg_temp_free_i64(xsh);
+}
 
 #if defined(TARGET_PPC64)
-#define MV_VSRD(name, target, source)                           \
-static void gen_##name(DisasContext *ctx)                       \
-{                                                               \
-    if (xS(ctx->opcode) < 32) {                                 \
-        if (unlikely(!ctx->fpu_enabled)) {                      \
-            gen_exception(ctx, POWERPC_EXCP_FPU);               \
-            return;                                             \
-        }                                                       \
-    } else {                                                    \
-        if (unlikely(!ctx->altivec_enabled)) {                  \
-            gen_exception(ctx, POWERPC_EXCP_VPU);               \
-            return;                                             \
-        }                                                       \
-    }                                                           \
-    tcg_gen_mov_i64(target, source);                            \
+static void gen_mfvsrd(DisasContext *ctx)
+{
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    if (xS(ctx->opcode) < 32) {
+        if (unlikely(!ctx->fpu_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_FPU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+    get_cpu_vsrh(t0, xS(ctx->opcode));
+    tcg_gen_mov_i64(cpu_gpr[rA(ctx->opcode)], t0);
+    tcg_temp_free_i64(t0);
 }
 
-MV_VSRD(mfvsrd, cpu_gpr[rA(ctx->opcode)], cpu_vsrh(xS(ctx->opcode)))
-MV_VSRD(mtvsrd, cpu_vsrh(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)])
+static void gen_mtvsrd(DisasContext *ctx)
+{
+    TCGv_i64 t0 = tcg_temp_new_i64();
+    if (xS(ctx->opcode) < 32) {
+        if (unlikely(!ctx->fpu_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_FPU);
+            return;
+        }
+    } else {
+        if (unlikely(!ctx->altivec_enabled)) {
+            gen_exception(ctx, POWERPC_EXCP_VPU);
+            return;
+        }
+    }
+    tcg_gen_mov_i64(t0, cpu_gpr[rA(ctx->opcode)]);
+    set_cpu_vsrh(xT(ctx->opcode), t0);
+    tcg_temp_free_i64(t0);
+}
 
 static void gen_mfvsrld(DisasContext *ctx)
 {
+    TCGv_i64 t0 = tcg_temp_new_i64();
     if (xS(ctx->opcode) < 32) {
         if (unlikely(!ctx->vsx_enabled)) {
             gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -497,12 +643,14 @@ static void gen_mfvsrld(DisasContext *ctx)
             return;
         }
     }
-
-    tcg_gen_mov_i64(cpu_gpr[rA(ctx->opcode)], cpu_vsrl(xS(ctx->opcode)));
+    get_cpu_vsrl(t0, xS(ctx->opcode));
+    tcg_gen_mov_i64(cpu_gpr[rA(ctx->opcode)], t0);
+    tcg_temp_free_i64(t0);
 }
 
 static void gen_mtvsrdd(DisasContext *ctx)
 {
+    TCGv_i64 t0 = tcg_temp_new_i64();
     if (xT(ctx->opcode) < 32) {
         if (unlikely(!ctx->vsx_enabled)) {
             gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -516,16 +664,20 @@ static void gen_mtvsrdd(DisasContext *ctx)
     }
 
     if (!rA(ctx->opcode)) {
-        tcg_gen_movi_i64(cpu_vsrh(xT(ctx->opcode)), 0);
+        tcg_gen_movi_i64(t0, 0);
     } else {
-        tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)]);
+        tcg_gen_mov_i64(t0, cpu_gpr[rA(ctx->opcode)]);
     }
+    set_cpu_vsrh(xT(ctx->opcode), t0);
 
-    tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_gpr[rB(ctx->opcode)]);
+    tcg_gen_mov_i64(t0, cpu_gpr[rB(ctx->opcode)]);
+    set_cpu_vsrl(xT(ctx->opcode), t0);
+    tcg_temp_free_i64(t0);
 }
 
 static void gen_mtvsrws(DisasContext *ctx)
 {
+    TCGv_i64 t0 = tcg_temp_new_i64();
     if (xT(ctx->opcode) < 32) {
         if (unlikely(!ctx->vsx_enabled)) {
             gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -538,55 +690,60 @@ static void gen_mtvsrws(DisasContext *ctx)
         }
     }
 
-    tcg_gen_deposit_i64(cpu_vsrl(xT(ctx->opcode)), cpu_gpr[rA(ctx->opcode)],
+    tcg_gen_deposit_i64(t0, cpu_gpr[rA(ctx->opcode)],
                         cpu_gpr[rA(ctx->opcode)], 32, 32);
-    tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_vsrl(xT(ctx->opcode)));
+    set_cpu_vsrl(xT(ctx->opcode), t0);
+    set_cpu_vsrh(xT(ctx->opcode), t0);
+    tcg_temp_free_i64(t0);
 }
 
 #endif
 
 static void gen_xxpermdi(DisasContext *ctx)
 {
+    TCGv_i64 xh, xl;
+
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
 
+    xh = tcg_temp_new_i64();
+    xl = tcg_temp_new_i64();
+
     if (unlikely((xT(ctx->opcode) == xA(ctx->opcode)) ||
                  (xT(ctx->opcode) == xB(ctx->opcode)))) {
-        TCGv_i64 xh, xl;
-
-        xh = tcg_temp_new_i64();
-        xl = tcg_temp_new_i64();
-
         if ((DM(ctx->opcode) & 2) == 0) {
-            tcg_gen_mov_i64(xh, cpu_vsrh(xA(ctx->opcode)));
+            get_cpu_vsrh(xh, xA(ctx->opcode));
         } else {
-            tcg_gen_mov_i64(xh, cpu_vsrl(xA(ctx->opcode)));
+            get_cpu_vsrl(xh, xA(ctx->opcode));
         }
         if ((DM(ctx->opcode) & 1) == 0) {
-            tcg_gen_mov_i64(xl, cpu_vsrh(xB(ctx->opcode)));
+            get_cpu_vsrh(xl, xB(ctx->opcode));
         } else {
-            tcg_gen_mov_i64(xl, cpu_vsrl(xB(ctx->opcode)));
+            get_cpu_vsrl(xl, xB(ctx->opcode));
         }
 
-        tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), xh);
-        tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), xl);
-
-        tcg_temp_free_i64(xh);
-        tcg_temp_free_i64(xl);
+        set_cpu_vsrh(xT(ctx->opcode), xh);
+        set_cpu_vsrl(xT(ctx->opcode), xl);
     } else {
         if ((DM(ctx->opcode) & 2) == 0) {
-            tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_vsrh(xA(ctx->opcode)));
+            get_cpu_vsrh(xh, xA(ctx->opcode));
+            set_cpu_vsrh(xT(ctx->opcode), xh);
         } else {
-            tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), cpu_vsrl(xA(ctx->opcode)));
+            get_cpu_vsrl(xh, xA(ctx->opcode));
+            set_cpu_vsrh(xT(ctx->opcode), xh);
         }
         if ((DM(ctx->opcode) & 1) == 0) {
-            tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_vsrh(xB(ctx->opcode)));
+            get_cpu_vsrh(xl, xB(ctx->opcode));
+            set_cpu_vsrl(xT(ctx->opcode), xl);
         } else {
-            tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_vsrl(xB(ctx->opcode)));
+            get_cpu_vsrl(xl, xB(ctx->opcode));
+            set_cpu_vsrl(xT(ctx->opcode), xl);
         }
     }
+    tcg_temp_free_i64(xh);
+    tcg_temp_free_i64(xl);
 }
 
 #define OP_ABS 1
@@ -606,7 +763,7 @@ static void glue(gen_, name)(DisasContext * ctx)                  \
         }                                                         \
         xb = tcg_temp_new_i64();                                  \
         sgm = tcg_temp_new_i64();                                 \
-        tcg_gen_mov_i64(xb, cpu_vsrh(xB(ctx->opcode)));           \
+        get_cpu_vsrh(xb, xB(ctx->opcode));                        \
         tcg_gen_movi_i64(sgm, sgn_mask);                          \
         switch (op) {                                             \
             case OP_ABS: {                                        \
@@ -623,7 +780,7 @@ static void glue(gen_, name)(DisasContext * ctx)                  \
             }                                                     \
             case OP_CPSGN: {                                      \
                 TCGv_i64 xa = tcg_temp_new_i64();                 \
-                tcg_gen_mov_i64(xa, cpu_vsrh(xA(ctx->opcode)));   \
+                get_cpu_vsrh(xa, xA(ctx->opcode));                \
                 tcg_gen_and_i64(xa, xa, sgm);                     \
                 tcg_gen_andc_i64(xb, xb, sgm);                    \
                 tcg_gen_or_i64(xb, xb, xa);                       \
@@ -631,7 +788,7 @@ static void glue(gen_, name)(DisasContext * ctx)                  \
                 break;                                            \
             }                                                     \
         }                                                         \
-        tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), xb);           \
+        set_cpu_vsrh(xT(ctx->opcode), xb);                        \
         tcg_temp_free_i64(xb);                                    \
         tcg_temp_free_i64(sgm);                                   \
     }
@@ -647,7 +804,7 @@ static void glue(gen_, name)(DisasContext *ctx)                   \
     int xa;                                                       \
     int xt = rD(ctx->opcode) + 32;                                \
     int xb = rB(ctx->opcode) + 32;                                \
-    TCGv_i64 xah, xbh, xbl, sgm;                                  \
+    TCGv_i64 xah, xbh, xbl, sgm, tmp;                             \
                                                                   \
     if (unlikely(!ctx->vsx_enabled)) {                            \
         gen_exception(ctx, POWERPC_EXCP_VSXU);                    \
@@ -656,8 +813,9 @@ static void glue(gen_, name)(DisasContext *ctx)                   \
     xbh = tcg_temp_new_i64();                                     \
     xbl = tcg_temp_new_i64();                                     \
     sgm = tcg_temp_new_i64();                                     \
-    tcg_gen_mov_i64(xbh, cpu_vsrh(xb));                           \
-    tcg_gen_mov_i64(xbl, cpu_vsrl(xb));                           \
+    tmp = tcg_temp_new_i64();                                     \
+    get_cpu_vsrh(xbh, xb);                                        \
+    get_cpu_vsrl(xbl, xb);                                        \
     tcg_gen_movi_i64(sgm, sgn_mask);                              \
     switch (op) {                                                 \
     case OP_ABS:                                                  \
@@ -672,17 +830,19 @@ static void glue(gen_, name)(DisasContext *ctx)                   \
     case OP_CPSGN:                                                \
         xah = tcg_temp_new_i64();                                 \
         xa = rA(ctx->opcode) + 32;                                \
-        tcg_gen_and_i64(xah, cpu_vsrh(xa), sgm);                  \
+        get_cpu_vsrh(tmp, xa);                                    \
+        tcg_gen_and_i64(xah, tmp, sgm);                           \
         tcg_gen_andc_i64(xbh, xbh, sgm);                          \
         tcg_gen_or_i64(xbh, xbh, xah);                            \
         tcg_temp_free_i64(xah);                                   \
         break;                                                    \
     }                                                             \
-    tcg_gen_mov_i64(cpu_vsrh(xt), xbh);                           \
-    tcg_gen_mov_i64(cpu_vsrl(xt), xbl);                           \
+    set_cpu_vsrh(xt, xbh);                                        \
+    set_cpu_vsrl(xt, xbl);                                        \
     tcg_temp_free_i64(xbl);                                       \
     tcg_temp_free_i64(xbh);                                       \
     tcg_temp_free_i64(sgm);                                       \
+    tcg_temp_free_i64(tmp);                                       \
 }
 
 VSX_SCALAR_MOVE_QP(xsabsqp, OP_ABS, SGN_MASK_DP)
@@ -701,8 +861,8 @@ static void glue(gen_, name)(DisasContext * ctx)                 \
         xbh = tcg_temp_new_i64();                                \
         xbl = tcg_temp_new_i64();                                \
         sgm = tcg_temp_new_i64();                                \
-        tcg_gen_mov_i64(xbh, cpu_vsrh(xB(ctx->opcode)));         \
-        tcg_gen_mov_i64(xbl, cpu_vsrl(xB(ctx->opcode)));         \
+        set_cpu_vsrh(xB(ctx->opcode), xbh);                      \
+        set_cpu_vsrl(xB(ctx->opcode), xbl);                      \
         tcg_gen_movi_i64(sgm, sgn_mask);                         \
         switch (op) {                                            \
             case OP_ABS: {                                       \
@@ -723,8 +883,8 @@ static void glue(gen_, name)(DisasContext * ctx)                 \
             case OP_CPSGN: {                                     \
                 TCGv_i64 xah = tcg_temp_new_i64();               \
                 TCGv_i64 xal = tcg_temp_new_i64();               \
-                tcg_gen_mov_i64(xah, cpu_vsrh(xA(ctx->opcode))); \
-                tcg_gen_mov_i64(xal, cpu_vsrl(xA(ctx->opcode))); \
+                get_cpu_vsrh(xah, xA(ctx->opcode));              \
+                get_cpu_vsrl(xal, xA(ctx->opcode));              \
                 tcg_gen_and_i64(xah, xah, sgm);                  \
                 tcg_gen_and_i64(xal, xal, sgm);                  \
                 tcg_gen_andc_i64(xbh, xbh, sgm);                 \
@@ -736,8 +896,8 @@ static void glue(gen_, name)(DisasContext * ctx)                 \
                 break;                                           \
             }                                                    \
         }                                                        \
-        tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), xbh);         \
-        tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), xbl);         \
+        set_cpu_vsrh(xT(ctx->opcode), xbh);                      \
+        set_cpu_vsrl(xT(ctx->opcode), xbl);                      \
         tcg_temp_free_i64(xbh);                                  \
         tcg_temp_free_i64(xbl);                                  \
         tcg_temp_free_i64(sgm);                                  \
@@ -768,12 +928,17 @@ static void gen_##name(DisasContext * ctx)                                    \
 #define GEN_VSX_HELPER_XT_XB_ENV(name, op1, op2, inval, type) \
 static void gen_##name(DisasContext * ctx)                    \
 {                                                             \
+    TCGv_i64 t0 = tcg_temp_new_i64();                         \
+    TCGv_i64 t1 = tcg_temp_new_i64();                         \
     if (unlikely(!ctx->vsx_enabled)) {                        \
         gen_exception(ctx, POWERPC_EXCP_VSXU);                \
         return;                                               \
     }                                                         \
-    gen_helper_##name(cpu_vsrh(xT(ctx->opcode)), cpu_env,     \
-                      cpu_vsrh(xB(ctx->opcode)));             \
+    get_cpu_vsrh(t0, xB(ctx->opcode));                        \
+    gen_helper_##name(t1, cpu_env, t0);                       \
+    set_cpu_vsrh(xT(ctx->opcode), t1);                        \
+    tcg_temp_free_i64(t0);                                    \
+    tcg_temp_free_i64(t1);                                    \
 }
 
 GEN_VSX_HELPER_2(xsadddp, 0x00, 0x04, 0, PPC2_VSX)
@@ -949,10 +1114,13 @@ GEN_VSX_HELPER_2(xxpermr, 0x08, 0x07, 0, PPC2_ISA300)
 
 static void gen_xxbrd(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -960,28 +1128,49 @@ static void gen_xxbrd(DisasContext *ctx)
     }
     tcg_gen_bswap64_i64(xth, xbh);
     tcg_gen_bswap64_i64(xtl, xbl);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
+
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 static void gen_xxbrh(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
     gen_bswap16x8(xth, xtl, xbh, xbl);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
+
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 static void gen_xxbrq(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
+
     TCGv_i64 t0 = tcg_temp_new_i64();
 
     if (unlikely(!ctx->vsx_enabled)) {
@@ -990,35 +1179,65 @@ static void gen_xxbrq(DisasContext *ctx)
     }
     tcg_gen_bswap64_i64(t0, xbl);
     tcg_gen_bswap64_i64(xtl, xbh);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
     tcg_gen_mov_i64(xth, t0);
+    set_cpu_vsrl(xT(ctx->opcode), xth);
+
     tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 static void gen_xxbrw(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
     gen_bswap32x4(xth, xtl, xbh, xbl);
+    set_cpu_vsrl(xT(ctx->opcode), xth);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
+
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 #define VSX_LOGICAL(name, tcg_op)                                    \
 static void glue(gen_, name)(DisasContext * ctx)                     \
     {                                                                \
+        TCGv_i64 t0;                                                 \
+        TCGv_i64 t1;                                                 \
+        TCGv_i64 t2;                                                 \
         if (unlikely(!ctx->vsx_enabled)) {                           \
             gen_exception(ctx, POWERPC_EXCP_VSXU);                   \
             return;                                                  \
         }                                                            \
-        tcg_op(cpu_vsrh(xT(ctx->opcode)), cpu_vsrh(xA(ctx->opcode)), \
-            cpu_vsrh(xB(ctx->opcode)));                              \
-        tcg_op(cpu_vsrl(xT(ctx->opcode)), cpu_vsrl(xA(ctx->opcode)), \
-            cpu_vsrl(xB(ctx->opcode)));                              \
+        t0 = tcg_temp_new_i64();                                     \
+        t1 = tcg_temp_new_i64();                                     \
+        t2 = tcg_temp_new_i64();                                     \
+        get_cpu_vsrh(t0, xA(ctx->opcode));                           \
+        get_cpu_vsrh(t1, xB(ctx->opcode));                           \
+        tcg_op(t2, t0, t1);                                          \
+        set_cpu_vsrh(xT(ctx->opcode), t2);                           \
+        get_cpu_vsrl(t0, xA(ctx->opcode));                           \
+        get_cpu_vsrl(t1, xB(ctx->opcode));                           \
+        tcg_op(t2, t0, t1);                                          \
+        set_cpu_vsrl(xT(ctx->opcode), t2);                           \
+        tcg_temp_free_i64(t0);                                       \
+        tcg_temp_free_i64(t1);                                       \
+        tcg_temp_free_i64(t2);                                       \
     }
 
 VSX_LOGICAL(xxland, tcg_gen_and_i64)
@@ -1033,7 +1252,7 @@ VSX_LOGICAL(xxlorc, tcg_gen_orc_i64)
 #define VSX_XXMRG(name, high)                               \
 static void glue(gen_, name)(DisasContext * ctx)            \
     {                                                       \
-        TCGv_i64 a0, a1, b0, b1;                            \
+        TCGv_i64 a0, a1, b0, b1, tmp;                       \
         if (unlikely(!ctx->vsx_enabled)) {                  \
             gen_exception(ctx, POWERPC_EXCP_VSXU);          \
             return;                                         \
@@ -1042,27 +1261,29 @@ static void glue(gen_, name)(DisasContext * ctx)            \
         a1 = tcg_temp_new_i64();                            \
         b0 = tcg_temp_new_i64();                            \
         b1 = tcg_temp_new_i64();                            \
+        tmp = tcg_temp_new_i64();                           \
         if (high) {                                         \
-            tcg_gen_mov_i64(a0, cpu_vsrh(xA(ctx->opcode))); \
-            tcg_gen_mov_i64(a1, cpu_vsrh(xA(ctx->opcode))); \
-            tcg_gen_mov_i64(b0, cpu_vsrh(xB(ctx->opcode))); \
-            tcg_gen_mov_i64(b1, cpu_vsrh(xB(ctx->opcode))); \
+            get_cpu_vsrh(a0, xA(ctx->opcode));              \
+            get_cpu_vsrh(a1, xA(ctx->opcode));              \
+            get_cpu_vsrh(b0, xB(ctx->opcode));              \
+            get_cpu_vsrh(b1, xB(ctx->opcode));              \
         } else {                                            \
-            tcg_gen_mov_i64(a0, cpu_vsrl(xA(ctx->opcode))); \
-            tcg_gen_mov_i64(a1, cpu_vsrl(xA(ctx->opcode))); \
-            tcg_gen_mov_i64(b0, cpu_vsrl(xB(ctx->opcode))); \
-            tcg_gen_mov_i64(b1, cpu_vsrl(xB(ctx->opcode))); \
+            get_cpu_vsrl(a0, xA(ctx->opcode));              \
+            get_cpu_vsrl(a1, xA(ctx->opcode));              \
+            get_cpu_vsrl(b0, xB(ctx->opcode));              \
+            get_cpu_vsrl(b1, xB(ctx->opcode));              \
         }                                                   \
         tcg_gen_shri_i64(a0, a0, 32);                       \
         tcg_gen_shri_i64(b0, b0, 32);                       \
-        tcg_gen_deposit_i64(cpu_vsrh(xT(ctx->opcode)),      \
-                            b0, a0, 32, 32);                \
-        tcg_gen_deposit_i64(cpu_vsrl(xT(ctx->opcode)),      \
-                            b1, a1, 32, 32);                \
+        tcg_gen_deposit_i64(tmp, b0, a0, 32, 32);           \
+        set_cpu_vsrh(xT(ctx->opcode), tmp);                 \
+        tcg_gen_deposit_i64(tmp, b1, a1, 32, 32);           \
+        set_cpu_vsrl(xT(ctx->opcode), tmp);                 \
         tcg_temp_free_i64(a0);                              \
         tcg_temp_free_i64(a1);                              \
         tcg_temp_free_i64(b0);                              \
         tcg_temp_free_i64(b1);                              \
+        tcg_temp_free_i64(tmp);                             \
     }
 
 VSX_XXMRG(xxmrghw, 1)
@@ -1070,7 +1291,7 @@ VSX_XXMRG(xxmrglw, 0)
 
 static void gen_xxsel(DisasContext * ctx)
 {
-    TCGv_i64 a, b, c;
+    TCGv_i64 a, b, c, tmp;
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
@@ -1078,34 +1299,43 @@ static void gen_xxsel(DisasContext * ctx)
     a = tcg_temp_new_i64();
     b = tcg_temp_new_i64();
     c = tcg_temp_new_i64();
+    tmp = tcg_temp_new_i64();
 
-    tcg_gen_mov_i64(a, cpu_vsrh(xA(ctx->opcode)));
-    tcg_gen_mov_i64(b, cpu_vsrh(xB(ctx->opcode)));
-    tcg_gen_mov_i64(c, cpu_vsrh(xC(ctx->opcode)));
+    get_cpu_vsrh(a, xA(ctx->opcode));
+    get_cpu_vsrh(b, xB(ctx->opcode));
+    get_cpu_vsrh(c, xC(ctx->opcode));
 
     tcg_gen_and_i64(b, b, c);
     tcg_gen_andc_i64(a, a, c);
-    tcg_gen_or_i64(cpu_vsrh(xT(ctx->opcode)), a, b);
+    tcg_gen_or_i64(tmp, a, b);
+    set_cpu_vsrh(xT(ctx->opcode), tmp);
 
-    tcg_gen_mov_i64(a, cpu_vsrl(xA(ctx->opcode)));
-    tcg_gen_mov_i64(b, cpu_vsrl(xB(ctx->opcode)));
-    tcg_gen_mov_i64(c, cpu_vsrl(xC(ctx->opcode)));
+    get_cpu_vsrl(a, xA(ctx->opcode));
+    get_cpu_vsrl(b, xB(ctx->opcode));
+    get_cpu_vsrl(c, xC(ctx->opcode));
 
     tcg_gen_and_i64(b, b, c);
     tcg_gen_andc_i64(a, a, c);
-    tcg_gen_or_i64(cpu_vsrl(xT(ctx->opcode)), a, b);
+    tcg_gen_or_i64(tmp, a, b);
+    set_cpu_vsrl(xT(ctx->opcode), tmp);
 
     tcg_temp_free_i64(a);
     tcg_temp_free_i64(b);
     tcg_temp_free_i64(c);
+    tcg_temp_free_i64(tmp);
 }
 
 static void gen_xxspltw(DisasContext *ctx)
 {
     TCGv_i64 b, b2;
-    TCGv_i64 vsr = (UIM(ctx->opcode) & 2) ?
-                   cpu_vsrl(xB(ctx->opcode)) :
-                   cpu_vsrh(xB(ctx->opcode));
+    TCGv_i64 vsr;
+
+    vsr = tcg_temp_new_i64();
+    if (UIM(ctx->opcode) & 2) {
+        get_cpu_vsrl(vsr, xB(ctx->opcode));
+    } else {
+        get_cpu_vsrh(vsr, xB(ctx->opcode));
+    }
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -1122,9 +1352,11 @@ static void gen_xxspltw(DisasContext *ctx)
     }
 
     tcg_gen_shli_i64(b2, b, 32);
-    tcg_gen_or_i64(cpu_vsrh(xT(ctx->opcode)), b, b2);
-    tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), cpu_vsrh(xT(ctx->opcode)));
+    tcg_gen_or_i64(vsr, b, b2);
+    set_cpu_vsrh(xT(ctx->opcode), vsr);
+    set_cpu_vsrl(xT(ctx->opcode), vsr);
 
+    tcg_temp_free_i64(vsr);
     tcg_temp_free_i64(b);
     tcg_temp_free_i64(b2);
 }
@@ -1134,6 +1366,7 @@ static void gen_xxspltw(DisasContext *ctx)
 static void gen_xxspltib(DisasContext *ctx)
 {
     unsigned char uim8 = IMM8(ctx->opcode);
+    TCGv_i64 vsr = tcg_temp_new_i64();
     if (xS(ctx->opcode) < 32) {
         if (unlikely(!ctx->altivec_enabled)) {
             gen_exception(ctx, POWERPC_EXCP_VPU);
@@ -1145,8 +1378,10 @@ static void gen_xxspltib(DisasContext *ctx)
             return;
         }
     }
-    tcg_gen_movi_i64(cpu_vsrh(xT(ctx->opcode)), pattern(uim8));
-    tcg_gen_movi_i64(cpu_vsrl(xT(ctx->opcode)), pattern(uim8));
+    tcg_gen_movi_i64(vsr, pattern(uim8));
+    set_cpu_vsrh(xT(ctx->opcode), vsr);
+    set_cpu_vsrl(xT(ctx->opcode), vsr);
+    tcg_temp_free_i64(vsr);
 }
 
 static void gen_xxsldwi(DisasContext *ctx)
@@ -1161,40 +1396,40 @@ static void gen_xxsldwi(DisasContext *ctx)
 
     switch (SHW(ctx->opcode)) {
         case 0: {
-            tcg_gen_mov_i64(xth, cpu_vsrh(xA(ctx->opcode)));
-            tcg_gen_mov_i64(xtl, cpu_vsrl(xA(ctx->opcode)));
+            get_cpu_vsrh(xth, xA(ctx->opcode));
+            get_cpu_vsrl(xtl, xA(ctx->opcode));
             break;
         }
         case 1: {
             TCGv_i64 t0 = tcg_temp_new_i64();
-            tcg_gen_mov_i64(xth, cpu_vsrh(xA(ctx->opcode)));
+            get_cpu_vsrh(xth, xA(ctx->opcode));
             tcg_gen_shli_i64(xth, xth, 32);
-            tcg_gen_mov_i64(t0, cpu_vsrl(xA(ctx->opcode)));
+            get_cpu_vsrl(t0, xA(ctx->opcode));
             tcg_gen_shri_i64(t0, t0, 32);
             tcg_gen_or_i64(xth, xth, t0);
-            tcg_gen_mov_i64(xtl, cpu_vsrl(xA(ctx->opcode)));
+            get_cpu_vsrl(xtl, xA(ctx->opcode));
             tcg_gen_shli_i64(xtl, xtl, 32);
-            tcg_gen_mov_i64(t0, cpu_vsrh(xB(ctx->opcode)));
+            get_cpu_vsrh(t0, xB(ctx->opcode));
             tcg_gen_shri_i64(t0, t0, 32);
             tcg_gen_or_i64(xtl, xtl, t0);
             tcg_temp_free_i64(t0);
             break;
         }
         case 2: {
-            tcg_gen_mov_i64(xth, cpu_vsrl(xA(ctx->opcode)));
-            tcg_gen_mov_i64(xtl, cpu_vsrh(xB(ctx->opcode)));
+            get_cpu_vsrl(xth, xA(ctx->opcode));
+            get_cpu_vsrh(xtl, xB(ctx->opcode));
             break;
         }
         case 3: {
             TCGv_i64 t0 = tcg_temp_new_i64();
-            tcg_gen_mov_i64(xth, cpu_vsrl(xA(ctx->opcode)));
+            get_cpu_vsrl(xth, xA(ctx->opcode));
             tcg_gen_shli_i64(xth, xth, 32);
-            tcg_gen_mov_i64(t0, cpu_vsrh(xB(ctx->opcode)));
+            get_cpu_vsrh(t0, xB(ctx->opcode));
             tcg_gen_shri_i64(t0, t0, 32);
             tcg_gen_or_i64(xth, xth, t0);
-            tcg_gen_mov_i64(xtl, cpu_vsrh(xB(ctx->opcode)));
+            get_cpu_vsrh(xtl, xB(ctx->opcode));
             tcg_gen_shli_i64(xtl, xtl, 32);
-            tcg_gen_mov_i64(t0, cpu_vsrl(xB(ctx->opcode)));
+            get_cpu_vsrl(t0, xB(ctx->opcode));
             tcg_gen_shri_i64(t0, t0, 32);
             tcg_gen_or_i64(xtl, xtl, t0);
             tcg_temp_free_i64(t0);
@@ -1202,8 +1437,8 @@ static void gen_xxsldwi(DisasContext *ctx)
         }
     }
 
-    tcg_gen_mov_i64(cpu_vsrh(xT(ctx->opcode)), xth);
-    tcg_gen_mov_i64(cpu_vsrl(xT(ctx->opcode)), xtl);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
 
     tcg_temp_free_i64(xth);
     tcg_temp_free_i64(xtl);
@@ -1214,6 +1449,7 @@ static void gen_##name(DisasContext *ctx)                       \
 {                                                               \
     TCGv xt, xb;                                                \
     TCGv_i32 t0 = tcg_temp_new_i32();                           \
+    TCGv_i64 t1 = tcg_temp_new_i64();                           \
     uint8_t uimm = UIMM4(ctx->opcode);                          \
                                                                 \
     if (unlikely(!ctx->vsx_enabled)) {                          \
@@ -1226,8 +1462,9 @@ static void gen_##name(DisasContext *ctx)                       \
      * uimm > 12 handle as per hardware in helper               \
      */                                                         \
     if (uimm > 15) {                                            \
-        tcg_gen_movi_i64(cpu_vsrh(xT(ctx->opcode)), 0);         \
-        tcg_gen_movi_i64(cpu_vsrl(xT(ctx->opcode)), 0);         \
+        tcg_gen_movi_i64(t1, 0);                                \
+        set_cpu_vsrh(xT(ctx->opcode), t1);                      \
+        set_cpu_vsrl(xT(ctx->opcode), t1);                      \
         return;                                                 \
     }                                                           \
     tcg_gen_movi_i32(t0, uimm);                                 \
@@ -1235,6 +1472,7 @@ static void gen_##name(DisasContext *ctx)                       \
     tcg_temp_free(xb);                                          \
     tcg_temp_free(xt);                                          \
     tcg_temp_free_i32(t0);                                      \
+    tcg_temp_free_i64(t1);                                      \
 }
 
 VSX_EXTRACT_INSERT(xxextractuw)
@@ -1244,30 +1482,41 @@ VSX_EXTRACT_INSERT(xxinsertw)
 static void gen_xsxexpdp(DisasContext *ctx)
 {
     TCGv rt = cpu_gpr[rD(ctx->opcode)];
+    TCGv_i64 t0 = tcg_temp_new_i64();
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
-    tcg_gen_extract_i64(rt, cpu_vsrh(xB(ctx->opcode)), 52, 11);
+    get_cpu_vsrh(t0, xB(ctx->opcode));
+    tcg_gen_extract_i64(rt, t0, 52, 11);
+    tcg_temp_free_i64(t0);
 }
 
 static void gen_xsxexpqp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(rD(ctx->opcode) + 32);
-    TCGv_i64 xtl = cpu_vsrl(rD(ctx->opcode) + 32);
-    TCGv_i64 xbh = cpu_vsrh(rB(ctx->opcode) + 32);
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, rB(ctx->opcode) + 32);
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
     tcg_gen_extract_i64(xth, xbh, 48, 15);
+    set_cpu_vsrh(rD(ctx->opcode) + 32, xth);
     tcg_gen_movi_i64(xtl, 0);
+    set_cpu_vsrl(rD(ctx->opcode) + 32, xtl);
+
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
 }
 
 static void gen_xsiexpdp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
+    TCGv_i64 xth;
     TCGv ra = cpu_gpr[rA(ctx->opcode)];
     TCGv rb = cpu_gpr[rB(ctx->opcode)];
     TCGv_i64 t0;
@@ -1277,21 +1526,30 @@ static void gen_xsiexpdp(DisasContext *ctx)
         return;
     }
     t0 = tcg_temp_new_i64();
+    xth = tcg_temp_new_i64();
     tcg_gen_andi_i64(xth, ra, 0x800FFFFFFFFFFFFF);
     tcg_gen_andi_i64(t0, rb, 0x7FF);
     tcg_gen_shli_i64(t0, t0, 52);
     tcg_gen_or_i64(xth, xth, t0);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
     /* dword[1] is undefined */
     tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(xth);
 }
 
 static void gen_xsiexpqp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(rD(ctx->opcode) + 32);
-    TCGv_i64 xtl = cpu_vsrl(rD(ctx->opcode) + 32);
-    TCGv_i64 xah = cpu_vsrh(rA(ctx->opcode) + 32);
-    TCGv_i64 xal = cpu_vsrl(rA(ctx->opcode) + 32);
-    TCGv_i64 xbh = cpu_vsrh(rB(ctx->opcode) + 32);
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xah = tcg_temp_new_i64();
+    TCGv_i64 xal = tcg_temp_new_i64();
+    get_cpu_vsrh(xah, rA(ctx->opcode) + 32);
+    get_cpu_vsrl(xal, rA(ctx->opcode) + 32);
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, rB(ctx->opcode) + 32);
+
     TCGv_i64 t0;
 
     if (unlikely(!ctx->vsx_enabled)) {
@@ -1303,14 +1561,22 @@ static void gen_xsiexpqp(DisasContext *ctx)
     tcg_gen_andi_i64(t0, xbh, 0x7FFF);
     tcg_gen_shli_i64(t0, t0, 48);
     tcg_gen_or_i64(xth, xth, t0);
+    set_cpu_vsrh(rD(ctx->opcode) + 32, xth);
     tcg_gen_mov_i64(xtl, xal);
+    set_cpu_vsrl(rD(ctx->opcode) + 32, xtl);
+
     tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xah);
+    tcg_temp_free_i64(xal);
+    tcg_temp_free_i64(xbh);
 }
 
 static void gen_xsxsigdp(DisasContext *ctx)
 {
     TCGv rt = cpu_gpr[rD(ctx->opcode)];
-    TCGv_i64 t0, zr, nan, exp;
+    TCGv_i64 t0, t1, zr, nan, exp;
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -1318,17 +1584,21 @@ static void gen_xsxsigdp(DisasContext *ctx)
     }
     exp = tcg_temp_new_i64();
     t0 = tcg_temp_new_i64();
+    t1 = tcg_temp_new_i64();
     zr = tcg_const_i64(0);
     nan = tcg_const_i64(2047);
 
-    tcg_gen_extract_i64(exp, cpu_vsrh(xB(ctx->opcode)), 52, 11);
+    get_cpu_vsrh(t1, xB(ctx->opcode));
+    tcg_gen_extract_i64(exp, t1, 52, 11);
     tcg_gen_movi_i64(t0, 0x0010000000000000);
     tcg_gen_movcond_i64(TCG_COND_EQ, t0, exp, zr, zr, t0);
     tcg_gen_movcond_i64(TCG_COND_EQ, t0, exp, nan, zr, t0);
-    tcg_gen_andi_i64(rt, cpu_vsrh(xB(ctx->opcode)), 0x000FFFFFFFFFFFFF);
+    get_cpu_vsrh(t1, xB(ctx->opcode));
+    tcg_gen_andi_i64(rt, t1, 0x000FFFFFFFFFFFFF);
     tcg_gen_or_i64(rt, rt, t0);
 
     tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(t1);
     tcg_temp_free_i64(exp);
     tcg_temp_free_i64(zr);
     tcg_temp_free_i64(nan);
@@ -1337,8 +1607,13 @@ static void gen_xsxsigdp(DisasContext *ctx)
 static void gen_xsxsigqp(DisasContext *ctx)
 {
     TCGv_i64 t0, zr, nan, exp;
-    TCGv_i64 xth = cpu_vsrh(rD(ctx->opcode) + 32);
-    TCGv_i64 xtl = cpu_vsrl(rD(ctx->opcode) + 32);
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, rB(ctx->opcode) + 32);
+    get_cpu_vsrl(xbl, rB(ctx->opcode) + 32);
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -1349,29 +1624,41 @@ static void gen_xsxsigqp(DisasContext *ctx)
     zr = tcg_const_i64(0);
     nan = tcg_const_i64(32767);
 
-    tcg_gen_extract_i64(exp, cpu_vsrh(rB(ctx->opcode) + 32), 48, 15);
+    tcg_gen_extract_i64(exp, xbh, 48, 15);
     tcg_gen_movi_i64(t0, 0x0001000000000000);
     tcg_gen_movcond_i64(TCG_COND_EQ, t0, exp, zr, zr, t0);
     tcg_gen_movcond_i64(TCG_COND_EQ, t0, exp, nan, zr, t0);
-    tcg_gen_andi_i64(xth, cpu_vsrh(rB(ctx->opcode) + 32), 0x0000FFFFFFFFFFFF);
+    tcg_gen_andi_i64(xth, xbh, 0x0000FFFFFFFFFFFF);
     tcg_gen_or_i64(xth, xth, t0);
-    tcg_gen_mov_i64(xtl, cpu_vsrl(rB(ctx->opcode) + 32));
+    set_cpu_vsrh(rD(ctx->opcode) + 32, xth);
+    tcg_gen_mov_i64(xtl, xbl);
+    set_cpu_vsrl(rD(ctx->opcode) + 32, xtl);
 
     tcg_temp_free_i64(t0);
     tcg_temp_free_i64(exp);
     tcg_temp_free_i64(zr);
     tcg_temp_free_i64(nan);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 #endif
 
 static void gen_xviexpsp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xah = cpu_vsrh(xA(ctx->opcode));
-    TCGv_i64 xal = cpu_vsrl(xA(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xah = tcg_temp_new_i64();
+    TCGv_i64 xal = tcg_temp_new_i64();
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xah, xA(ctx->opcode));
+    get_cpu_vsrl(xal, xA(ctx->opcode));
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
+
     TCGv_i64 t0;
 
     if (unlikely(!ctx->vsx_enabled)) {
@@ -1383,21 +1670,36 @@ static void gen_xviexpsp(DisasContext *ctx)
     tcg_gen_andi_i64(t0, xbh, 0xFF000000FF);
     tcg_gen_shli_i64(t0, t0, 23);
     tcg_gen_or_i64(xth, xth, t0);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
     tcg_gen_andi_i64(xtl, xal, 0x807FFFFF807FFFFF);
     tcg_gen_andi_i64(t0, xbl, 0xFF000000FF);
     tcg_gen_shli_i64(t0, t0, 23);
     tcg_gen_or_i64(xtl, xtl, t0);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
+
     tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xah);
+    tcg_temp_free_i64(xal);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 static void gen_xviexpdp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xah = cpu_vsrh(xA(ctx->opcode));
-    TCGv_i64 xal = cpu_vsrl(xA(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xah = tcg_temp_new_i64();
+    TCGv_i64 xal = tcg_temp_new_i64();
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xah, xA(ctx->opcode));
+    get_cpu_vsrl(xal, xA(ctx->opcode));
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
+
     TCGv_i64 t0;
 
     if (unlikely(!ctx->vsx_enabled)) {
@@ -1409,19 +1711,31 @@ static void gen_xviexpdp(DisasContext *ctx)
     tcg_gen_andi_i64(t0, xbh, 0x7FF);
     tcg_gen_shli_i64(t0, t0, 52);
     tcg_gen_or_i64(xth, xth, t0);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
     tcg_gen_andi_i64(xtl, xal, 0x800FFFFFFFFFFFFF);
     tcg_gen_andi_i64(t0, xbl, 0x7FF);
     tcg_gen_shli_i64(t0, t0, 52);
     tcg_gen_or_i64(xtl, xtl, t0);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
+
     tcg_temp_free_i64(t0);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xah);
+    tcg_temp_free_i64(xal);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 static void gen_xvxexpsp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
@@ -1429,33 +1743,53 @@ static void gen_xvxexpsp(DisasContext *ctx)
     }
     tcg_gen_shri_i64(xth, xbh, 23);
     tcg_gen_andi_i64(xth, xth, 0xFF000000FF);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
     tcg_gen_shri_i64(xtl, xbl, 23);
     tcg_gen_andi_i64(xtl, xtl, 0xFF000000FF);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
+
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 static void gen_xvxexpdp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
 
     if (unlikely(!ctx->vsx_enabled)) {
         gen_exception(ctx, POWERPC_EXCP_VSXU);
         return;
     }
     tcg_gen_extract_i64(xth, xbh, 52, 11);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
     tcg_gen_extract_i64(xtl, xbl, 52, 11);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
+
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 GEN_VSX_HELPER_2(xvxsigsp, 0x00, 0x04, 0, PPC2_ISA300)
 
 static void gen_xvxsigdp(DisasContext *ctx)
 {
-    TCGv_i64 xth = cpu_vsrh(xT(ctx->opcode));
-    TCGv_i64 xtl = cpu_vsrl(xT(ctx->opcode));
-    TCGv_i64 xbh = cpu_vsrh(xB(ctx->opcode));
-    TCGv_i64 xbl = cpu_vsrl(xB(ctx->opcode));
+    TCGv_i64 xth = tcg_temp_new_i64();
+    TCGv_i64 xtl = tcg_temp_new_i64();
+
+    TCGv_i64 xbh = tcg_temp_new_i64();
+    TCGv_i64 xbl = tcg_temp_new_i64();
+    get_cpu_vsrh(xbh, xB(ctx->opcode));
+    get_cpu_vsrl(xbl, xB(ctx->opcode));
 
     TCGv_i64 t0, zr, nan, exp;
 
@@ -1474,6 +1808,7 @@ static void gen_xvxsigdp(DisasContext *ctx)
     tcg_gen_movcond_i64(TCG_COND_EQ, t0, exp, nan, zr, t0);
     tcg_gen_andi_i64(xth, xbh, 0x000FFFFFFFFFFFFF);
     tcg_gen_or_i64(xth, xth, t0);
+    set_cpu_vsrh(xT(ctx->opcode), xth);
 
     tcg_gen_extract_i64(exp, xbl, 52, 11);
     tcg_gen_movi_i64(t0, 0x0010000000000000);
@@ -1481,11 +1816,16 @@ static void gen_xvxsigdp(DisasContext *ctx)
     tcg_gen_movcond_i64(TCG_COND_EQ, t0, exp, nan, zr, t0);
     tcg_gen_andi_i64(xtl, xbl, 0x000FFFFFFFFFFFFF);
     tcg_gen_or_i64(xtl, xtl, t0);
+    set_cpu_vsrl(xT(ctx->opcode), xtl);
 
     tcg_temp_free_i64(t0);
     tcg_temp_free_i64(exp);
     tcg_temp_free_i64(zr);
     tcg_temp_free_i64(nan);
+    tcg_temp_free_i64(xth);
+    tcg_temp_free_i64(xtl);
+    tcg_temp_free_i64(xbh);
+    tcg_temp_free_i64(xbl);
 }
 
 #undef GEN_XX2FORM
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
                   ` (2 preceding siblings ...)
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 3/6] target/ppc: introduce get_cpu_vsr{l, h}() and set_cpu_vsr{l, h}() helpers for VSR " Mark Cave-Ayland
@ 2018-12-07  8:56 ` Mark Cave-Ayland
  2018-12-10 19:05   ` Richard Henderson
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 5/6] target/ppc: convert VMX logical instructions to use vector operations Mark Cave-Ayland
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-07  8:56 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, david, richard.henderson

Instead of accessing the FPR, VMX and VSX registers through static arrays of
TCGv_i64 globals, remove them and change the helpers to load/store data directly
within cpu_env.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/translate.c | 59 ++++++++++++++------------------------------------
 1 file changed, 16 insertions(+), 43 deletions(-)

diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index fa3e8dc114..5923c688cd 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -55,15 +55,9 @@
 /* global register indexes */
 static char cpu_reg_names[10*3 + 22*4 /* GPR */
     + 10*4 + 22*5 /* SPE GPRh */
-    + 10*4 + 22*5 /* FPR */
-    + 2*(10*6 + 22*7) /* AVRh, AVRl */
-    + 10*5 + 22*6 /* VSR */
     + 8*5 /* CRF */];
 static TCGv cpu_gpr[32];
 static TCGv cpu_gprh[32];
-static TCGv_i64 cpu_fpr[32];
-static TCGv_i64 cpu_avrh[32], cpu_avrl[32];
-static TCGv_i64 cpu_vsr[32];
 static TCGv_i32 cpu_crf[8];
 static TCGv cpu_nip;
 static TCGv cpu_msr;
@@ -108,39 +102,6 @@ void ppc_translate_init(void)
                                          offsetof(CPUPPCState, gprh[i]), p);
         p += (i < 10) ? 4 : 5;
         cpu_reg_names_size -= (i < 10) ? 4 : 5;
-
-        snprintf(p, cpu_reg_names_size, "fp%d", i);
-        cpu_fpr[i] = tcg_global_mem_new_i64(cpu_env,
-                                            offsetof(CPUPPCState, fpr[i]), p);
-        p += (i < 10) ? 4 : 5;
-        cpu_reg_names_size -= (i < 10) ? 4 : 5;
-
-        snprintf(p, cpu_reg_names_size, "avr%dH", i);
-#ifdef HOST_WORDS_BIGENDIAN
-        cpu_avrh[i] = tcg_global_mem_new_i64(cpu_env,
-                                             offsetof(CPUPPCState, avr[i].u64[0]), p);
-#else
-        cpu_avrh[i] = tcg_global_mem_new_i64(cpu_env,
-                                             offsetof(CPUPPCState, avr[i].u64[1]), p);
-#endif
-        p += (i < 10) ? 6 : 7;
-        cpu_reg_names_size -= (i < 10) ? 6 : 7;
-
-        snprintf(p, cpu_reg_names_size, "avr%dL", i);
-#ifdef HOST_WORDS_BIGENDIAN
-        cpu_avrl[i] = tcg_global_mem_new_i64(cpu_env,
-                                             offsetof(CPUPPCState, avr[i].u64[1]), p);
-#else
-        cpu_avrl[i] = tcg_global_mem_new_i64(cpu_env,
-                                             offsetof(CPUPPCState, avr[i].u64[0]), p);
-#endif
-        p += (i < 10) ? 6 : 7;
-        cpu_reg_names_size -= (i < 10) ? 6 : 7;
-        snprintf(p, cpu_reg_names_size, "vsr%d", i);
-        cpu_vsr[i] = tcg_global_mem_new_i64(cpu_env,
-                                            offsetof(CPUPPCState, vsr[i]), p);
-        p += (i < 10) ? 5 : 6;
-        cpu_reg_names_size -= (i < 10) ? 5 : 6;
     }
 
     cpu_nip = tcg_global_mem_new(cpu_env,
@@ -6696,22 +6657,34 @@ GEN_TM_PRIV_NOOP(trechkpt);
 
 static inline void get_fpr(TCGv_i64 dst, int regno)
 {
-    tcg_gen_mov_i64(dst, cpu_fpr[regno]);
+    tcg_gen_ld_i64(dst, cpu_env, offsetof(CPUPPCState, fpr[regno]));
 }
 
 static inline void set_fpr(int regno, TCGv_i64 src)
 {
-    tcg_gen_mov_i64(cpu_fpr[regno], src);
+    tcg_gen_st_i64(src, cpu_env, offsetof(CPUPPCState, fpr[regno]));
 }
 
 static inline void get_avr64(TCGv_i64 dst, int regno, bool high)
 {
-    tcg_gen_mov_i64(dst, (high ? cpu_avrh : cpu_avrl)[regno]);
+#ifdef HOST_WORDS_BIGENDIAN
+    tcg_gen_ld_i64(dst, cpu_env, offsetof(CPUPPCState,
+                                          avr[regno].u64[(high ? 0 : 1)]));
+#else
+    tcg_gen_ld_i64(dst, cpu_env, offsetof(CPUPPCState,
+                                          avr[regno].u64[(high ? 1 : 0)]));
+#endif
 }
 
 static inline void set_avr64(int regno, TCGv_i64 src, bool high)
 {
-    tcg_gen_mov_i64((high ? cpu_avrh : cpu_avrl)[regno], src);
+#ifdef HOST_WORDS_BIGENDIAN
+    tcg_gen_st_i64(src, cpu_env, offsetof(CPUPPCState,
+                                          avr[regno].u64[(high ? 0 : 1)]));
+#else
+    tcg_gen_st_i64(src, cpu_env, offsetof(CPUPPCState,
+                                          avr[regno].u64[(high ? 1 : 0)]));
+#endif
 }
 
 #include "translate/fp-impl.inc.c"
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 5/6] target/ppc: convert VMX logical instructions to use vector operations
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
                   ` (3 preceding siblings ...)
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env Mark Cave-Ayland
@ 2018-12-07  8:56 ` Mark Cave-Ayland
  2018-12-10 19:08   ` Richard Henderson
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 6/6] target/ppc: convert vaddu[b, h, w, d] and vsubu[b, h, w, d] over " Mark Cave-Ayland
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-07  8:56 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, david, richard.henderson

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/translate.c              |  1 +
 target/ppc/translate/vmx-impl.inc.c | 64 ++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 5923c688cd..92d023864e 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -24,6 +24,7 @@
 #include "disas/disas.h"
 #include "exec/exec-all.h"
 #include "tcg-op.h"
+#include "tcg-op-gvec.h"
 #include "qemu/host-utils.h"
 #include "exec/cpu_ldst.h"
 
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index 30046c6e31..b252fce71b 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -14,6 +14,11 @@ static inline TCGv_ptr gen_avr_ptr(int reg)
     return r;
 }
 
+static inline long avr64_offset(int reg, bool high)
+{
+    return offsetof(CPUPPCState, avr[reg].u64[(high ? 0 : 1)]);
+}
+
 #define GEN_VR_LDX(name, opc2, opc3)                                          \
 static void glue(gen_, name)(DisasContext *ctx)                                       \
 {                                                                             \
@@ -257,41 +262,50 @@ GEN_VX_VMUL10(vmul10euq, 1, 0);
 GEN_VX_VMUL10(vmul10cuq, 0, 1);
 GEN_VX_VMUL10(vmul10ecuq, 1, 1);
 
-/* Logical operations */
-#define GEN_VX_LOGICAL(name, tcg_op, opc2, opc3)                        \
-static void glue(gen_, name)(DisasContext *ctx)                                 \
+#define GEN_VXFORM_V(name, vece, tcg_op, opc2, opc3)                    \
+static void glue(gen_, name)(DisasContext *ctx)                         \
 {                                                                       \
-    TCGv_i64 t0 = tcg_temp_new_i64();                                   \
-    TCGv_i64 t1 = tcg_temp_new_i64();                                   \
-    TCGv_i64 avr = tcg_temp_new_i64();                                  \
+    if (unlikely(!ctx->altivec_enabled)) {                              \
+        gen_exception(ctx, POWERPC_EXCP_VPU);                           \
+        return;                                                         \
+    }                                                                   \
                                                                         \
+    tcg_op(vece,                                                        \
+           avr64_offset(rD(ctx->opcode), true),                         \
+           avr64_offset(rA(ctx->opcode), true),                         \
+           avr64_offset(rB(ctx->opcode), true),                         \
+           16, 16);                                                     \
+}
+
+#define GEN_VXFORM_VN(name, vece, tcg_op, opc2, opc3)                   \
+static void glue(gen_, name)(DisasContext *ctx)                         \
+{                                                                       \
     if (unlikely(!ctx->altivec_enabled)) {                              \
         gen_exception(ctx, POWERPC_EXCP_VPU);                           \
         return;                                                         \
     }                                                                   \
-    get_avr64(t0, rA(ctx->opcode), true);                               \
-    get_avr64(t1, rB(ctx->opcode), true);                               \
-    tcg_op(avr, t0, t1);                                                \
-    set_avr64(rD(ctx->opcode), avr, true);                              \
                                                                         \
-    get_avr64(t0, rA(ctx->opcode), false);                              \
-    get_avr64(t1, rB(ctx->opcode), false);                              \
-    tcg_op(avr, t0, t1);                                                \
-    set_avr64(rD(ctx->opcode), avr, false);                             \
+    tcg_op(vece,                                                        \
+           avr64_offset(rD(ctx->opcode), true),                         \
+           avr64_offset(rA(ctx->opcode), true),                         \
+           avr64_offset(rB(ctx->opcode), true),                         \
+           16, 16);                                                     \
                                                                         \
-    tcg_temp_free_i64(t0);                                              \
-    tcg_temp_free_i64(t1);                                              \
-    tcg_temp_free_i64(avr);                                             \
+    tcg_gen_gvec_not(vece,                                              \
+                     avr64_offset(rD(ctx->opcode), true),               \
+                     avr64_offset(rD(ctx->opcode), true),               \
+                     16, 16);                                           \
 }
 
-GEN_VX_LOGICAL(vand, tcg_gen_and_i64, 2, 16);
-GEN_VX_LOGICAL(vandc, tcg_gen_andc_i64, 2, 17);
-GEN_VX_LOGICAL(vor, tcg_gen_or_i64, 2, 18);
-GEN_VX_LOGICAL(vxor, tcg_gen_xor_i64, 2, 19);
-GEN_VX_LOGICAL(vnor, tcg_gen_nor_i64, 2, 20);
-GEN_VX_LOGICAL(veqv, tcg_gen_eqv_i64, 2, 26);
-GEN_VX_LOGICAL(vnand, tcg_gen_nand_i64, 2, 22);
-GEN_VX_LOGICAL(vorc, tcg_gen_orc_i64, 2, 21);
+/* Logical operations */
+GEN_VXFORM_V(vand, MO_64, tcg_gen_gvec_and, 2, 16);
+GEN_VXFORM_V(vandc, MO_64, tcg_gen_gvec_andc, 2, 17);
+GEN_VXFORM_V(vor, MO_64, tcg_gen_gvec_or, 2, 18);
+GEN_VXFORM_V(vxor, MO_64, tcg_gen_gvec_xor, 2, 19);
+GEN_VXFORM_VN(vnor, MO_64, tcg_gen_gvec_or, 2, 20);
+GEN_VXFORM_VN(veqv, MO_64, tcg_gen_gvec_xor, 2, 26);
+GEN_VXFORM_VN(vnand, MO_64, tcg_gen_gvec_and, 2, 22);
+GEN_VXFORM_V(vorc, MO_64, tcg_gen_gvec_orc, 2, 21);
 
 #define GEN_VXFORM(name, opc2, opc3)                                    \
 static void glue(gen_, name)(DisasContext *ctx)                                 \
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 6/6] target/ppc: convert vaddu[b, h, w, d] and vsubu[b, h, w, d] over to use vector operations
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
                   ` (4 preceding siblings ...)
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 5/6] target/ppc: convert VMX logical instructions to use vector operations Mark Cave-Ayland
@ 2018-12-07  8:56 ` Mark Cave-Ayland
  2018-12-10 19:09   ` Richard Henderson
  2018-12-10  0:33 ` [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG " BALATON Zoltan
  2018-12-10 13:04 ` [Qemu-devel] " Aleksandar Markovic
  7 siblings, 1 reply; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-07  8:56 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc, david, richard.henderson

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/helper.h                 |  8 --------
 target/ppc/int_helper.c             |  7 -------
 target/ppc/translate/vmx-impl.inc.c | 16 ++++++++--------
 3 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index c7de04e068..553ff500c8 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -108,14 +108,6 @@ DEF_HELPER_FLAGS_1(ftsqrt, TCG_CALL_NO_RWG_SE, i32, i64)
 #define dh_ctype_avr ppc_avr_t *
 #define dh_is_signed_avr dh_is_signed_ptr
 
-DEF_HELPER_3(vaddubm, void, avr, avr, avr)
-DEF_HELPER_3(vadduhm, void, avr, avr, avr)
-DEF_HELPER_3(vadduwm, void, avr, avr, avr)
-DEF_HELPER_3(vaddudm, void, avr, avr, avr)
-DEF_HELPER_3(vsububm, void, avr, avr, avr)
-DEF_HELPER_3(vsubuhm, void, avr, avr, avr)
-DEF_HELPER_3(vsubuwm, void, avr, avr, avr)
-DEF_HELPER_3(vsubudm, void, avr, avr, avr)
 DEF_HELPER_3(vavgub, void, avr, avr, avr)
 DEF_HELPER_3(vavguh, void, avr, avr, avr)
 DEF_HELPER_3(vavguw, void, avr, avr, avr)
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index fcac90a4a9..d0ef37666c 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -531,13 +531,6 @@ void helper_vprtybq(ppc_avr_t *r, ppc_avr_t *b)
             r->element[i] = a->element[i] op b->element[i];             \
         }                                                               \
     }
-#define VARITH(suffix, element)                 \
-    VARITH_DO(add##suffix, +, element)          \
-    VARITH_DO(sub##suffix, -, element)
-VARITH(ubm, u8)
-VARITH(uhm, u16)
-VARITH(uwm, u32)
-VARITH(udm, u64)
 VARITH_DO(muluwm, *, u32)
 #undef VARITH_DO
 #undef VARITH
diff --git a/target/ppc/translate/vmx-impl.inc.c b/target/ppc/translate/vmx-impl.inc.c
index b252fce71b..3ab381104b 100644
--- a/target/ppc/translate/vmx-impl.inc.c
+++ b/target/ppc/translate/vmx-impl.inc.c
@@ -411,18 +411,18 @@ static void glue(gen_, name)(DisasContext *ctx)                         \
     tcg_temp_free_ptr(rb);                                              \
 }
 
-GEN_VXFORM(vaddubm, 0, 0);
+GEN_VXFORM_V(vaddubm, MO_8, tcg_gen_gvec_add, 0, 0);
 GEN_VXFORM_DUAL_EXT(vaddubm, PPC_ALTIVEC, PPC_NONE, 0,       \
                     vmul10cuq, PPC_NONE, PPC2_ISA300, 0x0000F800)
-GEN_VXFORM(vadduhm, 0, 1);
+GEN_VXFORM_V(vadduhm, MO_16, tcg_gen_gvec_add, 0, 1);
 GEN_VXFORM_DUAL(vadduhm, PPC_ALTIVEC, PPC_NONE,  \
                 vmul10ecuq, PPC_NONE, PPC2_ISA300)
-GEN_VXFORM(vadduwm, 0, 2);
-GEN_VXFORM(vaddudm, 0, 3);
-GEN_VXFORM(vsububm, 0, 16);
-GEN_VXFORM(vsubuhm, 0, 17);
-GEN_VXFORM(vsubuwm, 0, 18);
-GEN_VXFORM(vsubudm, 0, 19);
+GEN_VXFORM_V(vadduwm, MO_32, tcg_gen_gvec_add, 0, 2);
+GEN_VXFORM_V(vaddudm, MO_64, tcg_gen_gvec_add, 0, 3);
+GEN_VXFORM_V(vsububm, MO_8, tcg_gen_gvec_sub, 0, 16);
+GEN_VXFORM_V(vsubuhm, MO_16, tcg_gen_gvec_sub, 0, 17);
+GEN_VXFORM_V(vsubuwm, MO_32, tcg_gen_gvec_sub, 0, 18);
+GEN_VXFORM_V(vsubudm, MO_64, tcg_gen_gvec_sub, 0, 19);
 GEN_VXFORM(vmaxub, 1, 0);
 GEN_VXFORM(vmaxuh, 1, 1);
 GEN_VXFORM(vmaxuw, 1, 2);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
                   ` (5 preceding siblings ...)
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 6/6] target/ppc: convert vaddu[b, h, w, d] and vsubu[b, h, w, d] over " Mark Cave-Ayland
@ 2018-12-10  0:33 ` BALATON Zoltan
  2018-12-10  2:59   ` David Gibson
  2018-12-10 13:04 ` [Qemu-devel] " Aleksandar Markovic
  7 siblings, 1 reply; 33+ messages in thread
From: BALATON Zoltan @ 2018-12-10  0:33 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: qemu-devel, qemu-ppc, david, richard.henderson

On Fri, 7 Dec 2018, Mark Cave-Ayland wrote:
> This patchset is an attempt at trying to improve the VMX (Altivec) instruction
> performance by making use of the new TCG vector operations where possible.

This is very welcome, thanks for doing this.

> In order to use TCG vector operations, the registers must be accessible from cpu_env
> whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
> are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
> registers using the supplied TCGv_i64 parameter.

Have you tried some benchmarks or tests to measure the impact of these 
changes? I've tried the (very unscientific) benchmarks I've written about 
before here:

http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html

(which seem to use AltiVec/VMX instructions but not sure which) on mac99 
with MorphOS and I could not see any performance increase. I haven't run 
enough tests but results with or without this series on master were mostly 
the same within a few percents, and sometimes even seen lower performance 
with these patches than without. I haven't tried to find out why (no time 
for that now) so can't really draw any conclusions from this. I'm also not 
sure if I've actually tested what you've changed or these use instructions 
that your patches don't optimise yet, or the changes I've seen were just 
normal changes between runs; but I wonder if the increased number of 
temporaries could result in lower performance in some cases?

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-10  0:33 ` [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG " BALATON Zoltan
@ 2018-12-10  2:59   ` David Gibson
  2018-12-10 20:54     ` BALATON Zoltan
  0 siblings, 1 reply; 33+ messages in thread
From: David Gibson @ 2018-12-10  2:59 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Mark Cave-Ayland, qemu-devel, qemu-ppc, richard.henderson

[-- Attachment #1: Type: text/plain, Size: 2252 bytes --]

On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote:
> On Fri, 7 Dec 2018, Mark Cave-Ayland wrote:
> > This patchset is an attempt at trying to improve the VMX (Altivec) instruction
> > performance by making use of the new TCG vector operations where possible.
> 
> This is very welcome, thanks for doing this.
> 
> > In order to use TCG vector operations, the registers must be accessible from cpu_env
> > whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
> > are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
> > registers using the supplied TCGv_i64 parameter.
> 
> Have you tried some benchmarks or tests to measure the impact of these
> changes? I've tried the (very unscientific) benchmarks I've written about
> before here:
> 
> http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html
> 
> (which seem to use AltiVec/VMX instructions but not sure which) on mac99
> with MorphOS and I could not see any performance increase. I haven't run
> enough tests but results with or without this series on master were mostly
> the same within a few percents, and sometimes even seen lower performance
> with these patches than without. I haven't tried to find out why (no time
> for that now) so can't really draw any conclusions from this. I'm also not
> sure if I've actually tested what you've changed or these use instructions
> that your patches don't optimise yet, or the changes I've seen were just
> normal changes between runs; but I wonder if the increased number of
> temporaries could result in lower performance in some cases?

What was your host machine.  IIUC this change will only improve
performance if the host tcg backend is able to implement TCG vector
ops in terms of vector ops on the host.

In addition, this series only converts a subset of the integer and
logical vector instructions.  If your testcase is mostly floating
point (vectored or otherwise), it will still be softfloat and so not
see any speedup.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access Mark Cave-Ayland
@ 2018-12-10  5:17   ` David Gibson
  2018-12-10 18:25     ` Richard Henderson
  2018-12-11 19:06     ` Mark Cave-Ayland
  2018-12-10 18:43   ` Richard Henderson
  1 sibling, 2 replies; 33+ messages in thread
From: David Gibson @ 2018-12-10  5:17 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: qemu-devel, qemu-ppc, richard.henderson

[-- Attachment #1: Type: text/plain, Size: 57843 bytes --]

On Fri, Dec 07, 2018 at 08:56:30AM +0000, Mark Cave-Ayland wrote:
> These helpers allow us to move FP register values to/from the specified TCGv_i64
> argument.
> 
> To prevent FP helpers accessing the cpu_fpr array directly, add extra TCG
> temporaries as required.

It's not obvious to me why that's a desirable thing.  I'm assuming
it's somehow necessary for the stuff later in the series, but I think
we need a brief rationale here to explain why this isn't just adding
extra reg copies for the sake of it.

> 
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
> ---
>  target/ppc/translate.c             |  10 +
>  target/ppc/translate/fp-impl.inc.c | 492 ++++++++++++++++++++++++++++---------
>  2 files changed, 392 insertions(+), 110 deletions(-)
> 
> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> index 2b37910248..1d4bf624a3 100644
> --- a/target/ppc/translate.c
> +++ b/target/ppc/translate.c
> @@ -6694,6 +6694,16 @@ static inline void gen_##name(DisasContext *ctx)               \
>  GEN_TM_PRIV_NOOP(treclaim);
>  GEN_TM_PRIV_NOOP(trechkpt);
>  
> +static inline void get_fpr(TCGv_i64 dst, int regno)
> +{
> +    tcg_gen_mov_i64(dst, cpu_fpr[regno]);
> +}
> +
> +static inline void set_fpr(int regno, TCGv_i64 src)
> +{
> +    tcg_gen_mov_i64(cpu_fpr[regno], src);
> +}
> +
>  #include "translate/fp-impl.inc.c"
>  
>  #include "translate/vmx-impl.inc.c"
> diff --git a/target/ppc/translate/fp-impl.inc.c b/target/ppc/translate/fp-impl.inc.c
> index 08770ba9f5..923fb7550f 100644
> --- a/target/ppc/translate/fp-impl.inc.c
> +++ b/target/ppc/translate/fp-impl.inc.c
> @@ -34,24 +34,39 @@ static void gen_set_cr1_from_fpscr(DisasContext *ctx)
>  #define _GEN_FLOAT_ACB(name, op, op1, op2, isfloat, set_fprf, type)           \
>  static void gen_f##name(DisasContext *ctx)                                    \
>  {                                                                             \
> +    TCGv_i64 t0;                                                              \
> +    TCGv_i64 t1;                                                              \
> +    TCGv_i64 t2;                                                              \
> +    TCGv_i64 t3;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
> +    t0 = tcg_temp_new_i64();                                                  \
> +    t1 = tcg_temp_new_i64();                                                  \
> +    t2 = tcg_temp_new_i64();                                                  \
> +    t3 = tcg_temp_new_i64();                                                  \
>      gen_reset_fpstatus();                                                     \
> -    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
> -                     cpu_fpr[rA(ctx->opcode)],                                \
> -                     cpu_fpr[rC(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);     \
> +    get_fpr(t0, rA(ctx->opcode));                                             \
> +    get_fpr(t1, rC(ctx->opcode));                                             \
> +    get_fpr(t2, rB(ctx->opcode));                                             \
> +    gen_helper_f##op(t3, cpu_env, t0, t1, t2);                                \
> +    set_fpr(rD(ctx->opcode), t3);                                             \
>      if (isfloat) {                                                            \
> -        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
> -                        cpu_fpr[rD(ctx->opcode)]);                            \
> +        get_fpr(t0, rD(ctx->opcode));                                         \
> +        gen_helper_frsp(t3, cpu_env, t0);                                     \
> +        set_fpr(rD(ctx->opcode), t3);                                         \
>      }                                                                         \
>      if (set_fprf) {                                                           \
> -        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
> +        gen_compute_fprf_float64(t3);                                         \
>      }                                                                         \
>      if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
>          gen_set_cr1_from_fpscr(ctx);                                          \
>      }                                                                         \
> +    tcg_temp_free_i64(t0);                                                    \
> +    tcg_temp_free_i64(t1);                                                    \
> +    tcg_temp_free_i64(t2);                                                    \
> +    tcg_temp_free_i64(t3);                                                    \
>  }
>  
>  #define GEN_FLOAT_ACB(name, op2, set_fprf, type)                              \
> @@ -61,24 +76,35 @@ _GEN_FLOAT_ACB(name##s, name, 0x3B, op2, 1, set_fprf, type);
>  #define _GEN_FLOAT_AB(name, op, op1, op2, inval, isfloat, set_fprf, type)     \
>  static void gen_f##name(DisasContext *ctx)                                    \
>  {                                                                             \
> +    TCGv_i64 t0;                                                              \
> +    TCGv_i64 t1;                                                              \
> +    TCGv_i64 t2;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
> +    t0 = tcg_temp_new_i64();                                                  \
> +    t1 = tcg_temp_new_i64();                                                  \
> +    t2 = tcg_temp_new_i64();                                                  \
>      gen_reset_fpstatus();                                                     \
> -    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
> -                     cpu_fpr[rA(ctx->opcode)],                                \
> -                     cpu_fpr[rB(ctx->opcode)]);                               \
> +    get_fpr(t0, rA(ctx->opcode));                                             \
> +    get_fpr(t1, rB(ctx->opcode));                                             \
> +    gen_helper_f##op(t2, cpu_env, t0, t1);                                    \
> +    set_fpr(rD(ctx->opcode), t2);                                             \
>      if (isfloat) {                                                            \
> -        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
> -                        cpu_fpr[rD(ctx->opcode)]);                            \
> +        get_fpr(t0, rD(ctx->opcode));                                         \
> +        gen_helper_frsp(t2, cpu_env, t0);                                     \
> +        set_fpr(rD(ctx->opcode), t2);                                         \
>      }                                                                         \
>      if (set_fprf) {                                                           \
> -        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
> +        gen_compute_fprf_float64(t2);                                         \
>      }                                                                         \
>      if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
>          gen_set_cr1_from_fpscr(ctx);                                          \
>      }                                                                         \
> +    tcg_temp_free_i64(t0);                                                    \
> +    tcg_temp_free_i64(t1);                                                    \
> +    tcg_temp_free_i64(t2);                                                    \
>  }
>  #define GEN_FLOAT_AB(name, op2, inval, set_fprf, type)                        \
>  _GEN_FLOAT_AB(name, name, 0x3F, op2, inval, 0, set_fprf, type);               \
> @@ -87,24 +113,35 @@ _GEN_FLOAT_AB(name##s, name, 0x3B, op2, inval, 1, set_fprf, type);
>  #define _GEN_FLOAT_AC(name, op, op1, op2, inval, isfloat, set_fprf, type)     \
>  static void gen_f##name(DisasContext *ctx)                                    \
>  {                                                                             \
> +    TCGv_i64 t0;                                                              \
> +    TCGv_i64 t1;                                                              \
> +    TCGv_i64 t2;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
> +    t0 = tcg_temp_new_i64();                                                  \
> +    t1 = tcg_temp_new_i64();                                                  \
> +    t2 = tcg_temp_new_i64();                                                  \
>      gen_reset_fpstatus();                                                     \
> -    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
> -                     cpu_fpr[rA(ctx->opcode)],                                \
> -                     cpu_fpr[rC(ctx->opcode)]);                               \
> +    get_fpr(t0, rA(ctx->opcode));                                             \
> +    get_fpr(t1, rC(ctx->opcode));                                             \
> +    gen_helper_f##op(t2, cpu_env, t0, t1);                                    \
> +    set_fpr(rD(ctx->opcode), t2);                                             \
>      if (isfloat) {                                                            \
> -        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
> -                        cpu_fpr[rD(ctx->opcode)]);                            \
> +        get_fpr(t0, rD(ctx->opcode));                                         \
> +        gen_helper_frsp(t2, cpu_env, t0);                                     \
> +        set_fpr(rD(ctx->opcode), t2);                                         \
>      }                                                                         \
>      if (set_fprf) {                                                           \
> -        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
> +        gen_compute_fprf_float64(t2);                                         \
>      }                                                                         \
>      if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
>          gen_set_cr1_from_fpscr(ctx);                                          \
>      }                                                                         \
> +    tcg_temp_free_i64(t0);                                                    \
> +    tcg_temp_free_i64(t1);                                                    \
> +    tcg_temp_free_i64(t2);                                                    \
>  }
>  #define GEN_FLOAT_AC(name, op2, inval, set_fprf, type)                        \
>  _GEN_FLOAT_AC(name, name, 0x3F, op2, inval, 0, set_fprf, type);               \
> @@ -113,37 +150,51 @@ _GEN_FLOAT_AC(name##s, name, 0x3B, op2, inval, 1, set_fprf, type);
>  #define GEN_FLOAT_B(name, op2, op3, set_fprf, type)                           \
>  static void gen_f##name(DisasContext *ctx)                                    \
>  {                                                                             \
> +    TCGv_i64 t0;                                                              \
> +    TCGv_i64 t1;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
> +    t0 = tcg_temp_new_i64();                                                  \
> +    t1 = tcg_temp_new_i64();                                                  \
>      gen_reset_fpstatus();                                                     \
> -    gen_helper_f##name(cpu_fpr[rD(ctx->opcode)], cpu_env,                     \
> -                       cpu_fpr[rB(ctx->opcode)]);                             \
> +    get_fpr(t0, rB(ctx->opcode));                                             \
> +    gen_helper_f##name(t1, cpu_env, t0);                                      \
> +    set_fpr(rD(ctx->opcode), t1);                                             \
>      if (set_fprf) {                                                           \
> -        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
> +        gen_compute_fprf_float64(t1);                                         \
>      }                                                                         \
>      if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
>          gen_set_cr1_from_fpscr(ctx);                                          \
>      }                                                                         \
> +    tcg_temp_free_i64(t0);                                                    \
> +    tcg_temp_free_i64(t1);                                                    \
>  }
>  
>  #define GEN_FLOAT_BS(name, op1, op2, set_fprf, type)                          \
>  static void gen_f##name(DisasContext *ctx)                                    \
>  {                                                                             \
> +    TCGv_i64 t0;                                                              \
> +    TCGv_i64 t1;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
> +    t0 = tcg_temp_new_i64();                                                  \
> +    t1 = tcg_temp_new_i64();                                                  \
>      gen_reset_fpstatus();                                                     \
> -    gen_helper_f##name(cpu_fpr[rD(ctx->opcode)], cpu_env,                     \
> -                       cpu_fpr[rB(ctx->opcode)]);                             \
> +    get_fpr(t0, rB(ctx->opcode));                                             \
> +    gen_helper_f##name(t1, cpu_env, t0);                                      \
> +    set_fpr(rD(ctx->opcode), t1);                                             \
>      if (set_fprf) {                                                           \
> -        gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);                   \
> +        gen_compute_fprf_float64(t1);                                         \
>      }                                                                         \
>      if (unlikely(Rc(ctx->opcode) != 0)) {                                     \
>          gen_set_cr1_from_fpscr(ctx);                                          \
>      }                                                                         \
> +    tcg_temp_free_i64(t0);                                                    \
> +    tcg_temp_free_i64(t1);                                                    \
>  }
>  
>  /* fadd - fadds */
> @@ -165,19 +216,25 @@ GEN_FLOAT_BS(rsqrte, 0x3F, 0x1A, 1, PPC_FLOAT_FRSQRTE);
>  /* frsqrtes */
>  static void gen_frsqrtes(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
>      gen_reset_fpstatus();
> -    gen_helper_frsqrte(cpu_fpr[rD(ctx->opcode)], cpu_env,
> -                       cpu_fpr[rB(ctx->opcode)]);
> -    gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,
> -                    cpu_fpr[rD(ctx->opcode)]);
> -    gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);
> +    get_fpr(t0, rB(ctx->opcode));
> +    gen_helper_frsqrte(t1, cpu_env, t0);
> +    set_fpr(rD(ctx->opcode), t1);
> +    gen_helper_frsp(t1, cpu_env, t1);
> +    gen_compute_fprf_float64(t1);
>      if (unlikely(Rc(ctx->opcode) != 0)) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* fsel */
> @@ -189,34 +246,47 @@ GEN_FLOAT_AB(sub, 0x14, 0x000007C0, 1, PPC_FLOAT);
>  /* fsqrt */
>  static void gen_fsqrt(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
>      gen_reset_fpstatus();
> -    gen_helper_fsqrt(cpu_fpr[rD(ctx->opcode)], cpu_env,
> -                     cpu_fpr[rB(ctx->opcode)]);
> -    gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);
> +    get_fpr(t0, rB(ctx->opcode));
> +    gen_helper_fsqrt(t1, cpu_env, t0);
> +    set_fpr(rD(ctx->opcode), t1);
> +    gen_compute_fprf_float64(t1);
>      if (unlikely(Rc(ctx->opcode) != 0)) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  static void gen_fsqrts(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
>      gen_reset_fpstatus();
> -    gen_helper_fsqrt(cpu_fpr[rD(ctx->opcode)], cpu_env,
> -                     cpu_fpr[rB(ctx->opcode)]);
> -    gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,
> -                    cpu_fpr[rD(ctx->opcode)]);
> -    gen_compute_fprf_float64(cpu_fpr[rD(ctx->opcode)]);
> +    get_fpr(t0, rB(ctx->opcode));
> +    gen_helper_fsqrt(t1, cpu_env, t0);
> +    set_fpr(rD(ctx->opcode), t1);
> +    gen_helper_frsp(t1, cpu_env, t1);
> +    gen_compute_fprf_float64(t1);
>      if (unlikely(Rc(ctx->opcode) != 0)) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /***                     Floating-Point multiply-and-add                   ***/
> @@ -268,21 +338,32 @@ GEN_FLOAT_B(rim, 0x08, 0x0F, 1, PPC_FLOAT_EXT);
>  
>  static void gen_ftdiv(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    gen_helper_ftdiv(cpu_crf[crfD(ctx->opcode)], cpu_fpr[rA(ctx->opcode)],
> -                     cpu_fpr[rB(ctx->opcode)]);
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
> +    get_fpr(t0, rA(ctx->opcode));
> +    get_fpr(t1, rB(ctx->opcode));
> +    gen_helper_ftdiv(cpu_crf[crfD(ctx->opcode)], t0, t1);
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  static void gen_ftsqrt(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    gen_helper_ftsqrt(cpu_crf[crfD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);
> +    t0 = tcg_temp_new_i64();
> +    get_fpr(t0, rB(ctx->opcode));
> +    gen_helper_ftsqrt(cpu_crf[crfD(ctx->opcode)], t0);
> +    tcg_temp_free_i64(t0);
>  }
>  
>  
> @@ -293,32 +374,46 @@ static void gen_ftsqrt(DisasContext *ctx)
>  static void gen_fcmpo(DisasContext *ctx)
>  {
>      TCGv_i32 crf;
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
>      gen_reset_fpstatus();
>      crf = tcg_const_i32(crfD(ctx->opcode));
> -    gen_helper_fcmpo(cpu_env, cpu_fpr[rA(ctx->opcode)],
> -                     cpu_fpr[rB(ctx->opcode)], crf);
> +    get_fpr(t0, rA(ctx->opcode));
> +    get_fpr(t1, rB(ctx->opcode));
> +    gen_helper_fcmpo(cpu_env, t0, t1, crf);
>      tcg_temp_free_i32(crf);
>      gen_helper_float_check_status(cpu_env);
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* fcmpu */
>  static void gen_fcmpu(DisasContext *ctx)
>  {
>      TCGv_i32 crf;
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
>      gen_reset_fpstatus();
>      crf = tcg_const_i32(crfD(ctx->opcode));
> -    gen_helper_fcmpu(cpu_env, cpu_fpr[rA(ctx->opcode)],
> -                     cpu_fpr[rB(ctx->opcode)], crf);
> +    get_fpr(t0, rA(ctx->opcode));
> +    get_fpr(t1, rB(ctx->opcode));
> +    gen_helper_fcmpu(cpu_env, t0, t1, crf);
>      tcg_temp_free_i32(crf);
>      gen_helper_float_check_status(cpu_env);
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /***                         Floating-point move                           ***/
> @@ -326,100 +421,153 @@ static void gen_fcmpu(DisasContext *ctx)
>  /* XXX: beware that fabs never checks for NaNs nor update FPSCR */
>  static void gen_fabs(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    tcg_gen_andi_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)],
> -                     ~(1ULL << 63));
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
> +    get_fpr(t0, rB(ctx->opcode));
> +    tcg_gen_andi_i64(t1, t0, ~(1ULL << 63));
> +    set_fpr(rD(ctx->opcode), t1);
>      if (unlikely(Rc(ctx->opcode))) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* fmr  - fmr. */
>  /* XXX: beware that fmr never checks for NaNs nor update FPSCR */
>  static void gen_fmr(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    tcg_gen_mov_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);
> +    t0 = tcg_temp_new_i64();
> +    get_fpr(t0, rB(ctx->opcode));
> +    set_fpr(rD(ctx->opcode), t0);
>      if (unlikely(Rc(ctx->opcode))) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* fnabs */
>  /* XXX: beware that fnabs never checks for NaNs nor update FPSCR */
>  static void gen_fnabs(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    tcg_gen_ori_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)],
> -                    1ULL << 63);
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
> +    get_fpr(t0, rB(ctx->opcode));
> +    tcg_gen_ori_i64(t1, t0, 1ULL << 63);
> +    set_fpr(rD(ctx->opcode), t1);
>      if (unlikely(Rc(ctx->opcode))) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* fneg */
>  /* XXX: beware that fneg never checks for NaNs nor update FPSCR */
>  static void gen_fneg(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    tcg_gen_xori_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rB(ctx->opcode)],
> -                     1ULL << 63);
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
> +    get_fpr(t0, rB(ctx->opcode));
> +    tcg_gen_xori_i64(t1, t0, 1ULL << 63);
> +    set_fpr(rD(ctx->opcode), t1);
>      if (unlikely(Rc(ctx->opcode))) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* fcpsgn: PowerPC 2.05 specification */
>  /* XXX: beware that fcpsgn never checks for NaNs nor update FPSCR */
>  static void gen_fcpsgn(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
> +    TCGv_i64 t2;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    tcg_gen_deposit_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rA(ctx->opcode)],
> -                        cpu_fpr[rB(ctx->opcode)], 0, 63);
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
> +    t2 = tcg_temp_new_i64();
> +    get_fpr(t0, rA(ctx->opcode));
> +    get_fpr(t1, rB(ctx->opcode));
> +    tcg_gen_deposit_i64(t2, t0, t1, 0, 63);
> +    set_fpr(rD(ctx->opcode), t2);
>      if (unlikely(Rc(ctx->opcode))) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
>  }
>  
>  static void gen_fmrgew(DisasContext *ctx)
>  {
>      TCGv_i64 b0;
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
>      b0 = tcg_temp_new_i64();
> -    tcg_gen_shri_i64(b0, cpu_fpr[rB(ctx->opcode)], 32);
> -    tcg_gen_deposit_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpr[rA(ctx->opcode)],
> -                        b0, 0, 32);
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
> +    get_fpr(t0, rB(ctx->opcode));
> +    tcg_gen_shri_i64(b0, t0, 32);
> +    get_fpr(t0, rA(ctx->opcode));
> +    tcg_gen_deposit_i64(t1, t0, b0, 0, 32);
> +    set_fpr(rD(ctx->opcode), t1);
>      tcg_temp_free_i64(b0);
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  static void gen_fmrgow(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
> +    TCGv_i64 t1;
> +    TCGv_i64 t2;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> -    tcg_gen_deposit_i64(cpu_fpr[rD(ctx->opcode)],
> -                        cpu_fpr[rB(ctx->opcode)],
> -                        cpu_fpr[rA(ctx->opcode)],
> -                        32, 32);
> +    t0 = tcg_temp_new_i64();
> +    t1 = tcg_temp_new_i64();
> +    t2 = tcg_temp_new_i64();
> +    get_fpr(t0, rB(ctx->opcode));
> +    get_fpr(t1, rA(ctx->opcode));
> +    tcg_gen_deposit_i64(t2, t0, t1, 32, 32);
> +    set_fpr(rD(ctx->opcode), t2);
> +    tcg_temp_free_i64(t0);
> +    tcg_temp_free_i64(t1);
> +    tcg_temp_free_i64(t2);
>  }
>  
>  /***                  Floating-Point status & ctrl register                ***/
> @@ -458,15 +606,19 @@ static void gen_mcrfs(DisasContext *ctx)
>  /* mffs */
>  static void gen_mffs(DisasContext *ctx)
>  {
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
> +    t0 = tcg_temp_new_i64();
>      gen_reset_fpstatus();
> -    tcg_gen_extu_tl_i64(cpu_fpr[rD(ctx->opcode)], cpu_fpscr);
> +    tcg_gen_extu_tl_i64(t0, cpu_fpscr);
> +    set_fpr(rD(ctx->opcode), t0);
>      if (unlikely(Rc(ctx->opcode))) {
>          gen_set_cr1_from_fpscr(ctx);
>      }
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* mtfsb0 */
> @@ -522,6 +674,7 @@ static void gen_mtfsb1(DisasContext *ctx)
>  static void gen_mtfsf(DisasContext *ctx)
>  {
>      TCGv_i32 t0;
> +    TCGv_i64 t1;
>      int flm, l, w;
>  
>      if (unlikely(!ctx->fpu_enabled)) {
> @@ -541,7 +694,9 @@ static void gen_mtfsf(DisasContext *ctx)
>      } else {
>          t0 = tcg_const_i32(flm << (w * 8));
>      }
> -    gen_helper_store_fpscr(cpu_env, cpu_fpr[rB(ctx->opcode)], t0);
> +    t1 = tcg_temp_new_i64();
> +    get_fpr(t1, rB(ctx->opcode));
> +    gen_helper_store_fpscr(cpu_env, t1, t0);
>      tcg_temp_free_i32(t0);
>      if (unlikely(Rc(ctx->opcode) != 0)) {
>          tcg_gen_trunc_tl_i32(cpu_crf[1], cpu_fpscr);
> @@ -549,6 +704,7 @@ static void gen_mtfsf(DisasContext *ctx)
>      }
>      /* We can raise a differed exception */
>      gen_helper_float_check_status(cpu_env);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* mtfsfi */
> @@ -588,21 +744,26 @@ static void gen_mtfsfi(DisasContext *ctx)
>  static void glue(gen_, name)(DisasContext *ctx)                                       \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
> +    t0 = tcg_temp_new_i64();                                                  \
>      gen_addr_imm_index(ctx, EA, 0);                                           \
> -    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
> +    gen_qemu_##ldop(ctx, t0, EA);                                             \
> +    set_fpr(rD(ctx->opcode), t0);                                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_LDUF(name, ldop, opc, type)                                       \
>  static void glue(gen_, name##u)(DisasContext *ctx)                                    \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
> @@ -613,20 +774,25 @@ static void glue(gen_, name##u)(DisasContext *ctx)
>      }                                                                         \
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
> +    t0 = tcg_temp_new_i64();                                                  \
>      gen_addr_imm_index(ctx, EA, 0);                                           \
> -    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
> +    gen_qemu_##ldop(ctx, t0, EA);                                             \
> +    set_fpr(rD(ctx->opcode), t0);                                             \
>      tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_LDUXF(name, ldop, opc, type)                                      \
>  static void glue(gen_, name##ux)(DisasContext *ctx)                                   \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
> +    t0 = tcg_temp_new_i64();                                                  \
>      if (unlikely(rA(ctx->opcode) == 0)) {                                     \
>          gen_inval_exception(ctx, POWERPC_EXCP_INVAL_INVAL);                   \
>          return;                                                               \
> @@ -634,24 +800,30 @@ static void glue(gen_, name##ux)(DisasContext *ctx)
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
>      gen_addr_reg_index(ctx, EA);                                              \
> -    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
> +    gen_qemu_##ldop(ctx, t0, EA);                                             \
> +    set_fpr(rD(ctx->opcode), t0);                                             \
>      tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_LDXF(name, ldop, opc2, opc3, type)                                \
>  static void glue(gen_, name##x)(DisasContext *ctx)                                    \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
> +    t0 = tcg_temp_new_i64();                                                  \
>      gen_addr_reg_index(ctx, EA);                                              \
> -    gen_qemu_##ldop(ctx, cpu_fpr[rD(ctx->opcode)], EA);                       \
> +    gen_qemu_##ldop(ctx, t0, EA);                                             \
> +    set_fpr(rD(ctx->opcode), t0);                                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_LDFS(name, ldop, op, type)                                        \
> @@ -677,6 +849,7 @@ GEN_LDFS(lfs, ld32fs, 0x10, PPC_FLOAT);
>  static void gen_lfdepx(DisasContext *ctx)
>  {
>      TCGv EA;
> +    TCGv_i64 t0;
>      CHK_SV;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
> @@ -684,16 +857,19 @@ static void gen_lfdepx(DisasContext *ctx)
>      }
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
> +    t0 = tcg_temp_new_i64();
>      gen_addr_reg_index(ctx, EA);
> -    tcg_gen_qemu_ld_i64(cpu_fpr[rD(ctx->opcode)], EA, PPC_TLB_EPID_LOAD,
> -        DEF_MEMOP(MO_Q));
> +    tcg_gen_qemu_ld_i64(t0, EA, PPC_TLB_EPID_LOAD, DEF_MEMOP(MO_Q));
> +    set_fpr(rD(ctx->opcode), t0);
>      tcg_temp_free(EA);
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* lfdp */
>  static void gen_lfdp(DisasContext *ctx)
>  {
>      TCGv EA;
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
> @@ -701,24 +877,31 @@ static void gen_lfdp(DisasContext *ctx)
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
>      gen_addr_imm_index(ctx, EA, 0);
> +    t0 = tcg_temp_new_i64();
>      /* We only need to swap high and low halves. gen_qemu_ld64_i64 does
>         necessary 64-bit byteswap already. */
>      if (unlikely(ctx->le_mode)) {
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode) + 1, t0);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode), t0);
>      } else {
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode), t0);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode) + 1, t0);
>      }
>      tcg_temp_free(EA);
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* lfdpx */
>  static void gen_lfdpx(DisasContext *ctx)
>  {
>      TCGv EA;
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
> @@ -726,18 +909,24 @@ static void gen_lfdpx(DisasContext *ctx)
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
>      gen_addr_reg_index(ctx, EA);
> +    t0 = tcg_temp_new_i64();
>      /* We only need to swap high and low halves. gen_qemu_ld64_i64 does
>         necessary 64-bit byteswap already. */
>      if (unlikely(ctx->le_mode)) {
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode) + 1, t0);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode), t0);
>      } else {
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode), t0);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_ld64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        gen_qemu_ld64_i64(ctx, t0, EA);
> +        set_fpr(rD(ctx->opcode) + 1, t0);
>      }
>      tcg_temp_free(EA);
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* lfiwax */
> @@ -745,6 +934,7 @@ static void gen_lfiwax(DisasContext *ctx)
>  {
>      TCGv EA;
>      TCGv t0;
> +    TCGv_i64 t1;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
> @@ -752,47 +942,59 @@ static void gen_lfiwax(DisasContext *ctx)
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
>      t0 = tcg_temp_new();
> +    t1 = tcg_temp_new_i64();
>      gen_addr_reg_index(ctx, EA);
>      gen_qemu_ld32s(ctx, t0, EA);
> -    tcg_gen_ext_tl_i64(cpu_fpr[rD(ctx->opcode)], t0);
> +    tcg_gen_ext_tl_i64(t1, t0);
> +    set_fpr(rD(ctx->opcode), t1);
>      tcg_temp_free(EA);
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* lfiwzx */
>  static void gen_lfiwzx(DisasContext *ctx)
>  {
>      TCGv EA;
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
> +    t0 = tcg_temp_new_i64();
>      gen_addr_reg_index(ctx, EA);
> -    gen_qemu_ld32u_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +    gen_qemu_ld32u_i64(ctx, t0, EA);
> +    set_fpr(rD(ctx->opcode), t0);
>      tcg_temp_free(EA);
> +    tcg_temp_free_i64(t0);
>  }
>  /***                         Floating-point store                          ***/
>  #define GEN_STF(name, stop, opc, type)                                        \
>  static void glue(gen_, name)(DisasContext *ctx)                                       \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
> +    t0 = tcg_temp_new_i64();                                                  \
>      gen_addr_imm_index(ctx, EA, 0);                                           \
> -    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
> +    get_fpr(t0, rS(ctx->opcode));                                             \
> +    gen_qemu_##stop(ctx, t0, EA);                                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_STUF(name, stop, opc, type)                                       \
>  static void glue(gen_, name##u)(DisasContext *ctx)                                    \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
> @@ -803,16 +1005,20 @@ static void glue(gen_, name##u)(DisasContext *ctx)
>      }                                                                         \
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
> +    t0 = tcg_temp_new_i64();                                                  \
>      gen_addr_imm_index(ctx, EA, 0);                                           \
> -    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
> +    get_fpr(t0, rS(ctx->opcode));                                             \
> +    gen_qemu_##stop(ctx, t0, EA);                                             \
>      tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_STUXF(name, stop, opc, type)                                      \
>  static void glue(gen_, name##ux)(DisasContext *ctx)                                   \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
> @@ -823,25 +1029,32 @@ static void glue(gen_, name##ux)(DisasContext *ctx)
>      }                                                                         \
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
> +    t0 = tcg_temp_new_i64();                                                  \
>      gen_addr_reg_index(ctx, EA);                                              \
> -    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
> +    get_fpr(t0, rS(ctx->opcode));                                             \
> +    gen_qemu_##stop(ctx, t0, EA);                                             \
>      tcg_gen_mov_tl(cpu_gpr[rA(ctx->opcode)], EA);                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_STXF(name, stop, opc2, opc3, type)                                \
>  static void glue(gen_, name##x)(DisasContext *ctx)                                    \
>  {                                                                             \
>      TCGv EA;                                                                  \
> +    TCGv_i64 t0;                                                              \
>      if (unlikely(!ctx->fpu_enabled)) {                                        \
>          gen_exception(ctx, POWERPC_EXCP_FPU);                                 \
>          return;                                                               \
>      }                                                                         \
>      gen_set_access_type(ctx, ACCESS_FLOAT);                                   \
>      EA = tcg_temp_new();                                                      \
> +    t0 = tcg_temp_new_i64();                                                  \
>      gen_addr_reg_index(ctx, EA);                                              \
> -    gen_qemu_##stop(ctx, cpu_fpr[rS(ctx->opcode)], EA);                       \
> +    get_fpr(t0, rS(ctx->opcode));                                             \
> +    gen_qemu_##stop(ctx, t0, EA);                                             \
>      tcg_temp_free(EA);                                                        \
> +    tcg_temp_free_i64(t0);                                                    \
>  }
>  
>  #define GEN_STFS(name, stop, op, type)                                        \
> @@ -867,6 +1080,7 @@ GEN_STFS(stfs, st32fs, 0x14, PPC_FLOAT);
>  static void gen_stfdepx(DisasContext *ctx)
>  {
>      TCGv EA;
> +    TCGv_i64 t0;
>      CHK_SV;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
> @@ -874,60 +1088,76 @@ static void gen_stfdepx(DisasContext *ctx)
>      }
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
> +    t0 = tcg_temp_new_i64();
>      gen_addr_reg_index(ctx, EA);
> -    tcg_gen_qemu_st_i64(cpu_fpr[rD(ctx->opcode)], EA, PPC_TLB_EPID_STORE,
> -                       DEF_MEMOP(MO_Q));
> +    get_fpr(t0, rD(ctx->opcode));
> +    tcg_gen_qemu_st_i64(t0, EA, PPC_TLB_EPID_STORE, DEF_MEMOP(MO_Q));
>      tcg_temp_free(EA);
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* stfdp */
>  static void gen_stfdp(DisasContext *ctx)
>  {
>      TCGv EA;
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
> +    t0 = tcg_temp_new_i64();
>      gen_addr_imm_index(ctx, EA, 0);
>      /* We only need to swap high and low halves. gen_qemu_st64_i64 does
>         necessary 64-bit byteswap already. */
>      if (unlikely(ctx->le_mode)) {
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        get_fpr(t0, rD(ctx->opcode) + 1);
> +        gen_qemu_st64_i64(ctx, t0, EA);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        get_fpr(t0, rD(ctx->opcode));
> +        gen_qemu_st64_i64(ctx, t0, EA);
>      } else {
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        get_fpr(t0, rD(ctx->opcode));
> +        gen_qemu_st64_i64(ctx, t0, EA);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        get_fpr(t0, rD(ctx->opcode) + 1);
> +        gen_qemu_st64_i64(ctx, t0, EA);
>      }
>      tcg_temp_free(EA);
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* stfdpx */
>  static void gen_stfdpx(DisasContext *ctx)
>  {
>      TCGv EA;
> +    TCGv_i64 t0;
>      if (unlikely(!ctx->fpu_enabled)) {
>          gen_exception(ctx, POWERPC_EXCP_FPU);
>          return;
>      }
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      EA = tcg_temp_new();
> +    t0 = tcg_temp_new_i64();
>      gen_addr_reg_index(ctx, EA);
>      /* We only need to swap high and low halves. gen_qemu_st64_i64 does
>         necessary 64-bit byteswap already. */
>      if (unlikely(ctx->le_mode)) {
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        get_fpr(t0, rD(ctx->opcode) + 1);
> +        gen_qemu_st64_i64(ctx, t0, EA);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        get_fpr(t0, rD(ctx->opcode));
> +        gen_qemu_st64_i64(ctx, t0, EA);
>      } else {
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode)], EA);
> +        get_fpr(t0, rD(ctx->opcode));
> +        gen_qemu_st64_i64(ctx, t0, EA);
>          tcg_gen_addi_tl(EA, EA, 8);
> -        gen_qemu_st64_i64(ctx, cpu_fpr[rD(ctx->opcode) + 1], EA);
> +        get_fpr(t0, rD(ctx->opcode) + 1);
> +        gen_qemu_st64_i64(ctx, t0, EA);
>      }
>      tcg_temp_free(EA);
> +    tcg_temp_free_i64(t0);
>  }
>  
>  /* Optional: */
> @@ -949,13 +1179,18 @@ static void gen_lfq(DisasContext *ctx)
>  {
>      int rd = rD(ctx->opcode);
>      TCGv t0;
> +    TCGv_i64 t1;
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      t0 = tcg_temp_new();
> +    t1 = tcg_temp_new_i64();
>      gen_addr_imm_index(ctx, t0, 0);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
> +    gen_qemu_ld64_i64(ctx, t1, t0);
> +    set_fpr(rd, t1);
>      gen_addr_add(ctx, t0, t0, 8);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
> +    gen_qemu_ld64_i64(ctx, t1, t0);
> +    set_fpr((rd + 1) % 32, t1);
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* lfqu */
> @@ -964,17 +1199,22 @@ static void gen_lfqu(DisasContext *ctx)
>      int ra = rA(ctx->opcode);
>      int rd = rD(ctx->opcode);
>      TCGv t0, t1;
> +    TCGv_i64 t2;
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      t0 = tcg_temp_new();
>      t1 = tcg_temp_new();
> +    t2 = tcg_temp_new_i64();
>      gen_addr_imm_index(ctx, t0, 0);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
> +    gen_qemu_ld64_i64(ctx, t2, t0);
> +    set_fpr(rd, t2);
>      gen_addr_add(ctx, t1, t0, 8);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
> +    gen_qemu_ld64_i64(ctx, t2, t1);
> +    set_fpr((rd + 1) % 32, t2);
>      if (ra != 0)
>          tcg_gen_mov_tl(cpu_gpr[ra], t0);
>      tcg_temp_free(t0);
>      tcg_temp_free(t1);
> +    tcg_temp_free_i64(t2);
>  }
>  
>  /* lfqux */
> @@ -984,16 +1224,21 @@ static void gen_lfqux(DisasContext *ctx)
>      int rd = rD(ctx->opcode);
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      TCGv t0, t1;
> +    TCGv_i64 t2;
> +    t2 = tcg_temp_new_i64();
>      t0 = tcg_temp_new();
>      gen_addr_reg_index(ctx, t0);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
> +    gen_qemu_ld64_i64(ctx, t2, t0);
> +    set_fpr(rd, t2);
>      t1 = tcg_temp_new();
>      gen_addr_add(ctx, t1, t0, 8);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
> +    gen_qemu_ld64_i64(ctx, t2, t1);
> +    set_fpr((rd + 1) % 32, t2);
>      tcg_temp_free(t1);
>      if (ra != 0)
>          tcg_gen_mov_tl(cpu_gpr[ra], t0);
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t2);
>  }
>  
>  /* lfqx */
> @@ -1001,13 +1246,18 @@ static void gen_lfqx(DisasContext *ctx)
>  {
>      int rd = rD(ctx->opcode);
>      TCGv t0;
> +    TCGv_i64 t1;
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      t0 = tcg_temp_new();
> +    t1 = tcg_temp_new_i64();
>      gen_addr_reg_index(ctx, t0);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[rd], t0);
> +    gen_qemu_ld64_i64(ctx, t1, t0);
> +    set_fpr(rd, t1);
>      gen_addr_add(ctx, t0, t0, 8);
> -    gen_qemu_ld64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
> +    gen_qemu_ld64_i64(ctx, t1, t0);
> +    set_fpr((rd + 1) % 32, t1);
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* stfq */
> @@ -1015,13 +1265,18 @@ static void gen_stfq(DisasContext *ctx)
>  {
>      int rd = rD(ctx->opcode);
>      TCGv t0;
> +    TCGv_i64 t1;
>      gen_set_access_type(ctx, ACCESS_FLOAT);
>      t0 = tcg_temp_new();
> +    t1 = tcg_temp_new_i64();
>      gen_addr_imm_index(ctx, t0, 0);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
> +    get_fpr(t1, rd);
> +    gen_qemu_st64_i64(ctx, t1, t0);
>      gen_addr_add(ctx, t0, t0, 8);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
> +    get_fpr(t1, (rd + 1) % 32);
> +    gen_qemu_st64_i64(ctx, t1, t0);
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  /* stfqu */
> @@ -1030,17 +1285,23 @@ static void gen_stfqu(DisasContext *ctx)
>      int ra = rA(ctx->opcode);
>      int rd = rD(ctx->opcode);
>      TCGv t0, t1;
> +    TCGv_i64 t2;
>      gen_set_access_type(ctx, ACCESS_FLOAT);
> +    t2 = tcg_temp_new_i64();
>      t0 = tcg_temp_new();
>      gen_addr_imm_index(ctx, t0, 0);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
> +    get_fpr(t2, rd);
> +    gen_qemu_st64_i64(ctx, t2, t0);
>      t1 = tcg_temp_new();
>      gen_addr_add(ctx, t1, t0, 8);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
> +    get_fpr(t2, (rd + 1) % 32);
> +    gen_qemu_st64_i64(ctx, t2, t1);
>      tcg_temp_free(t1);
> -    if (ra != 0)
> +    if (ra != 0) {
>          tcg_gen_mov_tl(cpu_gpr[ra], t0);
> +    }
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t2);
>  }
>  
>  /* stfqux */
> @@ -1049,17 +1310,23 @@ static void gen_stfqux(DisasContext *ctx)
>      int ra = rA(ctx->opcode);
>      int rd = rD(ctx->opcode);
>      TCGv t0, t1;
> +    TCGv_i64 t2;
>      gen_set_access_type(ctx, ACCESS_FLOAT);
> +    t2 = tcg_temp_new_i64();
>      t0 = tcg_temp_new();
>      gen_addr_reg_index(ctx, t0);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
> +    get_fpr(t2, rd);
> +    gen_qemu_st64_i64(ctx, t2, t0);
>      t1 = tcg_temp_new();
>      gen_addr_add(ctx, t1, t0, 8);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t1);
> +    get_fpr(t2, (rd + 1) % 32);
> +    gen_qemu_st64_i64(ctx, t2, t1);
>      tcg_temp_free(t1);
> -    if (ra != 0)
> +    if (ra != 0) {
>          tcg_gen_mov_tl(cpu_gpr[ra], t0);
> +    }
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t2);
>  }
>  
>  /* stfqx */
> @@ -1067,13 +1334,18 @@ static void gen_stfqx(DisasContext *ctx)
>  {
>      int rd = rD(ctx->opcode);
>      TCGv t0;
> +    TCGv_i64 t1;
>      gen_set_access_type(ctx, ACCESS_FLOAT);
> +    t1 = tcg_temp_new_i64();
>      t0 = tcg_temp_new();
>      gen_addr_reg_index(ctx, t0);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[rd], t0);
> +    get_fpr(t1, rd);
> +    gen_qemu_st64_i64(ctx, t1, t0);
>      gen_addr_add(ctx, t0, t0, 8);
> -    gen_qemu_st64_i64(ctx, cpu_fpr[(rd + 1) % 32], t0);
> +    get_fpr(t1, (rd + 1) % 32);
> +    gen_qemu_st64_i64(ctx, t1, t0);
>      tcg_temp_free(t0);
> +    tcg_temp_free_i64(t1);
>  }
>  
>  #undef _GEN_FLOAT_ACB

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
                   ` (6 preceding siblings ...)
  2018-12-10  0:33 ` [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG " BALATON Zoltan
@ 2018-12-10 13:04 ` Aleksandar Markovic
  2018-12-11 19:11   ` Mark Cave-Ayland
  7 siblings, 1 reply; 33+ messages in thread
From: Aleksandar Markovic @ 2018-12-10 13:04 UTC (permalink / raw)
  To: Mark Cave-Ayland; +Cc: david, qemu-ppc, qemu-devel, richard.henderson

On Dec 7, 2018 9:59 AM, "Mark Cave-Ayland" <mark.cave-ayland@ilande.co.uk>
wrote:
>
> This patchset is an attempt at trying to improve the VMX (Altivec)
instruction
> performance by making use of the new TCG vector operations where possible.
>

Hello, Mark.

I just want to say that I support these efforts. Very interesting, it can
bring significant improvements for multimedia intensive emulations. But,
even more important, we can see new tcg vector interface in action,
possibly suggest improvements, extensions.

I would just like to ask you to add some performance comparisons, if
possible. Very simple tests would be sufficient.

Thanks again for this series!

Aleksandar

> In order to use TCG vector operations, the registers must be accessible
from cpu_env
> whilst currently they are accessed via arrays of static TCG globals.
Patches 1-3
> are therefore mechanical patches which introduce access helpers for FPR,
AVR and VSR
> registers using the supplied TCGv_i64 parameter.
>
> Once this is done, patch 4 enables us to remove the static TCG global
arrays and updates
> the access helpers to read/write to the relevant fields in cpu_env
directly.
>
> The final patches 5 and 6 convert the VMX logical instructions and
addition/subtraction
> instructions respectively over to the TCG vector operations.
>
> NOTE: there are a lot of instructions that cannot (yet) be optimised to
use TCG vector
> operations, however it struck me that there may be some potential for
converting
> saturating add/sub and cmp instructions if there were a mechanism to
return a set of
> flags indicating the result of the saturation/comparison.
>
> Finally thanks to Richard for taking the time to answer some of my
(mostly beginner)
> questions related to TCG.
>
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
>
>
> Mark Cave-Ayland (6):
>   target/ppc: introduce get_fpr() and set_fpr() helpers for FP register
>     access
>   target/ppc: introduce get_avr64() and set_avr64() helpers for VMX
>     register access
>   target/ppc: introduce get_cpu_vsr{l,h}() and set_cpu_vsr{l,h}()
>     helpers for VSR register access
>   target/ppc: switch FPR, VMX and VSX helpers to access data directly
>     from cpu_env
>   target/ppc: convert VMX logical instructions to use vector operations
>   target/ppc: convert vaddu[b,h,w,d] and vsubu[b,h,w,d] over to use
>     vector operations
>
>  target/ppc/helper.h                 |   8 -
>  target/ppc/int_helper.c             |   7 -
>  target/ppc/translate.c              |  72 ++--
>  target/ppc/translate/fp-impl.inc.c  | 492 ++++++++++++++++++-----
>  target/ppc/translate/vmx-impl.inc.c | 182 ++++++---
>  target/ppc/translate/vsx-impl.inc.c | 782
++++++++++++++++++++++++++----------
>  6 files changed, 1110 insertions(+), 433 deletions(-)
>
> --
> 2.11.0
>
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access
  2018-12-10  5:17   ` David Gibson
@ 2018-12-10 18:25     ` Richard Henderson
  2018-12-11  0:23       ` David Gibson
  2018-12-11 19:06     ` Mark Cave-Ayland
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 18:25 UTC (permalink / raw)
  To: David Gibson, Mark Cave-Ayland; +Cc: qemu-devel, qemu-ppc

On 12/9/18 11:17 PM, David Gibson wrote:
> On Fri, Dec 07, 2018 at 08:56:30AM +0000, Mark Cave-Ayland wrote:
>> These helpers allow us to move FP register values to/from the specified TCGv_i64
>> argument.
>>
>> To prevent FP helpers accessing the cpu_fpr array directly, add extra TCG
>> temporaries as required.
> 
> It's not obvious to me why that's a desirable thing.  I'm assuming
> it's somehow necessary for the stuff later in the series, but I think
> we need a brief rationale here to explain why this isn't just adding
> extra reg copies for the sake of it.

Note that while this introduces extra opcodes, in many cases it does not change
the number of machine instructions that are generated.  Recall that accessing
cpu_fpr[N] implies a load from env.  This change makes the load explicit.

The change does currently prevent caching cpu_fpr[N] in a host register.  That
can and will be fixed by optimizing on memory operations instead.  (There is a
patch that has been outstanding for 13 months to do this.  I intend to finally
get around to merging it during the 4.0 cycle.)


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access Mark Cave-Ayland
  2018-12-10  5:17   ` David Gibson
@ 2018-12-10 18:43   ` Richard Henderson
  2018-12-11 19:15     ` Mark Cave-Ayland
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 18:43 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
> -    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
> -                     cpu_fpr[rA(ctx->opcode)],                                \
> -                     cpu_fpr[rC(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);     \
> +    get_fpr(t0, rA(ctx->opcode));                                             \
> +    get_fpr(t1, rC(ctx->opcode));                                             \
> +    get_fpr(t2, rB(ctx->opcode));                                             \
> +    gen_helper_f##op(t3, cpu_env, t0, t1, t2);                                \
> +    set_fpr(rD(ctx->opcode), t3);                                             \
>      if (isfloat) {                                                            \
> -        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
> -                        cpu_fpr[rD(ctx->opcode)]);                            \
> +        get_fpr(t0, rD(ctx->opcode));                                         \
> +        gen_helper_frsp(t3, cpu_env, t0);                                     \
> +        set_fpr(rD(ctx->opcode), t3);                                         \
>      }                                                                         \

This is an accurate conversion, but the writeback to the rD register need not
happen until after helper_frsp.  Just move it below the isfloat block.

I do see that helper_frsp can raise an exception for invalid_op for SNaN.  If
that code were actually reachable, this would have been an existing bug, in
that we should not have written back to rD after the first operation.  However,
any SNaN will already have been eliminated by the first operation (via
squashing to QNaN or by exiting via exception).

Similarly in GEN_FLOAT_AB.

> +    get_fpr(t0, rB(ctx->opcode));
> +    gen_helper_frsqrte(t1, cpu_env, t0);
> +    set_fpr(rD(ctx->opcode), t1);
> +    gen_helper_frsp(t1, cpu_env, t1);
> +    gen_compute_fprf_float64(t1);

gen_frsqrtes has the set_fpr in the wrong place.  Likewise gen_fsqrts.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/6] target/ppc: introduce get_avr64() and set_avr64() helpers for VMX register access
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 2/6] target/ppc: introduce get_avr64() and set_avr64() helpers for VMX " Mark Cave-Ayland
@ 2018-12-10 18:49   ` Richard Henderson
  2018-12-11 19:16     ` Mark Cave-Ayland
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 18:49 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
> +    avr = tcg_temp_new_i64();                                                 \
>      EA = tcg_temp_new();                                                      \
>      gen_addr_reg_index(ctx, EA);                                              \
>      tcg_gen_andi_tl(EA, EA, ~0xf);                                            \
>      /* We only need to swap high and low halves. gen_qemu_ld64_i64 does       \
>         necessary 64-bit byteswap already. */                                  \
>      if (ctx->le_mode) {                                                       \
> -        gen_qemu_ld64_i64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                \
> +        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
> +        set_avr64(rD(ctx->opcode), avr, false);                               \
>          tcg_gen_addi_tl(EA, EA, 8);                                           \
> -        gen_qemu_ld64_i64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                \
> +        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
> +        set_avr64(rD(ctx->opcode), avr, true);  

An accurate conversion, but I'm going to call this an existing bug:

The writeback to both avr{l,h} should be delayed until all exceptions have been
raised.  Thus you should perform the two gen_qemu_ld64_i64 into two temporaries
and only then write them both back via set_avr64.

Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env Mark Cave-Ayland
@ 2018-12-10 19:05   ` Richard Henderson
  2018-12-11 19:21     ` Mark Cave-Ayland
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 19:05 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
> Instead of accessing the FPR, VMX and VSX registers through static arrays of
> TCGv_i64 globals, remove them and change the helpers to load/store data directly
> within cpu_env.
> 
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
> ---
>  target/ppc/translate.c | 59 ++++++++++++++------------------------------------
>  1 file changed, 16 insertions(+), 43 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

Note however, that there are other steps that you must add here before using
vector operations in the next patch:

(1a) The fpr and vsr arrays must be merged, since fpr[n] == vsrh[n].
     If this isn't done, then you simply cannot apply one operation
     to two disjoint memory blocks.

(1b) The vsr and avr arrays should be merged, since vsr[32+n] == avr[n].
     This is simply tidiness, matching the layout to the architecture.

These steps will modify gdbstub.c, machine.c, and linux-user/.

(2) The vsr array needs to be QEMU_ALIGN(16).  See target/arm/cpu.h.
    We assert that the host addresses are 16 byte aligned, so that we
    can eventually use Altivec/VSX in tcg/ppc/.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 5/6] target/ppc: convert VMX logical instructions to use vector operations
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 5/6] target/ppc: convert VMX logical instructions to use vector operations Mark Cave-Ayland
@ 2018-12-10 19:08   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 19:08 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
> ---
>  target/ppc/translate.c              |  1 +
>  target/ppc/translate/vmx-impl.inc.c | 64 ++++++++++++++++++++++---------------
>  2 files changed, 40 insertions(+), 25 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 6/6] target/ppc: convert vaddu[b, h, w, d] and vsubu[b, h, w, d] over to use vector operations
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 6/6] target/ppc: convert vaddu[b, h, w, d] and vsubu[b, h, w, d] over " Mark Cave-Ayland
@ 2018-12-10 19:09   ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 19:09 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
> ---
>  target/ppc/helper.h                 |  8 --------
>  target/ppc/int_helper.c             |  7 -------
>  target/ppc/translate/vmx-impl.inc.c | 16 ++++++++--------
>  3 files changed, 8 insertions(+), 23 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 3/6] target/ppc: introduce get_cpu_vsr{l, h}() and set_cpu_vsr{l, h}() helpers for VSR register access
  2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 3/6] target/ppc: introduce get_cpu_vsr{l, h}() and set_cpu_vsr{l, h}() helpers for VSR " Mark Cave-Ayland
@ 2018-12-10 19:16   ` Richard Henderson
  2018-12-11 19:24     ` Mark Cave-Ayland
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 19:16 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
> +static inline void get_vsr(TCGv_i64 dst, int n)
> +{
> +    tcg_gen_ld_i64(dst, cpu_env, offsetof(CPUPPCState, vsr[n]));
> +}
> +
> +static inline void set_vsr(int n, TCGv_i64 src)
> +{
> +    tcg_gen_st_i64(src, cpu_env, offsetof(CPUPPCState, vsr[n]));
> +}

Why isn't this helper still using cpu_vsr[n]?


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-10  2:59   ` David Gibson
@ 2018-12-10 20:54     ` BALATON Zoltan
  2018-12-10 21:09       ` Richard Henderson
  2018-12-11  1:20       ` David Gibson
  0 siblings, 2 replies; 33+ messages in thread
From: BALATON Zoltan @ 2018-12-10 20:54 UTC (permalink / raw)
  To: David Gibson; +Cc: Mark Cave-Ayland, qemu-ppc, richard.henderson, qemu-devel

On Mon, 10 Dec 2018, David Gibson wrote:
> On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote:
>> On Fri, 7 Dec 2018, Mark Cave-Ayland wrote:
>>> This patchset is an attempt at trying to improve the VMX (Altivec) instruction
>>> performance by making use of the new TCG vector operations where possible.
>>
>> This is very welcome, thanks for doing this.
>>
>>> In order to use TCG vector operations, the registers must be accessible from cpu_env
>>> whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
>>> are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
>>> registers using the supplied TCGv_i64 parameter.
>>
>> Have you tried some benchmarks or tests to measure the impact of these
>> changes? I've tried the (very unscientific) benchmarks I've written about
>> before here:
>>
>> http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html
>>
>> (which seem to use AltiVec/VMX instructions but not sure which) on mac99
>> with MorphOS and I could not see any performance increase. I haven't run
>> enough tests but results with or without this series on master were mostly
>> the same within a few percents, and sometimes even seen lower performance
>> with these patches than without. I haven't tried to find out why (no time
>> for that now) so can't really draw any conclusions from this. I'm also not
>> sure if I've actually tested what you've changed or these use instructions
>> that your patches don't optimise yet, or the changes I've seen were just
>> normal changes between runs; but I wonder if the increased number of
>> temporaries could result in lower performance in some cases?
>
> What was your host machine.  IIUC this change will only improve
> performance if the host tcg backend is able to implement TCG vector
> ops in terms of vector ops on the host.

Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume 
x86_64 should be supported but not sure what are the CPU requirements.

> In addition, this series only converts a subset of the integer and
> logical vector instructions.  If your testcase is mostly floating
> point (vectored or otherwise), it will still be softfloat and so not
> see any speedup.

Yes, I don't really know what these tests use but I think "lame" test is 
mostly floating point but tried with "lame_vmx" which should at least use 
some vector ops and "mplayer -benchmark" test is more vmx dependent based 
on my previous profiling and testing with hardfloat but I'm not sure. 
(When testing these with hardfloat I've found that lame was benefiting 
from hardfloat but mplayer wasn't and more VMX related functions showed up 
with mplayer so I assumed it's more VMX bound.)

I've tried to do some profiling again to find out what's used but I can't 
get good results with the tools I have (oprofile stopped working since 
I've updated my machine and Linux perf provides results that are hard to 
interpret for me, haven't tried if gprof would work now it didn't before) 
but I've seen some vector related helpers in the profile so at least some 
vector ops are used. The "helper_vperm" came up top at about 11th (not 
sure where is it called from), other vector helpers were lower.

I don't remember details now but previously when testing hardfloat I've 
written this: "I've looked at vperm which came out top in one of the 
profiles I've taken and on little endian hosts it has the loop backwards 
and also accesses vector elements from end to front which I wonder may be 
enough for the compiler to not be able to optimise it? But I haven't 
checked assembly. The altivec dependent mplayer video decoding test did 
not change much with hardfloat, it took 98% compared to master so likely 
altivec is dominating here." (Although this was with the PPC specific 
vector helpers before VMX patch so not sure if this is still relevant.)

The top 10 in profile were still related to low level memory access and 
MMU management stuff as I've found before:

http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03609.html
http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03704.html

I think implementing i2c for mac99 may help this and some other 
optimisations may also be possible but I don't know enough about these to 
try that.

It also looks like with --enable-debug something is always flusing tlb and 
blowing away tb caches so these will be top in profile and likely dominate 
runtime so can't really use profile to measure impact of VMX patch. 
Without --enable-debug I can't get call graphs so can't get useful 
profile. I think I've looked at this before as well but can't remember now 
which check enabled by --enable-debug is responsible for constant tb 
cache flush and if that could be avoided. I just don't use --enable-debug 
since unless need to debug somthing.

Maybe the PPC softmmu should be reviewed and optimised by someone who 
knows it...

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-10 20:54     ` BALATON Zoltan
@ 2018-12-10 21:09       ` Richard Henderson
  2018-12-10 23:01         ` BALATON Zoltan
  2018-12-11  1:20       ` David Gibson
  1 sibling, 1 reply; 33+ messages in thread
From: Richard Henderson @ 2018-12-10 21:09 UTC (permalink / raw)
  To: BALATON Zoltan, David Gibson; +Cc: Mark Cave-Ayland, qemu-ppc, qemu-devel

On 12/10/18 2:54 PM, BALATON Zoltan wrote:
>> What was your host machine.  IIUC this change will only improve
>> performance if the host tcg backend is able to implement TCG vector
>> ops in terms of vector ops on the host.
> 
> Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64
> should be supported but not sure what are the CPU requirements.

Not quite.  I only support avx1 and later.

I thought about supporting sse4 and later (that's the minimum with all of the
instructions that do what we need), but there is only one cpu generation with
sse4 and without avx1, and avx1 is already 8 years old.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-10 21:09       ` Richard Henderson
@ 2018-12-10 23:01         ` BALATON Zoltan
  0 siblings, 0 replies; 33+ messages in thread
From: BALATON Zoltan @ 2018-12-10 23:01 UTC (permalink / raw)
  To: Richard Henderson; +Cc: David Gibson, Mark Cave-Ayland, qemu-ppc, qemu-devel

On Mon, 10 Dec 2018, Richard Henderson wrote:
> On 12/10/18 2:54 PM, BALATON Zoltan wrote:
>> Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64
>> should be supported but not sure what are the CPU requirements.
>
> Not quite.  I only support avx1 and later.
>
> I thought about supporting sse4 and later (that's the minimum with all of the
> instructions that do what we need), but there is only one cpu generation with
> sse4 and without avx1, and avx1 is already 8 years old.

OK that explains why I haven't seen any improvements. My CPU just predates 
AVX and happens to be in the generation you mention. But I agree this 
probably does not worth the effort. Maybe I should test on something newer 
instead.

Thank you,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access
  2018-12-10 18:25     ` Richard Henderson
@ 2018-12-11  0:23       ` David Gibson
  0 siblings, 0 replies; 33+ messages in thread
From: David Gibson @ 2018-12-11  0:23 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Mark Cave-Ayland, qemu-devel, qemu-ppc

[-- Attachment #1: Type: text/plain, Size: 1572 bytes --]

On Mon, Dec 10, 2018 at 12:25:26PM -0600, Richard Henderson wrote:
> On 12/9/18 11:17 PM, David Gibson wrote:
> > On Fri, Dec 07, 2018 at 08:56:30AM +0000, Mark Cave-Ayland wrote:
> >> These helpers allow us to move FP register values to/from the specified TCGv_i64
> >> argument.
> >>
> >> To prevent FP helpers accessing the cpu_fpr array directly, add extra TCG
> >> temporaries as required.
> > 
> > It's not obvious to me why that's a desirable thing.  I'm assuming
> > it's somehow necessary for the stuff later in the series, but I think
> > we need a brief rationale here to explain why this isn't just adding
> > extra reg copies for the sake of it.
> 
> Note that while this introduces extra opcodes, in many cases it does not change
> the number of machine instructions that are generated.  Recall that accessing
> cpu_fpr[N] implies a load from env.  This change makes the load explicit.

I realised that a bit later in looking at the series.  I think a
paraphrasing of the above in the commit message would still be
helpful.

> The change does currently prevent caching cpu_fpr[N] in a host register.  That
> can and will be fixed by optimizing on memory operations instead.  (There is a
> patch that has been outstanding for 13 months to do this.  I intend to finally
> get around to merging it during the 4.0 cycle.)
> 
> 
> r~
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-10 20:54     ` BALATON Zoltan
  2018-12-10 21:09       ` Richard Henderson
@ 2018-12-11  1:20       ` David Gibson
  2018-12-11  3:03         ` BALATON Zoltan
  2018-12-11 19:35         ` Mark Cave-Ayland
  1 sibling, 2 replies; 33+ messages in thread
From: David Gibson @ 2018-12-11  1:20 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Mark Cave-Ayland, qemu-ppc, richard.henderson, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 6251 bytes --]

On Mon, Dec 10, 2018 at 09:54:51PM +0100, BALATON Zoltan wrote:
> On Mon, 10 Dec 2018, David Gibson wrote:
> > On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote:
> > > On Fri, 7 Dec 2018, Mark Cave-Ayland wrote:
> > > > This patchset is an attempt at trying to improve the VMX (Altivec) instruction
> > > > performance by making use of the new TCG vector operations where possible.
> > > 
> > > This is very welcome, thanks for doing this.
> > > 
> > > > In order to use TCG vector operations, the registers must be accessible from cpu_env
> > > > whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
> > > > are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
> > > > registers using the supplied TCGv_i64 parameter.
> > > 
> > > Have you tried some benchmarks or tests to measure the impact of these
> > > changes? I've tried the (very unscientific) benchmarks I've written about
> > > before here:
> > > 
> > > http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html
> > > 
> > > (which seem to use AltiVec/VMX instructions but not sure which) on mac99
> > > with MorphOS and I could not see any performance increase. I haven't run
> > > enough tests but results with or without this series on master were mostly
> > > the same within a few percents, and sometimes even seen lower performance
> > > with these patches than without. I haven't tried to find out why (no time
> > > for that now) so can't really draw any conclusions from this. I'm also not
> > > sure if I've actually tested what you've changed or these use instructions
> > > that your patches don't optimise yet, or the changes I've seen were just
> > > normal changes between runs; but I wonder if the increased number of
> > > temporaries could result in lower performance in some cases?
> > 
> > What was your host machine.  IIUC this change will only improve
> > performance if the host tcg backend is able to implement TCG vector
> > ops in terms of vector ops on the host.
> 
> Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64
> should be supported but not sure what are the CPU requirements.
> 
> > In addition, this series only converts a subset of the integer and
> > logical vector instructions.  If your testcase is mostly floating
> > point (vectored or otherwise), it will still be softfloat and so not
> > see any speedup.
> 
> Yes, I don't really know what these tests use but I think "lame" test is
> mostly floating point but tried with "lame_vmx" which should at least use
> some vector ops and "mplayer -benchmark" test is more vmx dependent based on
> my previous profiling and testing with hardfloat but I'm not sure. (When
> testing these with hardfloat I've found that lame was benefiting from
> hardfloat but mplayer wasn't and more VMX related functions showed up with
> mplayer so I assumed it's more VMX bound.)

I should clarify here.  When I say "floating point" above, I'm not
meaning things using the regular FPU instead of the vector unit.  I'm
saying *anything* involving floating point calculations whether
they're done in the FPU or the vector unit.

The patches here don't convert all VMX instructions to use vector TCG
ops - they only convert a few, and those few are about using the
vector unit for integer (and logical) operations.  VMX instructions
involving floating point calculations are unaffected and will still
use soft-float.

> I've tried to do some profiling again to find out what's used but I can't
> get good results with the tools I have (oprofile stopped working since I've
> updated my machine and Linux perf provides results that are hard to
> interpret for me, haven't tried if gprof would work now it didn't before)
> but I've seen some vector related helpers in the profile so at least some
> vector ops are used. The "helper_vperm" came up top at about 11th (not sure
> where is it called from), other vector helpers were lower.
> 
> I don't remember details now but previously when testing hardfloat I've
> written this: "I've looked at vperm which came out top in one of the
> profiles I've taken and on little endian hosts it has the loop backwards and
> also accesses vector elements from end to front which I wonder may be enough
> for the compiler to not be able to optimise it? But I haven't checked
> assembly. The altivec dependent mplayer video decoding test did not change
> much with hardfloat, it took 98% compared to master so likely altivec is
> dominating here." (Although this was with the PPC specific vector helpers
> before VMX patch so not sure if this is still relevant.)
> 
> The top 10 in profile were still related to low level memory access and MMU
> management stuff as I've found before:
> 
> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03609.html
> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03704.html
> 
> I think implementing i2c for mac99 may help this and some other
> optimisations may also be possible but I don't know enough about these to
> try that.
> 
> It also looks like with --enable-debug something is always flusing tlb and
> blowing away tb caches so these will be top in profile and likely dominate
> runtime so can't really use profile to measure impact of VMX patch. Without
> --enable-debug I can't get call graphs so can't get useful profile. I think
> I've looked at this before as well but can't remember now which check
> enabled by --enable-debug is responsible for constant tb cache flush and if
> that could be avoided. I just don't use --enable-debug since unless need to
> debug somthing.
> 
> Maybe the PPC softmmu should be reviewed and optimised by someone who knows
> it...

I'm not sure there is anyone who knows it at this point.  I probably
know it as well as anybody, and the ppc32 code scares me.  It's a
crufty mess and it would be nice to clean up, but that requires
someone with enough time and interest.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-11  1:20       ` David Gibson
@ 2018-12-11  3:03         ` BALATON Zoltan
  2018-12-11 19:35         ` Mark Cave-Ayland
  1 sibling, 0 replies; 33+ messages in thread
From: BALATON Zoltan @ 2018-12-11  3:03 UTC (permalink / raw)
  To: David Gibson; +Cc: Mark Cave-Ayland, qemu-ppc, richard.henderson, qemu-devel

On Tue, 11 Dec 2018, David Gibson wrote:
> On Mon, Dec 10, 2018 at 09:54:51PM +0100, BALATON Zoltan wrote:
>> Yes, I don't really know what these tests use but I think "lame" test is
>> mostly floating point but tried with "lame_vmx" which should at least use
>> some vector ops and "mplayer -benchmark" test is more vmx dependent based on
>> my previous profiling and testing with hardfloat but I'm not sure. (When
>> testing these with hardfloat I've found that lame was benefiting from
>> hardfloat but mplayer wasn't and more VMX related functions showed up with
>> mplayer so I assumed it's more VMX bound.)
>
> I should clarify here.  When I say "floating point" above, I'm not
> meaning things using the regular FPU instead of the vector unit.  I'm
> saying *anything* involving floating point calculations whether
> they're done in the FPU or the vector unit.

OK that clarifies it. I admit I was only testing these but didn't have 
time to look what changed exactly.

> The patches here don't convert all VMX instructions to use vector TCG
> ops - they only convert a few, and those few are about using the
> vector unit for integer (and logical) operations.  VMX instructions
> involving floating point calculations are unaffected and will still
> use soft-float.

What I've said above about lame test being more FPU and mplayer more VMX 
intensive probably still holds as I've retried now on a Haswell i5 and got 
1-2% difference with lame_vmx and ~6% with mplayer. That's very little 
improvement but if only some VMX instructions should be faster then this 
may make sense.

These tests are not the best, maybe there are better ways to measure this 
but I don't know of any,

>> Maybe the PPC softmmu should be reviewed and optimised by someone who knows
>> it...
>
> I'm not sure there is anyone who knows it at this point.  I probably
> know it as well as anybody, and the ppc32 code scares me.  It's a
> crufty mess and it would be nice to clean up, but that requires
> someone with enough time and interest.

At least this seems to be a big bottleneck in PPC emulation and one that's 
not being worked on (others like hardfloat and VMX while not finished and 
still lot to do but already there are some results but no one is looking 
at softmmu). I was just trying to direct some attention to that softmmu 
may also need some optimisation and hope someone would notice this. I have 
some interest but not much time these days and if it scares you what 
should I say. I don't even understand most of it so it would take a lot of
time to even get how it works and what would need to be done. So I hope 
someone with more time or knowledge shows up and maybe at least provides 
some hints on what may need to be done.

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access
  2018-12-10  5:17   ` David Gibson
  2018-12-10 18:25     ` Richard Henderson
@ 2018-12-11 19:06     ` Mark Cave-Ayland
  1 sibling, 0 replies; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-11 19:06 UTC (permalink / raw)
  To: David Gibson; +Cc: richard.henderson, qemu-ppc, qemu-devel

On 10/12/2018 05:17, David Gibson wrote:

> On Fri, Dec 07, 2018 at 08:56:30AM +0000, Mark Cave-Ayland wrote:
>> These helpers allow us to move FP register values to/from the specified TCGv_i64
>> argument.
>>
>> To prevent FP helpers accessing the cpu_fpr array directly, add extra TCG
>> temporaries as required.
> 
> It's not obvious to me why that's a desirable thing.  I'm assuming
> it's somehow necessary for the stuff later in the series, but I think
> we need a brief rationale here to explain why this isn't just adding
> extra reg copies for the sake of it.

Ah okay. It's because the VSX registers extend the set of VMX and FPR registers (see
the cpu_vsrh() and cpu_vsrl() helpers at the top of
target/ppc/translate/vsx-impl.inc.c). So in order to remove the static TCGv_i64
arrays then these helpers must also be converted.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-10 13:04 ` [Qemu-devel] " Aleksandar Markovic
@ 2018-12-11 19:11   ` Mark Cave-Ayland
  0 siblings, 0 replies; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-11 19:11 UTC (permalink / raw)
  To: Aleksandar Markovic; +Cc: richard.henderson, qemu-ppc, qemu-devel, david

On 10/12/2018 13:04, Aleksandar Markovic wrote:

> On Dec 7, 2018 9:59 AM, "Mark Cave-Ayland" <mark.cave-ayland@ilande.co.uk>
> wrote:
>>
>> This patchset is an attempt at trying to improve the VMX (Altivec)
> instruction
>> performance by making use of the new TCG vector operations where possible.
>>
> 
> Hello, Mark.
> 
> I just want to say that I support these efforts. Very interesting, it can
> bring significant improvements for multimedia intensive emulations. But,
> even more important, we can see new tcg vector interface in action,
> possibly suggest improvements, extensions.
> 
> I would just like to ask you to add some performance comparisons, if
> possible. Very simple tests would be sufficient.
> 
> Thanks again for this series!

Thanks Aleksander! In my local tests I haven't done much in the way of benchmarking,
other than to check the output of the disassembler to confirm that the x86 vector
opcodes are being used. Given that only a small number of TCG vector operations are
currently suitable for use with the VMX code, then I would expect certain artificial
benchmarks to show improvements e.g. with the logical ops but there was certainly
nothing noticeable in normal usage with my OpenBIOS test images.

I agree with you that having a second implementation other than ARM will help provide
ideas as to how the vector interfaces can be evolved to support more operations in
future.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access
  2018-12-10 18:43   ` Richard Henderson
@ 2018-12-11 19:15     ` Mark Cave-Ayland
  0 siblings, 0 replies; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-11 19:15 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-ppc, david

On 10/12/2018 18:43, Richard Henderson wrote:

> On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
>> -    gen_helper_f##op(cpu_fpr[rD(ctx->opcode)], cpu_env,                       \
>> -                     cpu_fpr[rA(ctx->opcode)],                                \
>> -                     cpu_fpr[rC(ctx->opcode)], cpu_fpr[rB(ctx->opcode)]);     \
>> +    get_fpr(t0, rA(ctx->opcode));                                             \
>> +    get_fpr(t1, rC(ctx->opcode));                                             \
>> +    get_fpr(t2, rB(ctx->opcode));                                             \
>> +    gen_helper_f##op(t3, cpu_env, t0, t1, t2);                                \
>> +    set_fpr(rD(ctx->opcode), t3);                                             \
>>      if (isfloat) {                                                            \
>> -        gen_helper_frsp(cpu_fpr[rD(ctx->opcode)], cpu_env,                    \
>> -                        cpu_fpr[rD(ctx->opcode)]);                            \
>> +        get_fpr(t0, rD(ctx->opcode));                                         \
>> +        gen_helper_frsp(t3, cpu_env, t0);                                     \
>> +        set_fpr(rD(ctx->opcode), t3);                                         \
>>      }                                                                         \
> 
> This is an accurate conversion, but the writeback to the rD register need not
> happen until after helper_frsp.  Just move it below the isfloat block.

Okay - I'll fix this up in the next iteration.

> I do see that helper_frsp can raise an exception for invalid_op for SNaN.  If
> that code were actually reachable, this would have been an existing bug, in
> that we should not have written back to rD after the first operation.  However,
> any SNaN will already have been eliminated by the first operation (via
> squashing to QNaN or by exiting via exception).
> 
> Similarly in GEN_FLOAT_AB.

Noted.

>> +    get_fpr(t0, rB(ctx->opcode));
>> +    gen_helper_frsqrte(t1, cpu_env, t0);
>> +    set_fpr(rD(ctx->opcode), t1);
>> +    gen_helper_frsp(t1, cpu_env, t1);
>> +    gen_compute_fprf_float64(t1);
> 
> gen_frsqrtes has the set_fpr in the wrong place.  Likewise gen_fsqrts.

Ooops. I remember having a think a bit harder about these two functions, so I'll go
and take another look.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/6] target/ppc: introduce get_avr64() and set_avr64() helpers for VMX register access
  2018-12-10 18:49   ` Richard Henderson
@ 2018-12-11 19:16     ` Mark Cave-Ayland
  0 siblings, 0 replies; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-11 19:16 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-ppc, david

On 10/12/2018 18:49, Richard Henderson wrote:

> On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
>> +    avr = tcg_temp_new_i64();                                                 \
>>      EA = tcg_temp_new();                                                      \
>>      gen_addr_reg_index(ctx, EA);                                              \
>>      tcg_gen_andi_tl(EA, EA, ~0xf);                                            \
>>      /* We only need to swap high and low halves. gen_qemu_ld64_i64 does       \
>>         necessary 64-bit byteswap already. */                                  \
>>      if (ctx->le_mode) {                                                       \
>> -        gen_qemu_ld64_i64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                \
>> +        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
>> +        set_avr64(rD(ctx->opcode), avr, false);                               \
>>          tcg_gen_addi_tl(EA, EA, 8);                                           \
>> -        gen_qemu_ld64_i64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                \
>> +        gen_qemu_ld64_i64(ctx, avr, EA);                                      \
>> +        set_avr64(rD(ctx->opcode), avr, true);  
> 
> An accurate conversion, but I'm going to call this an existing bug:
> 
> The writeback to both avr{l,h} should be delayed until all exceptions have been
> raised.  Thus you should perform the two gen_qemu_ld64_i64 into two temporaries
> and only then write them both back via set_avr64.

Thanks for the pointer. I'll add this to the list of changes for the next revision of
the patchset.

> Otherwise,
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


ATB,

Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env
  2018-12-10 19:05   ` Richard Henderson
@ 2018-12-11 19:21     ` Mark Cave-Ayland
  2018-12-11 21:24       ` Richard Henderson
  0 siblings, 1 reply; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-11 19:21 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-ppc, david

On 10/12/2018 19:05, Richard Henderson wrote:

> On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
>> Instead of accessing the FPR, VMX and VSX registers through static arrays of
>> TCGv_i64 globals, remove them and change the helpers to load/store data directly
>> within cpu_env.
>>
>> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
>> ---
>>  target/ppc/translate.c | 59 ++++++++++++++------------------------------------
>>  1 file changed, 16 insertions(+), 43 deletions(-)
> 
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
> 
> Note however, that there are other steps that you must add here before using
> vector operations in the next patch:
> 
> (1a) The fpr and vsr arrays must be merged, since fpr[n] == vsrh[n].
>      If this isn't done, then you simply cannot apply one operation
>      to two disjoint memory blocks.
> 
> (1b) The vsr and avr arrays should be merged, since vsr[32+n] == avr[n].
>      This is simply tidiness, matching the layout to the architecture.
> 
> These steps will modify gdbstub.c, machine.c, and linux-user/.

The reason I didn't touch the VSR arrays was because I was hoping that this could be
done as a follow up later; my thought was that since I'd only introduced vector
operations into the VMX instructions then currently no vector operations could be
done across the 2 separate memory blocks?

> (2) The vsr array needs to be QEMU_ALIGN(16).  See target/arm/cpu.h.
>     We assert that the host addresses are 16 byte aligned, so that we
>     can eventually use Altivec/VSX in tcg/ppc/.

That's a good observation. Presumably being on Intel the unaligned accesses would
still work but just be slower? I've certainly seen the new vector ops being emitted
in the generated code.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 3/6] target/ppc: introduce get_cpu_vsr{l, h}() and set_cpu_vsr{l, h}() helpers for VSR register access
  2018-12-10 19:16   ` Richard Henderson
@ 2018-12-11 19:24     ` Mark Cave-Ayland
  0 siblings, 0 replies; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-11 19:24 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-ppc, david

On 10/12/2018 19:16, Richard Henderson wrote:

> On 12/7/18 2:56 AM, Mark Cave-Ayland wrote:
>> +static inline void get_vsr(TCGv_i64 dst, int n)
>> +{
>> +    tcg_gen_ld_i64(dst, cpu_env, offsetof(CPUPPCState, vsr[n]));
>> +}
>> +
>> +static inline void set_vsr(int n, TCGv_i64 src)
>> +{
>> +    tcg_gen_st_i64(src, cpu_env, offsetof(CPUPPCState, vsr[n]));
>> +}
> 
> Why isn't this helper still using cpu_vsr[n]?

Gah this is just a mistake from squashing patches from my working branch to tidy them
up for submission. I'll fix this in the next iteration.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-11  1:20       ` David Gibson
  2018-12-11  3:03         ` BALATON Zoltan
@ 2018-12-11 19:35         ` Mark Cave-Ayland
  2018-12-11 21:32           ` Richard Henderson
  1 sibling, 1 reply; 33+ messages in thread
From: Mark Cave-Ayland @ 2018-12-11 19:35 UTC (permalink / raw)
  To: David Gibson, BALATON Zoltan; +Cc: qemu-ppc, richard.henderson, qemu-devel

On 11/12/2018 01:20, David Gibson wrote:

> On Mon, Dec 10, 2018 at 09:54:51PM +0100, BALATON Zoltan wrote:
>> On Mon, 10 Dec 2018, David Gibson wrote:
>>> On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote:
>>>> On Fri, 7 Dec 2018, Mark Cave-Ayland wrote:
>>>>> This patchset is an attempt at trying to improve the VMX (Altivec) instruction
>>>>> performance by making use of the new TCG vector operations where possible.
>>>>
>>>> This is very welcome, thanks for doing this.
>>>>
>>>>> In order to use TCG vector operations, the registers must be accessible from cpu_env
>>>>> whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
>>>>> are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
>>>>> registers using the supplied TCGv_i64 parameter.
>>>>
>>>> Have you tried some benchmarks or tests to measure the impact of these
>>>> changes? I've tried the (very unscientific) benchmarks I've written about
>>>> before here:
>>>>
>>>> http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html
>>>>
>>>> (which seem to use AltiVec/VMX instructions but not sure which) on mac99
>>>> with MorphOS and I could not see any performance increase. I haven't run
>>>> enough tests but results with or without this series on master were mostly
>>>> the same within a few percents, and sometimes even seen lower performance
>>>> with these patches than without. I haven't tried to find out why (no time
>>>> for that now) so can't really draw any conclusions from this. I'm also not
>>>> sure if I've actually tested what you've changed or these use instructions
>>>> that your patches don't optimise yet, or the changes I've seen were just
>>>> normal changes between runs; but I wonder if the increased number of
>>>> temporaries could result in lower performance in some cases?
>>>
>>> What was your host machine.  IIUC this change will only improve
>>> performance if the host tcg backend is able to implement TCG vector
>>> ops in terms of vector ops on the host.
>>
>> Tried it on i5 650 which has: sse sse2 ssse3 sse4_1 sse4_2. I assume x86_64
>> should be supported but not sure what are the CPU requirements.
>>
>>> In addition, this series only converts a subset of the integer and
>>> logical vector instructions.  If your testcase is mostly floating
>>> point (vectored or otherwise), it will still be softfloat and so not
>>> see any speedup.
>>
>> Yes, I don't really know what these tests use but I think "lame" test is
>> mostly floating point but tried with "lame_vmx" which should at least use
>> some vector ops and "mplayer -benchmark" test is more vmx dependent based on
>> my previous profiling and testing with hardfloat but I'm not sure. (When
>> testing these with hardfloat I've found that lame was benefiting from
>> hardfloat but mplayer wasn't and more VMX related functions showed up with
>> mplayer so I assumed it's more VMX bound.)
> 
> I should clarify here.  When I say "floating point" above, I'm not
> meaning things using the regular FPU instead of the vector unit.  I'm
> saying *anything* involving floating point calculations whether
> they're done in the FPU or the vector unit.
> 
> The patches here don't convert all VMX instructions to use vector TCG
> ops - they only convert a few, and those few are about using the
> vector unit for integer (and logical) operations.  VMX instructions
> involving floating point calculations are unaffected and will still
> use soft-float.

Right. As I mentioned in an earlier email, this is hopefully laying the groundwork
for future evolution of the TCG vector operations and making use of the existing
functions first.

Certainly I'd be interested at looking at the hardfloat patches after this, since FP
performance is something that can offer enormous benefit to MacOS emulation.

>> I've tried to do some profiling again to find out what's used but I can't
>> get good results with the tools I have (oprofile stopped working since I've
>> updated my machine and Linux perf provides results that are hard to
>> interpret for me, haven't tried if gprof would work now it didn't before)
>> but I've seen some vector related helpers in the profile so at least some
>> vector ops are used. The "helper_vperm" came up top at about 11th (not sure
>> where is it called from), other vector helpers were lower.
>>
>> I don't remember details now but previously when testing hardfloat I've
>> written this: "I've looked at vperm which came out top in one of the
>> profiles I've taken and on little endian hosts it has the loop backwards and
>> also accesses vector elements from end to front which I wonder may be enough
>> for the compiler to not be able to optimise it? But I haven't checked
>> assembly. The altivec dependent mplayer video decoding test did not change
>> much with hardfloat, it took 98% compared to master so likely altivec is
>> dominating here." (Although this was with the PPC specific vector helpers
>> before VMX patch so not sure if this is still relevant.)
>>
>> The top 10 in profile were still related to low level memory access and MMU
>> management stuff as I've found before:
>>
>> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03609.html
>> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03704.html
>>
>> I think implementing i2c for mac99 may help this and some other
>> optimisations may also be possible but I don't know enough about these to
>> try that.
>>
>> It also looks like with --enable-debug something is always flusing tlb and
>> blowing away tb caches so these will be top in profile and likely dominate
>> runtime so can't really use profile to measure impact of VMX patch. Without
>> --enable-debug I can't get call graphs so can't get useful profile. I think
>> I've looked at this before as well but can't remember now which check
>> enabled by --enable-debug is responsible for constant tb cache flush and if
>> that could be avoided. I just don't use --enable-debug since unless need to
>> debug somthing.
>>
>> Maybe the PPC softmmu should be reviewed and optimised by someone who knows
>> it...
> 
> I'm not sure there is anyone who knows it at this point.  I probably
> know it as well as anybody, and the ppc32 code scares me.  It's a
> crufty mess and it would be nice to clean up, but that requires
> someone with enough time and interest.

As a newcomer to this code, it's not particularly easy to read. It strikes me that
one of the main improvements would be to switch over to using generated helper
templates rather than to make use of nested macros as in use currently.

Looking at your profiles above, the primary hotspot appears to be
helper_lookup_tb_ptr(). However as someone quite new to the TCG parts of QEMU, I
couldn't tell you whether or not this is to be expected.

Perhaps a question for Richard: what could we consider to be a "normal" backend when
looking at profiles in terms of recommended features to implement, and to get an idea
of what a typical profile should look like?

It would be interesting to compare with a similar workload profile on e.g. ARM to see
if there is anything obvious that stands out for PPC.


ATB,

Mark.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env
  2018-12-11 19:21     ` Mark Cave-Ayland
@ 2018-12-11 21:24       ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2018-12-11 21:24 UTC (permalink / raw)
  To: Mark Cave-Ayland, qemu-devel, qemu-ppc, david

On 12/11/18 1:21 PM, Mark Cave-Ayland wrote:
>> Note however, that there are other steps that you must add here before using
>> vector operations in the next patch:
>>
>> (1a) The fpr and vsr arrays must be merged, since fpr[n] == vsrh[n].
>>      If this isn't done, then you simply cannot apply one operation
>>      to two disjoint memory blocks.
>>
>> (1b) The vsr and avr arrays should be merged, since vsr[32+n] == avr[n].
>>      This is simply tidiness, matching the layout to the architecture.
>>
>> These steps will modify gdbstub.c, machine.c, and linux-user/.
> 
> The reason I didn't touch the VSR arrays was because I was hoping that this could be
> done as a follow up later; my thought was that since I'd only introduced vector
> operations into the VMX instructions then currently no vector operations could be
> done across the 2 separate memory blocks?

True, until you convert the VSX insns you can delay this.
Though honestly I would consider doing both at once.

>> (2) The vsr array needs to be QEMU_ALIGN(16).  See target/arm/cpu.h.
>>     We assert that the host addresses are 16 byte aligned, so that we
>>     can eventually use Altivec/VSX in tcg/ppc/.
> 
> That's a good observation. Presumably being on Intel the unaligned accesses would
> still work but just be slower? I've certainly seen the new vector ops being emitted
> in the generated code.

Yes, currently I generate unaligned loads.  It made sense when considering AVX2
and ARM SVE, since I do not increase the alignment requirements to 32-bytes
when using 256-bit vectors.

I do wonder if I should go back and generate aligned loads, just to raise
SIGBUS when one has forgotten the QEMU_ALIGN marker, as a portability aid.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
  2018-12-11 19:35         ` Mark Cave-Ayland
@ 2018-12-11 21:32           ` Richard Henderson
  0 siblings, 0 replies; 33+ messages in thread
From: Richard Henderson @ 2018-12-11 21:32 UTC (permalink / raw)
  To: Mark Cave-Ayland, David Gibson, BALATON Zoltan; +Cc: qemu-ppc, qemu-devel

On 12/11/18 1:35 PM, Mark Cave-Ayland wrote:
> Looking at your profiles above, the primary hotspot appears to be
> helper_lookup_tb_ptr(). However as someone quite new to the TCG parts of QEMU, I
> couldn't tell you whether or not this is to be expected.
> 
> Perhaps a question for Richard: what could we consider to be a "normal" backend when
> looking at profiles in terms of recommended features to implement, and to get an idea
> of what a typical profile should look like?
> 
> It would be interesting to compare with a similar workload profile on e.g. ARM to see
> if there is anything obvious that stands out for PPC.

For Alpha, which has relatively sane tlb management, the top entry in the
profile is helper_lookup_tb_ptr at about 8%.  Otherwise, somewhere in the top 2
or 3 functions will be helper_le_ldq_mmu at about 2%.

That's probably a best-case scenario.


r~

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2018-12-11 21:32 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-07  8:56 [Qemu-devel] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations Mark Cave-Ayland
2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 1/6] target/ppc: introduce get_fpr() and set_fpr() helpers for FP register access Mark Cave-Ayland
2018-12-10  5:17   ` David Gibson
2018-12-10 18:25     ` Richard Henderson
2018-12-11  0:23       ` David Gibson
2018-12-11 19:06     ` Mark Cave-Ayland
2018-12-10 18:43   ` Richard Henderson
2018-12-11 19:15     ` Mark Cave-Ayland
2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 2/6] target/ppc: introduce get_avr64() and set_avr64() helpers for VMX " Mark Cave-Ayland
2018-12-10 18:49   ` Richard Henderson
2018-12-11 19:16     ` Mark Cave-Ayland
2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 3/6] target/ppc: introduce get_cpu_vsr{l, h}() and set_cpu_vsr{l, h}() helpers for VSR " Mark Cave-Ayland
2018-12-10 19:16   ` Richard Henderson
2018-12-11 19:24     ` Mark Cave-Ayland
2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 4/6] target/ppc: switch FPR, VMX and VSX helpers to access data directly from cpu_env Mark Cave-Ayland
2018-12-10 19:05   ` Richard Henderson
2018-12-11 19:21     ` Mark Cave-Ayland
2018-12-11 21:24       ` Richard Henderson
2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 5/6] target/ppc: convert VMX logical instructions to use vector operations Mark Cave-Ayland
2018-12-10 19:08   ` Richard Henderson
2018-12-07  8:56 ` [Qemu-devel] [RFC PATCH 6/6] target/ppc: convert vaddu[b, h, w, d] and vsubu[b, h, w, d] over " Mark Cave-Ayland
2018-12-10 19:09   ` Richard Henderson
2018-12-10  0:33 ` [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG " BALATON Zoltan
2018-12-10  2:59   ` David Gibson
2018-12-10 20:54     ` BALATON Zoltan
2018-12-10 21:09       ` Richard Henderson
2018-12-10 23:01         ` BALATON Zoltan
2018-12-11  1:20       ` David Gibson
2018-12-11  3:03         ` BALATON Zoltan
2018-12-11 19:35         ` Mark Cave-Ayland
2018-12-11 21:32           ` Richard Henderson
2018-12-10 13:04 ` [Qemu-devel] " Aleksandar Markovic
2018-12-11 19:11   ` Mark Cave-Ayland

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.