qemu-devel.nongnu.org archive mirror
* [RFC PATCH 0/7] VSX MMA Implementation
@ 2022-04-26 12:50 Lucas Mateus Castro(alqotel)
  2022-04-26 12:50 ` [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz Lucas Mateus Castro(alqotel)
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, qemu-devel,
	Greg Kurz, Lucas Mateus Castro (alqotel),
	Alex Bennée, David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

This patch series is an RFC implementing the Matrix-Multiply Assist (MMA)
instructions from PowerISA 3.1.

These and the VDIV/VMOD implementation are the last new PowerISA 3.1
instructions left to be implemented.

Thanks
Lucas Mateus Castro (alqotel) (7):
  target/ppc: Implement xxm[tf]acc and xxsetaccz
  target/ppc: Implemented xvi*ger* instructions
  target/ppc: Implemented pmxvi*ger* instructions
  target/ppc: Implemented xvf*ger*
  target/ppc: Implemented xvf16ger*
  target/ppc: Implemented pmxvf*ger*
  target/ppc: Implemented [pm]xvbf16ger2*

 include/fpu/softfloat.h             |   9 ++
 target/ppc/cpu.h                    |  15 +++
 target/ppc/fpu_helper.c             | 130 ++++++++++++++++++
 target/ppc/helper.h                 |   7 +
 target/ppc/insn32.decode            |  49 +++++++
 target/ppc/insn64.decode            |  80 +++++++++++
 target/ppc/int_helper.c             |  85 ++++++++++++
 target/ppc/internal.h               |  28 ++++
 target/ppc/translate/vsx-impl.c.inc | 200 ++++++++++++++++++++++++++++
 9 files changed, 603 insertions(+)

-- 
2.31.1




* [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
@ 2022-04-26 12:50 ` Lucas Mateus Castro(alqotel)
  2022-04-26 22:59   ` Richard Henderson
  2022-04-26 12:50 ` [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions Lucas Mateus Castro(alqotel)
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, Greg Kurz,
	open list:All patches CC here, Lucas Mateus Castro (alqotel),
	Cédric Le Goater, David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Implement the following PowerISA v3.1 instructions:
xxmfacc: VSX Move From Accumulator
xxmtacc: VSX Move To Accumulator
xxsetaccz: VSX Set Accumulator to Zero

The PowerISA 3.1 mentions that for the current version of the
architecture, "the hardware implementation provides the effect of ACC[i]
and VSRs 4*i to 4*i + 3 logically containing the same data" and "The
Accumulators introduce no new logical state at this time" (page 501).
For now it seems unnecessary to create new structures, so this patch
just uses ACC[i] as VSRs 4*i to 4*i+3, and therefore moves to and from
accumulators are no-ops.

Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
---
 target/ppc/insn32.decode            |  9 ++++++++
 target/ppc/translate/vsx-impl.c.inc | 36 +++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)
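
Not part of the commit: a minimal standalone sketch of the aliasing rule
quoted in the commit message, assuming only that ACC[i] overlays VSRs
4*i to 4*i+3. The names `acc_vsr_index`, `ACC_COUNT` and `ACC_ROWS` are
hypothetical and do not appear in the patch.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum { ACC_COUNT = 8, ACC_ROWS = 4 };

/* VSR number backing row `row` (0..3) of accumulator `acc` (0..7). */
static inline int acc_vsr_index(int acc, int row)
{
    assert(acc >= 0 && acc < ACC_COUNT);
    assert(row >= 0 && row < ACC_ROWS);
    return acc * ACC_ROWS + row;
}
```

Under this mapping, zeroing ACC[2] means zeroing VSRs 8 to 11, which is
what the loop in trans_XXSETACCZ does with `a->ra * 4 + i`.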

diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index 39372fe673..7a76bedfa6 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -151,6 +151,9 @@
 &X_vrt_frbp     vrt frbp
 @X_vrt_frbp     ...... vrt:5 ..... ....0 .......... .           &X_vrt_frbp frbp=%x_frbp
 
+&X_a            ra
+@X_a            ...... ra:3 .. ..... ..... .......... .         &X_a
+
 %xx_xt          0:1 21:5
 %xx_xb          1:1 11:5
 %xx_xa          2:1 16:5
@@ -710,3 +713,9 @@ XVTLSBB         111100 ... -- 00010 ..... 111011011 . - @XX2_bf_xb
 &XL_s           s:uint8_t
 @XL_s           ......-------------- s:1 .......... -   &XL_s
 RFEBB           010011-------------- .   0010010010 -   @XL_s
+
+## Accumulator Instructions
+
+XXMFACC         011111 ... -- 00000 ----- 0010110001 -   @X_a
+XXMTACC         011111 ... -- 00001 ----- 0010110001 -   @X_a
+XXSETACCZ       011111 ... -- 00011 ----- 0010110001 -   @X_a
diff --git a/target/ppc/translate/vsx-impl.c.inc b/target/ppc/translate/vsx-impl.c.inc
index 3692740736..919b889c40 100644
--- a/target/ppc/translate/vsx-impl.c.inc
+++ b/target/ppc/translate/vsx-impl.c.inc
@@ -2787,6 +2787,42 @@ static bool trans_XVCVBF16SPN(DisasContext *ctx, arg_XX2 *a)
     return true;
 }
 
+/*
+ * The PowerISA 3.1 mentions that for the current version of the
+ * architecture, "the hardware implementation provides the effect of
+ * ACC[i] and VSRs 4*i to 4*i + 3 logically containing the same data"
+ * and "The Accumulators introduce no new logical state at this time"
+ * (page 501). For now it seems unnecessary to create new structures,
+ * so ACC[i] is implemented as VSRs 4*i to 4*i+3 and moves to and from
+ * accumulators are no-ops.
+ */
+static bool trans_XXMFACC(DisasContext *ctx, arg_X_a *a)
+{
+    REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+    REQUIRE_VSX(ctx);
+    return true;
+}
+
+static bool trans_XXMTACC(DisasContext *ctx, arg_X_a *a)
+{
+    REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+    REQUIRE_VSX(ctx);
+    return true;
+}
+
+static bool trans_XXSETACCZ(DisasContext *ctx, arg_X_a *a)
+{
+    REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+    REQUIRE_VSX(ctx);
+    int i;
+    TCGv_i64 zero = tcg_constant_i64(0);
+    for (i = 0; i < 4; i++) {
+        set_cpu_vsr(a->ra * 4 + i, zero, false);
+        set_cpu_vsr(a->ra * 4 + i, zero, true);
+    }
+    return true;
+}
+
 #undef GEN_XX2FORM
 #undef GEN_XX3FORM
 #undef GEN_XX2IFORM
-- 
2.31.1




* [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
  2022-04-26 12:50 ` [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz Lucas Mateus Castro(alqotel)
@ 2022-04-26 12:50 ` Lucas Mateus Castro(alqotel)
  2022-04-26 23:40   ` Richard Henderson
  2022-04-26 12:50 ` [RFC PATCH 3/7] target/ppc: Implemented pmxvi*ger* instructions Lucas Mateus Castro(alqotel)
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, Greg Kurz,
	open list:All patches CC here, Lucas Mateus Castro (alqotel),
	Cédric Le Goater, David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Implement the following PowerISA v3.1 instructions:
xvi4ger8:     VSX Vector 4-bit Signed Integer GER (rank-8 update)
xvi4ger8pp:   VSX Vector 4-bit Signed Integer GER (rank-8 update)
Positive multiply, Positive accumulate
xvi8ger4:     VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update)
xvi8ger4pp:   VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update)
Positive multiply, Positive accumulate
xvi8ger4spp:  VSX Vector 8-bit Signed/Unsigned Integer GER (rank-4 update)
with Saturate Positive multiply, Positive accumulate
xvi16ger2:    VSX Vector 16-bit Signed Integer GER (rank-2 update)
xvi16ger2pp:  VSX Vector 16-bit Signed Integer GER (rank-2 update)
Positive multiply, Positive accumulate
xvi16ger2s:   VSX Vector 16-bit Signed Integer GER (rank-2 update)
with Saturation
xvi16ger2spp: VSX Vector 16-bit Signed Integer GER (rank-2 update)
with Saturation Positive multiply, Positive accumulate

Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
---
 target/ppc/cpu.h                    |  5 ++
 target/ppc/helper.h                 |  3 +
 target/ppc/insn32.decode            | 15 +++++
 target/ppc/int_helper.c             | 85 +++++++++++++++++++++++++++++
 target/ppc/internal.h               | 28 ++++++++++
 target/ppc/translate/vsx-impl.c.inc | 50 +++++++++++++++++
 6 files changed, 186 insertions(+)
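
Not part of the commit: a minimal standalone sketch of how one accumulator
element is finished once the partial sum of products is known, assuming the
flag layout documented in the patch (bit 0 accumulate, bit 1 saturate). The
names `ger_finish_element`, `GER_FLAG_ACC` and `GER_FLAG_SAT` are
hypothetical; the helper in the patch does this inline and reports
saturation through set_vscr_sat() instead of a bool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names mirroring the packed flags in the patch comment. */
enum { GER_FLAG_ACC = 0x1, GER_FLAG_SAT = 0x2 };

/*
 * Combine the 64-bit partial sum with the previous 32-bit accumulator
 * element. *saturated is set when the result was clamped and is never
 * cleared here, so it behaves like a sticky status bit.
 */
static int32_t ger_finish_element(int64_t psum, int32_t old,
                                  uint32_t flags, bool *saturated)
{
    assert(saturated);
    if (flags & GER_FLAG_ACC) {
        psum += old;
    }
    if (flags & GER_FLAG_SAT) {
        if (psum > INT32_MAX) {
            *saturated = true;
            return INT32_MAX;
        }
        if (psum < INT32_MIN) {
            *saturated = true;
            return INT32_MIN;
        }
    }
    /* Without saturation the value is simply truncated to 32 bits. */
    return (int32_t)psum;
}
```

Keeping the partial sum in 64 bits is what makes the range check possible:
the comparison against INT32_MAX and INT32_MIN happens before truncation.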

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index c2b6c987c0..ee55c6cfa2 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -2688,6 +2688,11 @@ static inline uint64_t *cpu_vsrl_ptr(CPUPPCState *env, int i)
     return (uint64_t *)((uintptr_t)env + vsr64_offset(i, false));
 }
 
+static inline ppc_vsr_t *cpu_vsr_ptr(CPUPPCState *env, int i)
+{
+    return (ppc_vsr_t *)((uintptr_t)env + vsr_full_offset(i));
+}
+
 static inline long avr64_offset(int i, bool high)
 {
     return vsr64_offset(i + 32, high);
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index aa6773c4a5..06553517de 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -537,6 +537,9 @@ DEF_HELPER_5(XXBLENDVB, void, vsr, vsr, vsr, vsr, i32)
 DEF_HELPER_5(XXBLENDVH, void, vsr, vsr, vsr, vsr, i32)
 DEF_HELPER_5(XXBLENDVW, void, vsr, vsr, vsr, vsr, i32)
 DEF_HELPER_5(XXBLENDVD, void, vsr, vsr, vsr, vsr, i32)
+DEF_HELPER_6(XVI4GER8, void, env, i32, i32, i32, i32, i32)
+DEF_HELPER_6(XVI8GER4, void, env, i32, i32, i32, i32, i32)
+DEF_HELPER_6(XVI16GER2, void, env, i32, i32, i32, i32, i32)
 
 DEF_HELPER_2(efscfsi, i32, env, i32)
 DEF_HELPER_2(efscfui, i32, env, i32)
diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index 7a76bedfa6..653f50db93 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -170,6 +170,9 @@
 &XX3            xt xa xb
 @XX3            ...... ..... ..... ..... ........ ...           &XX3 xt=%xx_xt xa=%xx_xa xb=%xx_xb
 
+%xx_at          23:3 !function=times_4
+@XX3_at         ...... ... .. ..... ..... ........ ...          &XX3 xt=%xx_at xb=%xx_xb
+
 &XX3_dm         xt xa xb dm
 @XX3_dm         ...... ..... ..... ..... . dm:2 ..... ...       &XX3_dm xt=%xx_xt xa=%xx_xa xb=%xx_xb
 
@@ -719,3 +722,15 @@ RFEBB           010011-------------- .   0010010010 -   @XL_s
 XXMFACC         011111 ... -- 00000 ----- 0010110001 -   @X_a
 XXMTACC         011111 ... -- 00001 ----- 0010110001 -   @X_a
 XXSETACCZ       011111 ... -- 00011 ----- 0010110001 -   @X_a
+
+## Vector GER instruction
+
+XVI4GER8        111011 ... -- ..... ..... 00100011 ..-  @XX3_at xa=%xx_xa
+XVI4GER8PP      111011 ... -- ..... ..... 00100010 ..-  @XX3_at xa=%xx_xa
+XVI8GER4        111011 ... -- ..... ..... 00000011 ..-  @XX3_at xa=%xx_xa
+XVI8GER4PP      111011 ... -- ..... ..... 00000010 ..-  @XX3_at xa=%xx_xa
+XVI16GER2       111011 ... -- ..... ..... 01001011 ..-  @XX3_at xa=%xx_xa
+XVI16GER2PP     111011 ... -- ..... ..... 01101011 ..-  @XX3_at xa=%xx_xa
+XVI8GER4SPP     111011 ... -- ..... ..... 01100011 ..-  @XX3_at xa=%xx_xa
+XVI16GER2S      111011 ... -- ..... ..... 00101011 ..-  @XX3_at xa=%xx_xa
+XVI16GER2SPP    111011 ... -- ..... ..... 00101010 ..-  @XX3_at xa=%xx_xa
diff --git a/target/ppc/int_helper.c b/target/ppc/int_helper.c
index 8c1674510b..bd2f1a7c2a 100644
--- a/target/ppc/int_helper.c
+++ b/target/ppc/int_helper.c
@@ -782,6 +782,91 @@ VCT(uxs, cvtsduw, u32)
 VCT(sxs, cvtsdsw, s32)
 #undef VCT
 
+/*
+ * Packed VSX Integer GER Flags
+ * 00 - no accumulation no saturation
+ * 01 - accumulate but no saturation
+ * 10 - no accumulation but with saturation
+ * 11 - accumulate with saturation
+ */
+static inline bool get_sat(uint32_t flags)
+{
+    return flags & 0x2;
+}
+
+static inline bool get_acc(uint32_t flags)
+{
+    return flags & 0x1;
+}
+
+#define GET_VsrN(a, i) (extract32(a->VsrB((i) / 2), (i) % 2 ? 4 : 0, 4))
+#define GET_VsrB(a, i) a->VsrB(i)
+#define GET_VsrH(a, i) a->VsrH(i)
+
+#define GET_VsrSN(a, i) (sextract32(a->VsrSB((i) / 2), (i) % 2 ? 4 : 0, 4))
+#define GET_VsrSB(a, i) a->VsrSB(i)
+#define GET_VsrSH(a, i) a->VsrSH(i)
+
+#define XVIGER(NAME, RANK, EL)                                                 \
+    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,                    \
+              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)            \
+    {                                                                          \
+        ppc_vsr_t *a = cpu_vsr_ptr(env, a_r), *b = cpu_vsr_ptr(env, b_r), *at; \
+        bool sat = get_sat(packed_flags), acc = get_acc(packed_flags);         \
+        uint8_t pmsk = ger_get_pmsk(mask), xmsk = ger_get_xmsk(mask),          \
+                ymsk = ger_get_ymsk(mask);                                     \
+        uint8_t pmsk_bit, xmsk_bit, ymsk_bit;                                  \
+        int64_t psum;                                                          \
+        int32_t va, vb;                                                        \
+        int i, j, k;                                                           \
+        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {           \
+            at = cpu_vsr_ptr(env, at_r + i);                                   \
+            for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {       \
+                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {                  \
+                    psum = 0;                                                  \
+                    for (k = 0, pmsk_bit = 1 << (RANK - 1); k < RANK;          \
+                         k++, pmsk_bit >>= 1) {                                \
+                        if (pmsk_bit & pmsk) {                                 \
+                            va = (int32_t)GET_VsrS##EL(a, RANK * i + k);       \
+                            vb = (int32_t) ((RANK == 4) ?                      \
+                                                GET_Vsr##EL(b, RANK * j + k) : \
+                                                GET_VsrS##EL(b, RANK * j + k));\
+                            psum += va * vb;                                   \
+                        }                                                      \
+                    }                                                          \
+                    if (acc) {                                                 \
+                        psum += at->VsrSW(j);                                  \
+                    }                                                          \
+                    if (sat && psum > INT32_MAX) {                             \
+                        set_vscr_sat(env);                                     \
+                        at->VsrSW(j) = INT32_MAX;                              \
+                    } else if (sat && psum < INT32_MIN) {                      \
+                        set_vscr_sat(env);                                     \
+                        at->VsrSW(j) = INT32_MIN;                              \
+                    } else {                                                   \
+                        at->VsrSW(j) = (int32_t) psum;                         \
+                    }                                                          \
+                } else {                                                       \
+                    at->VsrSW(j) = 0;                                          \
+                }                                                              \
+            }                                                                  \
+        }                                                                      \
+    }
+
+XVIGER(helper_XVI4GER8, 8, N)
+XVIGER(helper_XVI8GER4, 4, B)
+XVIGER(helper_XVI16GER2, 2, H)
+
+#undef GER_MULT
+#undef XVIGER_NAME
+#undef XVIGER
+#undef GET_VsrN
+#undef GET_VsrB
+#undef GET_VsrH
+#undef GET_VsrSN
+#undef GET_VsrSB
+#undef GET_VsrSH
+
 target_ulong helper_vclzlsbb(ppc_avr_t *r)
 {
     target_ulong count = 0;
diff --git a/target/ppc/internal.h b/target/ppc/internal.h
index 8094e0b033..a994d98238 100644
--- a/target/ppc/internal.h
+++ b/target/ppc/internal.h
@@ -291,4 +291,32 @@ G_NORETURN void ppc_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
                                             uintptr_t retaddr);
 #endif
 
+/*
+ * Auxiliary functions to pack/unpack masks for GER instructions.
+ *
+ * Packed format:
+ *  Bits 0-3: xmsk
+ *  Bits 4-7: ymsk
+ *  Bits 8-15: pmsk
+ */
+static inline uint8_t ger_get_xmsk(uint32_t packed_masks)
+{
+    return packed_masks & 0xF;
+}
+
+static inline uint8_t ger_get_ymsk(uint32_t packed_masks)
+{
+    return (packed_masks >> 4) & 0xF;
+}
+
+static inline uint8_t ger_get_pmsk(uint32_t packed_masks)
+{
+    return (packed_masks >> 8) & 0xFF;
+}
+
+static inline int ger_pack_masks(int pmsk, int ymsk, int xmsk)
+{
+    return (pmsk & 0xFF) << 8 | (ymsk & 0xF) << 4 | (xmsk & 0xF);
+}
+
 #endif /* PPC_INTERNAL_H */
diff --git a/target/ppc/translate/vsx-impl.c.inc b/target/ppc/translate/vsx-impl.c.inc
index 919b889c40..1eb68c7081 100644
--- a/target/ppc/translate/vsx-impl.c.inc
+++ b/target/ppc/translate/vsx-impl.c.inc
@@ -2823,6 +2823,56 @@ static bool trans_XXSETACCZ(DisasContext *ctx, arg_X_a *a)
     return true;
 }
 
+/*
+ * Packed VSX Integer GER Flags
+ * 00 - no accumulation no saturation
+ * 01 - accumulate but no saturation
+ * 10 - no accumulation but with saturation
+ * 11 - accumulate with saturation
+ */
+static uint32_t pack_flags_xvi(int acc, int sat)
+{
+    return (sat << 1) | acc;
+}
+
+static bool do_ger_XX3(DisasContext *ctx, arg_XX3 *a, uint32_t op,
+                             void (*helper)(TCGv_env, TCGv_i32, TCGv_i32,
+                                            TCGv_i32, TCGv_i32, TCGv_i32))
+{
+    uint32_t mask;
+    REQUIRE_INSNS_FLAGS2(ctx, ISA310);
+    REQUIRE_VSX(ctx);
+    if (unlikely((a->xa / 4 == a->xt / 4) || (a->xb / 4 == a->xt / 4))) {
+        gen_invalid(ctx);
+        return true;
+    }
+
+    mask = 0xFFFFFFFF;
+    helper(cpu_env, tcg_constant_i32(a->xa), tcg_constant_i32(a->xb),
+           tcg_constant_i32(a->xt), tcg_constant_i32(mask),
+           tcg_constant_i32(op));
+    return true;
+}
+
+/* Used to keep line length < 80 */
+#define GER_NOP pack_flags_xvi(0, 0)
+#define GER_PP  pack_flags_xvi(1, 0)
+#define GER_SAT pack_flags_xvi(0, 1)
+#define GER_SPP pack_flags_xvi(1, 1)
+TRANS(XVI4GER8, do_ger_XX3, GER_NOP, gen_helper_XVI4GER8)
+TRANS(XVI4GER8PP, do_ger_XX3, GER_PP, gen_helper_XVI4GER8)
+TRANS(XVI8GER4, do_ger_XX3, GER_NOP, gen_helper_XVI8GER4)
+TRANS(XVI8GER4PP, do_ger_XX3, GER_PP, gen_helper_XVI8GER4)
+TRANS(XVI8GER4SPP, do_ger_XX3, GER_SPP, gen_helper_XVI8GER4)
+TRANS(XVI16GER2, do_ger_XX3, GER_NOP, gen_helper_XVI16GER2)
+TRANS(XVI16GER2PP, do_ger_XX3, GER_PP, gen_helper_XVI16GER2)
+TRANS(XVI16GER2S, do_ger_XX3, GER_SAT, gen_helper_XVI16GER2)
+TRANS(XVI16GER2SPP, do_ger_XX3, GER_SPP, gen_helper_XVI16GER2)
+#undef GER_NOP
+#undef GER_PP
+#undef GER_SAT
+#undef GER_SPP
+
 #undef GEN_XX2FORM
 #undef GEN_XX3FORM
 #undef GEN_XX2IFORM
-- 
2.31.1




* [RFC PATCH 3/7] target/ppc: Implemented pmxvi*ger* instructions
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
  2022-04-26 12:50 ` [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz Lucas Mateus Castro(alqotel)
  2022-04-26 12:50 ` [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions Lucas Mateus Castro(alqotel)
@ 2022-04-26 12:50 ` Lucas Mateus Castro(alqotel)
  2022-04-26 12:50 ` [RFC PATCH 4/7] target/ppc: Implemented xvf*ger* Lucas Mateus Castro(alqotel)
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, Greg Kurz,
	open list:All patches CC here, Lucas Mateus Castro (alqotel),
	Cédric Le Goater, David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Implement the following PowerISA v3.1 instructions:
pmxvi4ger8:     Prefixed Masked VSX Vector 4-bit Signed Integer GER
(rank-8 update)
pmxvi4ger8pp:   Prefixed Masked VSX Vector 4-bit Signed Integer GER
(rank-8 update) Positive multiply, Positive accumulate
pmxvi8ger4:     Prefixed Masked VSX Vector 8-bit Signed/Unsigned Integer
GER (rank-4 update)
pmxvi8ger4pp:   Prefixed Masked VSX Vector 8-bit Signed/Unsigned Integer
GER (rank-4 update) Positive multiply, Positive accumulate
pmxvi8ger4spp:  Prefixed Masked VSX Vector 8-bit Signed/Unsigned Integer
GER (rank-4 update) with Saturate Positive multiply, Positive accumulate
pmxvi16ger2:    Prefixed Masked VSX Vector 16-bit Signed Integer GER
(rank-2 update)
pmxvi16ger2pp:  Prefixed Masked VSX Vector 16-bit Signed Integer GER
(rank-2 update) Positive multiply, Positive accumulate
pmxvi16ger2s:   Prefixed Masked VSX Vector 16-bit Signed Integer GER
(rank-2 update) with Saturation
pmxvi16ger2spp: Prefixed Masked VSX Vector 16-bit Signed Integer GER
(rank-2 update) with Saturation Positive multiply, Positive accumulate

Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
---
 target/ppc/insn64.decode            | 30 +++++++++++++++++++++++++++++
 target/ppc/translate/vsx-impl.c.inc | 28 +++++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 2 deletions(-)
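
Not part of the commit: a minimal standalone sketch of the packed mask
layout that the translator hands to the helpers (bits 0-3 xmsk, bits 4-7
ymsk, bits 8-15 pmsk, as documented in internal.h in patch 2/7). The
`sketch_*` names are hypothetical restatements, not the functions from the
series.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the three masks: pmsk in bits 8-15, ymsk in 4-7, xmsk in 0-3. */
static uint32_t sketch_pack_masks(uint32_t pmsk, uint32_t ymsk, uint32_t xmsk)
{
    assert(pmsk <= 0xFF && ymsk <= 0xF && xmsk <= 0xF);
    return (pmsk << 8) | (ymsk << 4) | xmsk;
}

/* Unpack helpers, one per field. */
static uint8_t sketch_xmsk(uint32_t packed)
{
    return packed & 0xF;
}

static uint8_t sketch_ymsk(uint32_t packed)
{
    return (packed >> 4) & 0xF;
}

static uint8_t sketch_pmsk(uint32_t packed)
{
    return (packed >> 8) & 0xFF;
}
```

The non-prefixed instructions go through the same path with every mask bit
set (pmsk = 0xFF, ymsk = 0xF, xmsk = 0xF), which packs to 0xFFFF.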

diff --git a/target/ppc/insn64.decode b/target/ppc/insn64.decode
index 691e8fe6c0..18915f1977 100644
--- a/target/ppc/insn64.decode
+++ b/target/ppc/insn64.decode
@@ -68,6 +68,15 @@
                 ...... ..... ..... ..... ..... .. ....   \
                 &8RR_XX4_uim3 xt=%8rr_xx_xt xa=%8rr_xx_xa xb=%8rr_xx_xb xc=%8rr_xx_xc
 
+# Format MMIRR:XX3
+&MMIRR_XX3      xa xb xt pmsk xmsk ymsk
+%xx3_xa         2:1 16:5
+%xx3_xb         1:1 11:5
+%xx3_at         23:3 !function=times_4
+@MMIRR_XX3      ...... .. .... .. . . ........ xmsk:4 ymsk:4  \
+                ...... ... .. ..... ..... ........ ...  \
+                &MMIRR_XX3 xa=%xx3_xa xb=%xx3_xb xt=%xx3_at
+
 ### Fixed-Point Load Instructions
 
 PLBZ            000001 10 0--.-- .................. \
@@ -115,6 +124,27 @@ PSTFS           000001 10 0--.-- .................. \
 PSTFD           000001 10 0--.-- .................. \
                 110110 ..... ..... ................     @PLS_D
 
+## Vector GER instruction
+
+PMXVI4GER8      000001 11 1001 -- - - pmsk:8 ........              \
+                111011 ... -- ..... ..... 00100011 ..-  @MMIRR_XX3
+PMXVI4GER8PP    000001 11 1001 -- - - pmsk:8 ........              \
+                111011 ... -- ..... ..... 00100010 ..-  @MMIRR_XX3
+PMXVI8GER4      000001 11 1001 -- - - pmsk:4 ---- ........         \
+                111011 ... -- ..... ..... 00000011 ..-  @MMIRR_XX3
+PMXVI8GER4PP    000001 11 1001 -- - - pmsk:4 ---- ........         \
+                111011 ... -- ..... ..... 00000010 ..-  @MMIRR_XX3
+PMXVI16GER2     000001 11 1001 -- - - pmsk:2 ------ ........       \
+                111011 ... -- ..... ..... 01001011 ..-  @MMIRR_XX3
+PMXVI16GER2PP   000001 11 1001 -- - - pmsk:2 ------ ........       \
+                111011 ... -- ..... ..... 01101011 ..-  @MMIRR_XX3
+PMXVI8GER4SPP   000001 11 1001 -- - - pmsk:4 ---- ........         \
+                111011 ... -- ..... ..... 01100011 ..-  @MMIRR_XX3
+PMXVI16GER2S    000001 11 1001 -- - - pmsk:2 ------ ........       \
+                111011 ... -- ..... ..... 00101011 ..-  @MMIRR_XX3
+PMXVI16GER2SPP  000001 11 1001 -- - - pmsk:2 ------ ........       \
+                111011 ... -- ..... ..... 00101010 ..-  @MMIRR_XX3
+
 ### Prefixed No-operation Instruction
 
 @PNOP           000001 11 0000-- 000000000000000000     \
diff --git a/target/ppc/translate/vsx-impl.c.inc b/target/ppc/translate/vsx-impl.c.inc
index 1eb68c7081..eb7b8cb0c6 100644
--- a/target/ppc/translate/vsx-impl.c.inc
+++ b/target/ppc/translate/vsx-impl.c.inc
@@ -2835,7 +2835,7 @@ static uint32_t pack_flags_xvi(int acc, int sat)
     return (sat << 1) | acc;
 }
 
-static bool do_ger_XX3(DisasContext *ctx, arg_XX3 *a, uint32_t op,
+static bool do_ger_MMIRR_XX3(DisasContext *ctx, arg_MMIRR_XX3 *a, uint32_t op,
                              void (*helper)(TCGv_env, TCGv_i32, TCGv_i32,
                                             TCGv_i32, TCGv_i32, TCGv_i32))
 {
@@ -2847,11 +2847,25 @@ static bool do_ger_XX3(DisasContext *ctx, arg_XX3 *a, uint32_t op,
         return true;
     }
 
-    mask = 0xFFFFFFFF;
+    mask = ger_pack_masks(a->pmsk, a->ymsk, a->xmsk);
     helper(cpu_env, tcg_constant_i32(a->xa), tcg_constant_i32(a->xb),
            tcg_constant_i32(a->xt), tcg_constant_i32(mask),
            tcg_constant_i32(op));
     return true;
+}
+
+static bool do_ger_XX3(DisasContext *ctx, arg_XX3 *a, uint32_t op_flags,
+                       void (*helper)(TCGv_env, TCGv_i32, TCGv_i32,
+                                      TCGv_i32, TCGv_i32, TCGv_i32))
+{
+    arg_MMIRR_XX3 m;
+    m.xa = a->xa;
+    m.xb = a->xb;
+    m.xt = a->xt;
+    m.pmsk = 0xFF;
+    m.ymsk = 0xF;
+    m.xmsk = 0xF;
+    return do_ger_MMIRR_XX3(ctx, &m, op_flags, helper);
 }
 
 /* Used to keep line length < 80 */
@@ -2868,6 +2882,16 @@ TRANS(XVI16GER2, do_ger_XX3, GER_NOP, gen_helper_XVI16GER2)
 TRANS(XVI16GER2PP, do_ger_XX3, GER_PP, gen_helper_XVI16GER2)
 TRANS(XVI16GER2S, do_ger_XX3, GER_SAT, gen_helper_XVI16GER2)
 TRANS(XVI16GER2SPP, do_ger_XX3, GER_SPP, gen_helper_XVI16GER2)
+
+TRANS64(PMXVI4GER8, do_ger_MMIRR_XX3, GER_NOP, gen_helper_XVI4GER8)
+TRANS64(PMXVI4GER8PP, do_ger_MMIRR_XX3, GER_PP, gen_helper_XVI4GER8)
+TRANS64(PMXVI8GER4, do_ger_MMIRR_XX3, GER_NOP, gen_helper_XVI8GER4)
+TRANS64(PMXVI8GER4PP, do_ger_MMIRR_XX3, GER_PP, gen_helper_XVI8GER4)
+TRANS64(PMXVI8GER4SPP, do_ger_MMIRR_XX3, GER_SPP, gen_helper_XVI8GER4)
+TRANS64(PMXVI16GER2, do_ger_MMIRR_XX3, GER_NOP, gen_helper_XVI16GER2)
+TRANS64(PMXVI16GER2PP, do_ger_MMIRR_XX3, GER_PP, gen_helper_XVI16GER2)
+TRANS64(PMXVI16GER2S, do_ger_MMIRR_XX3, GER_SAT, gen_helper_XVI16GER2)
+TRANS64(PMXVI16GER2SPP, do_ger_MMIRR_XX3, GER_SPP, gen_helper_XVI16GER2)
 #undef GER_NOP
 #undef GER_PP
 #undef GER_SAT
-- 
2.31.1




* [RFC PATCH 4/7] target/ppc: Implemented xvf*ger*
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
                   ` (2 preceding siblings ...)
  2022-04-26 12:50 ` [RFC PATCH 3/7] target/ppc: Implemented pmxvi*ger* instructions Lucas Mateus Castro(alqotel)
@ 2022-04-26 12:50 ` Lucas Mateus Castro(alqotel)
  2022-04-27  0:09   ` Richard Henderson
  2022-04-26 12:50 ` [RFC PATCH 5/7] target/ppc: Implemented xvf16ger* Lucas Mateus Castro(alqotel)
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, Greg Kurz,
	open list:All patches CC here, Lucas Mateus Castro (alqotel),
	Cédric Le Goater, David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Implement the following PowerISA v3.1 instructions:
xvf32ger:   VSX Vector 32-bit Floating-Point GER (rank-1 update)
xvf32gernn: VSX Vector 32-bit Floating-Point GER (rank-1 update) Negative
multiply, Negative accumulate
xvf32gernp: VSX Vector 32-bit Floating-Point GER (rank-1 update) Negative
multiply, Positive accumulate
xvf32gerpn: VSX Vector 32-bit Floating-Point GER (rank-1 update) Positive
multiply, Negative accumulate
xvf32gerpp: VSX Vector 32-bit Floating-Point GER (rank-1 update) Positive
multiply, Positive accumulate
xvf64ger:   VSX Vector 64-bit Floating-Point GER (rank-1 update)
xvf64gernn: VSX Vector 64-bit Floating-Point GER (rank-1 update) Negative
multiply, Negative accumulate
xvf64gernp: VSX Vector 64-bit Floating-Point GER (rank-1 update) Negative
multiply, Positive accumulate
xvf64gerpn: VSX Vector 64-bit Floating-Point GER (rank-1 update) Positive
multiply, Negative accumulate
xvf64gerpp: VSX Vector 64-bit Floating-Point GER (rank-1 update) Positive
multiply, Positive accumulate

Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
---
 target/ppc/cpu.h                    |  4 ++
 target/ppc/fpu_helper.c             | 64 +++++++++++++++++++++++++++++
 target/ppc/helper.h                 |  2 +
 target/ppc/insn32.decode            | 13 ++++++
 target/ppc/translate/vsx-impl.c.inc | 39 ++++++++++++++++++
 5 files changed, 122 insertions(+)
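
Not part of the commit: a minimal standalone sketch of how the four sign
variants (pp, pn, np, nn) reduce to two negations around one multiply-add,
which is how the helper picks float_muladd_negate_c and
float_muladd_negate_result. The names `ger_fp_element` and
`ger_fp_selfcheck` are hypothetical, and the sketch uses an unfused
multiply-add on host doubles, so rounding and exception behaviour differ
from the softfloat calls in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * One element update: neg_mul negates the product, neg_acc negates the
 * accumulator term. Negating the whole result flips both signs, so the
 * addend is pre-negated only when the two flags differ.
 */
static double ger_fp_element(double a, double b, double acc,
                             bool neg_mul, bool neg_acc)
{
    double c = (neg_mul != neg_acc) ? -acc : acc; /* negate_c */
    double r = a * b + c;
    return neg_mul ? -r : r;                      /* negate_result */
}

/* Spot-check the four variants with exactly representable values. */
static void ger_fp_selfcheck(void)
{
    assert(ger_fp_element(2.0, 3.0, 1.0, false, false) ==  7.0); /* pp */
    assert(ger_fp_element(2.0, 3.0, 1.0, false, true)  ==  5.0); /* pn */
    assert(ger_fp_element(2.0, 3.0, 1.0, true,  false) == -5.0); /* np */
    assert(ger_fp_element(2.0, 3.0, 1.0, true,  true)  == -7.0); /* nn */
}
```

With a = 2, b = 3 and acc = 1 the four variants give a*b + acc, a*b - acc,
-(a*b) + acc and -(a*b) - acc respectively.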

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index ee55c6cfa2..b5d7b35dda 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -2652,6 +2652,8 @@ static inline bool lsw_reg_in_range(int start, int nregs, int rx)
 #define VsrSW(i) s32[i]
 #define VsrD(i) u64[i]
 #define VsrSD(i) s64[i]
+#define VsrSF(i) f32[i]
+#define VsrDF(i) f64[i]
 #else
 #define VsrB(i) u8[15 - (i)]
 #define VsrSB(i) s8[15 - (i)]
@@ -2661,6 +2663,8 @@ static inline bool lsw_reg_in_range(int start, int nregs, int rx)
 #define VsrSW(i) s32[3 - (i)]
 #define VsrD(i) u64[1 - (i)]
 #define VsrSD(i) s64[1 - (i)]
+#define VsrSF(i) f32[3 - (i)]
+#define VsrDF(i) f64[1 - (i)]
 #endif
 
 static inline int vsr64_offset(int i, bool high)
diff --git a/target/ppc/fpu_helper.c b/target/ppc/fpu_helper.c
index 99281cc37a..6b03666d09 100644
--- a/target/ppc/fpu_helper.c
+++ b/target/ppc/fpu_helper.c
@@ -3462,3 +3462,67 @@ void helper_xssubqp(CPUPPCState *env, uint32_t opcode,
     *xt = t;
     do_float_check_status(env, GETPC());
 }
+
+static inline bool ger_acc_flag(uint32_t flag)
+{
+    return flag & 0x1;
+}
+
+static inline bool ger_neg_mul_flag(uint32_t flag)
+{
+    return flag & 0x2;
+}
+
+static inline bool ger_neg_acc_flag(uint32_t flag)
+{
+    return flag & 0x4;
+}
+
+#define VSXGER(NAME, TYPE, EL)                                          \
+    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,             \
+              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)     \
+    {                                                                   \
+        ppc_vsr_t *a, *b, *at;                                          \
+        TYPE aux_acc, va, vb;                                           \
+        int i, j, xmsk_bit, ymsk_bit, op_flags;                         \
+        uint8_t xmsk = mask & 0x0F;                                     \
+        uint8_t ymsk = (mask >> 4) & 0x0F;                              \
+        int ymax = MIN(4, 128 / (sizeof(TYPE) * 8));                    \
+        b = cpu_vsr_ptr(env, b_r);                                      \
+        float_status *excp_ptr = &env->fp_status;                       \
+        bool acc = ger_acc_flag(packed_flags);                          \
+        bool neg_acc = ger_neg_acc_flag(packed_flags);                  \
+        bool neg_mul = ger_neg_mul_flag(packed_flags);                  \
+        helper_reset_fpstatus(env);                                     \
+        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {    \
+            a = cpu_vsr_ptr(env, a_r + i / ymax);                       \
+            at = cpu_vsr_ptr(env, at_r + i);                            \
+            for (j = 0, ymsk_bit = 1 << (ymax - 1); j < ymax;           \
+                 j++, ymsk_bit >>= 1) {                                 \
+                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {           \
+                    op_flags = (neg_acc ^ neg_mul) ?                    \
+                                          float_muladd_negate_c : 0;    \
+                    op_flags |= (neg_mul) ?                             \
+                                     float_muladd_negate_result : 0;    \
+                    va = a->Vsr##EL(i % ymax);                          \
+                    vb = b->Vsr##EL(j);                                 \
+                    aux_acc = at->Vsr##EL(j);                           \
+                    if (acc) {                                          \
+                        at->Vsr##EL(j) = TYPE##_muladd(va, vb, aux_acc, \
+                                                       op_flags,        \
+                                                       excp_ptr);       \
+                    } else {                                            \
+                        at->Vsr##EL(j) = TYPE##_mul(va, vb, excp_ptr);  \
+                    }                                                   \
+                } else {                                                \
+                    at->Vsr##EL(j) = 0;                                 \
+                }                                                       \
+            }                                                           \
+        }                                                               \
+        do_float_check_status(env, GETPC());                            \
+    }
+
+VSXGER(helper_XVF32GER, float32, SF)
+VSXGER(helper_XVF64GER, float64, DF)
+
+#undef VSXGER
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 06553517de..7d725292b1 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -540,6 +540,8 @@ DEF_HELPER_5(XXBLENDVD, void, vsr, vsr, vsr, vsr, i32)
 DEF_HELPER_6(XVI4GER8, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVI8GER4, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVI16GER2, void, env, i32, i32, i32, i32, i32)
+DEF_HELPER_6(XVF32GER, void, env, i32, i32, i32, i32, i32)
+DEF_HELPER_6(XVF64GER, void, env, i32, i32, i32, i32, i32)
 
 DEF_HELPER_2(efscfsi, i32, env, i32)
 DEF_HELPER_2(efscfui, i32, env, i32)
diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index 653f50db93..9652ca286c 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -171,6 +171,7 @@
 @XX3            ...... ..... ..... ..... ........ ...           &XX3 xt=%xx_xt xa=%xx_xa xb=%xx_xb
 
 %xx_at          23:3 !function=times_4
+%xx_xa_pair     2:1 17:4 !function=times_2
 @XX3_at         ...... ... .. ..... ..... ........ ...          &XX3 xt=%xx_at xb=%xx_xb
 
 &XX3_dm         xt xa xb dm
@@ -734,3 +735,15 @@ XVI16GER2PP     111011 ... -- ..... ..... 01101011 ..-  @XX3_at xa=%xx_xa
 XVI8GER4SPP     111011 ... -- ..... ..... 01100011 ..-  @XX3_at xa=%xx_xa
 XVI16GER2S      111011 ... -- ..... ..... 00101011 ..-  @XX3_at xa=%xx_xa
 XVI16GER2SPP    111011 ... -- ..... ..... 00101010 ..-  @XX3_at xa=%xx_xa
+
+XVF32GER        111011 ... -- ..... ..... 00011011 ..-  @XX3_at xa=%xx_xa
+XVF32GERPP      111011 ... -- ..... ..... 00011010 ..-  @XX3_at xa=%xx_xa
+XVF32GERPN      111011 ... -- ..... ..... 10011010 ..-  @XX3_at xa=%xx_xa
+XVF32GERNP      111011 ... -- ..... ..... 01011010 ..-  @XX3_at xa=%xx_xa
+XVF32GERNN      111011 ... -- ..... ..... 11011010 ..-  @XX3_at xa=%xx_xa
+
+XVF64GER        111011 ... -- .... 0 ..... 00111011 ..-  @XX3_at xa=%xx_xa_pair
+XVF64GERPP      111011 ... -- .... 0 ..... 00111010 ..-  @XX3_at xa=%xx_xa_pair
+XVF64GERPN      111011 ... -- .... 0 ..... 10111010 ..-  @XX3_at xa=%xx_xa_pair
+XVF64GERNP      111011 ... -- .... 0 ..... 01111010 ..-  @XX3_at xa=%xx_xa_pair
+XVF64GERNN      111011 ... -- .... 0 ..... 11111010 ..-  @XX3_at xa=%xx_xa_pair
diff --git a/target/ppc/translate/vsx-impl.c.inc b/target/ppc/translate/vsx-impl.c.inc
index eb7b8cb0c6..b1fb0f31f3 100644
--- a/target/ppc/translate/vsx-impl.c.inc
+++ b/target/ppc/translate/vsx-impl.c.inc
@@ -2835,6 +2835,19 @@ static uint32_t pack_flags_xvi(int acc, int sat)
     return (sat << 1) | acc;
 }
 
+/*
+ * Packed VSX Floating Point GER Flags
+ * 000 - no accumulation
+ * 001 - positive accumulate, positive multiply
+ * 011 - positive accumulate, negative multiply
+ * 101 - negative accumulate, positive multiply
+ * 111 - negative accumulate, negative multiply
+ */
+static inline uint32_t ger_pack_flags_xvf(bool acc, bool nm, bool na)
+{
+    return (acc ? 0x1 : 0) | (nm ? 0x2 : 0) | (na ? 0x4 : 0);
+}
+
 static bool do_ger_MMIRR_XX3(DisasContext *ctx, arg_MMIRR_XX3 *a, uint32_t op,
                              void (*helper)(TCGv_env, TCGv_i32, TCGv_i32,
                                             TCGv_i32, TCGv_i32, TCGv_i32))
@@ -2897,6 +2910,32 @@ TRANS64(PMXVI16GER2SPP, do_ger_MMIRR_XX3, GER_SPP, gen_helper_XVI16GER2)
 #undef GER_SAT
 #undef GER_SPP
 
+/* To keep line size < 80 */
+#define GER_NOP ger_pack_flags_xvf(false, false, false)
+#define GER_PP ger_pack_flags_xvf(true, false, false)
+#define GER_NP ger_pack_flags_xvf(true, true, false)
+#define GER_PN ger_pack_flags_xvf(true, false, true)
+#define GER_NN ger_pack_flags_xvf(true, true, true)
+
+TRANS(XVF32GER, do_ger_XX3, GER_NOP, gen_helper_XVF32GER)
+TRANS(XVF32GERPP, do_ger_XX3, GER_PP, gen_helper_XVF32GER)
+TRANS(XVF32GERPN, do_ger_XX3, GER_PN, gen_helper_XVF32GER)
+TRANS(XVF32GERNP, do_ger_XX3, GER_NP, gen_helper_XVF32GER)
+TRANS(XVF32GERNN, do_ger_XX3, GER_NN, gen_helper_XVF32GER)
+
+TRANS(XVF64GER, do_ger_XX3, GER_NOP, gen_helper_XVF64GER)
+TRANS(XVF64GERPP, do_ger_XX3, GER_PP, gen_helper_XVF64GER)
+TRANS(XVF64GERPN, do_ger_XX3, GER_PN, gen_helper_XVF64GER)
+TRANS(XVF64GERNP, do_ger_XX3, GER_NP, gen_helper_XVF64GER)
+TRANS(XVF64GERNN, do_ger_XX3, GER_NN, gen_helper_XVF64GER)
+
+
+#undef GER_NOP
+#undef GER_PP
+#undef GER_NP
+#undef GER_PN
+#undef GER_NN
+
 #undef GEN_XX2FORM
 #undef GEN_XX3FORM
 #undef GEN_XX2IFORM
-- 
2.31.1




* [RFC PATCH 5/7] target/ppc: Implemented xvf16ger*
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
                   ` (3 preceding siblings ...)
  2022-04-26 12:50 ` [RFC PATCH 4/7] target/ppc: Implemented xvf*ger* Lucas Mateus Castro(alqotel)
@ 2022-04-26 12:50 ` Lucas Mateus Castro(alqotel)
  2022-04-27  0:26   ` Richard Henderson
  2022-04-26 12:50 ` [RFC PATCH 6/7] target/ppc: Implemented pmxvf*ger* Lucas Mateus Castro(alqotel)
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Peter Maydell, Daniel Henrique Barboza, richard.henderson,
	Greg Kurz, open list:All patches CC here,
	Lucas Mateus Castro (alqotel),
	Cédric Le Goater, Alex Bennée, Aurelien Jarno,
	David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Implement the following PowerISA v3.1 instructions:
xvf16ger2:   VSX Vector 16-bit Floating-Point GER (rank-2 update)
xvf16ger2nn: VSX Vector 16-bit Floating-Point GER (rank-2 update) Negative
multiply, Negative accumulate
xvf16ger2np: VSX Vector 16-bit Floating-Point GER (rank-2 update) Negative
multiply, Positive accumulate
xvf16ger2pn: VSX Vector 16-bit Floating-Point GER (rank-2 update) Positive
multiply, Negative accumulate
xvf16ger2pp: VSX Vector 16-bit Floating-Point GER (rank-2 update) Positive
multiply, Positive accumulate
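
For readers unfamiliar with the "rank-2 update" wording, the arithmetic the
five variants share can be sketched as below. This is only an illustration of
the sign conventions, following the structure of the VSXGER16 helper in this
patch. ger2_update() and its parameters are hypothetical names that do not
exist in QEMU, the inputs are assumed to be already widened to float, and
masks, NaN handling, FPSCR exception flags and the exact rounding sequence are
all left out.

```c
#include <stdbool.h>

/*
 * Illustrative only: ger2_update() is a hypothetical name, not part of the
 * patch. Each accumulator element receives the sum of two products
 * (hence "rank-2"); the flags select the pp/pn/np/nn variants.
 */
static void ger2_update(float acc[4][4], const float a[8], const float b[8],
                        bool accumulate, bool neg_mul, bool neg_acc)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            /* two products per accumulator element */
            float psum = a[2 * i] * b[2 * j] + a[2 * i + 1] * b[2 * j + 1];
            if (accumulate) {
                /* optionally negate the old accumulator and/or the product sum */
                float old = neg_acc ? -acc[i][j] : acc[i][j];
                acc[i][j] = (neg_mul ? -psum : psum) + old;
            } else {
                /* plain xvf16ger2: overwrite the accumulator */
                acc[i][j] = psum;
            }
        }
    }
}
```

In this sketch xvf16ger2pn would correspond to accumulate=true, neg_mul=false,
neg_acc=true, matching the GER_PN flag packing used by the translation code.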

Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
---
 include/fpu/softfloat.h             |  9 ++++
 target/ppc/cpu.h                    |  3 ++
 target/ppc/fpu_helper.c             | 65 +++++++++++++++++++++++++++++
 target/ppc/helper.h                 |  1 +
 target/ppc/insn32.decode            |  6 +++
 target/ppc/translate/vsx-impl.c.inc |  6 +++
 6 files changed, 90 insertions(+)

diff --git a/include/fpu/softfloat.h b/include/fpu/softfloat.h
index 3dcf20e3a2..63d7ff18f0 100644
--- a/include/fpu/softfloat.h
+++ b/include/fpu/softfloat.h
@@ -619,6 +619,15 @@ static inline float32 float32_chs(float32 a)
     return make_float32(float32_val(a) ^ 0x80000000);
 }
 
+static inline float32 float32_neg(float32 a)
+{
+    if (((a & 0x7f800000) == 0x7f800000) && (a & 0x007fffff)) {
+        return a;
+    } else {
+        return float32_chs(a);
+    }
+}
+
 static inline bool float32_is_infinity(float32 a)
 {
     return (float32_val(a) & 0x7fffffff) == 0x7f800000;
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index b5d7b35dda..91167f8cc0 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -225,6 +225,7 @@ typedef union _ppc_vsr_t {
     int16_t s16[8];
     int32_t s32[4];
     int64_t s64[2];
+    float16 f16[8];
     float32 f32[4];
     float64 f64[2];
     float128 f128;
@@ -2652,6 +2653,7 @@ static inline bool lsw_reg_in_range(int start, int nregs, int rx)
 #define VsrSW(i) s32[i]
 #define VsrD(i) u64[i]
 #define VsrSD(i) s64[i]
+#define VsrHF(i) f16[i]
 #define VsrSF(i) f32[i]
 #define VsrDF(i) f64[i]
 #else
@@ -2663,6 +2665,7 @@ static inline bool lsw_reg_in_range(int start, int nregs, int rx)
 #define VsrSW(i) s32[3 - (i)]
 #define VsrD(i) u64[1 - (i)]
 #define VsrSD(i) s64[1 - (i)]
+#define VsrHF(i) f16[7 - (i)]
 #define VsrSF(i) f32[3 - (i)]
 #define VsrDF(i) f64[1 - (i)]
 #endif
diff --git a/target/ppc/fpu_helper.c b/target/ppc/fpu_helper.c
index 6b03666d09..c3aead642a 100644
--- a/target/ppc/fpu_helper.c
+++ b/target/ppc/fpu_helper.c
@@ -3478,6 +3478,67 @@ static inline bool ger_neg_acc_flag(uint32_t flag)
     return flag & 0x4;
 }
 
+#define float16_to_float32(A, PTR) float16_to_float32(A, true, PTR)
+
+#define GET_VSR(VSR, A, I, SRC_T, TARGET_T)                             \
+    SRC_T##_to_##TARGET_T(A->VSR(I), excp_ptr)
+
+#define VSXGER16(NAME, ORIG_T, OR_EL)                                   \
+    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,             \
+              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)     \
+    {                                                                   \
+        ppc_vsr_t *at;                                                  \
+        float32 psum, aux_acc, va, vb, vc, vd;                          \
+        int i, j, xmsk_bit, ymsk_bit;                                   \
+        uint8_t xmsk = mask & 0x0F;                                     \
+        uint8_t ymsk = (mask >> 4) & 0x0F;                              \
+        uint8_t pmsk = (mask >> 8) & 0x3;                               \
+        ppc_vsr_t *b = cpu_vsr_ptr(env, b_r);                           \
+        ppc_vsr_t *a = cpu_vsr_ptr(env, a_r);                           \
+        float_status *excp_ptr = &env->fp_status;                       \
+        bool acc = ger_acc_flag(packed_flags);                          \
+        bool neg_acc = ger_neg_acc_flag(packed_flags);                  \
+        bool neg_mul = ger_neg_mul_flag(packed_flags);                  \
+        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {    \
+            at = cpu_vsr_ptr(env, at_r + i);                            \
+            for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {\
+                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {           \
+                    va = !(pmsk & 2) ? float32_zero :                   \
+                                       GET_VSR(Vsr##OR_EL, a,           \
+                                               2 * i, ORIG_T, float32); \
+                    vb = !(pmsk & 2) ? float32_zero :                   \
+                                       GET_VSR(Vsr##OR_EL, b,           \
+                                               2 * j, ORIG_T, float32); \
+                    vc = !(pmsk & 1) ? float32_zero :                   \
+                                       GET_VSR(Vsr##OR_EL, a,           \
+                                            2 * i + 1, ORIG_T, float32);\
+                    vd = !(pmsk & 1) ? float32_zero :                   \
+                                       GET_VSR(Vsr##OR_EL, b,           \
+                                            2 * j + 1, ORIG_T, float32);\
+                    psum = float32_mul(va, vb, excp_ptr);               \
+                    psum = float32_muladd(vc, vd, psum, 0, excp_ptr);   \
+                    if (acc) {                                          \
+                        if (neg_mul) {                                  \
+                            psum = float32_neg(psum);                   \
+                        }                                               \
+                        if (neg_acc) {                                  \
+                            aux_acc = float32_neg(at->VsrSF(j));        \
+                        } else {                                        \
+                            aux_acc = at->VsrSF(j);                     \
+                        }                                               \
+                        at->VsrSF(j) = float32_add(psum, aux_acc,       \
+                                                   excp_ptr);           \
+                    } else {                                            \
+                        at->VsrSF(j) = psum;                            \
+                    }                                                   \
+                } else {                                                \
+                    at->VsrSF(j) = 0;                                   \
+                }                                                       \
+            }                                                           \
+        }                                                               \
+        do_float_check_status(env, GETPC());                            \
+    }
+
 #define VSXGER(NAME, TYPE, EL)                                          \
     void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,             \
               uint32_t  at_r, uint32_t mask, uint32_t packed_flags)     \
@@ -3522,7 +3583,11 @@ static inline bool ger_neg_acc_flag(uint32_t flag)
         do_float_check_status(env, GETPC());                            \
     }
 
+VSXGER16(helper_XVF16GER2, float16, HF)
 VSXGER(helper_XVF32GER, float32, SF)
 VSXGER(helper_XVF64GER, float64, DF)
 
+#undef VSXGER16
 #undef VSXGER
+#undef GET_VSR
+#undef float16_to_float32
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 7d725292b1..cc59a3b71d 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -540,6 +540,7 @@ DEF_HELPER_5(XXBLENDVD, void, vsr, vsr, vsr, vsr, i32)
 DEF_HELPER_6(XVI4GER8, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVI8GER4, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVI16GER2, void, env, i32, i32, i32, i32, i32)
+DEF_HELPER_6(XVF16GER2, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVF32GER, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVF64GER, void, env, i32, i32, i32, i32, i32)
 
diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index 9652ca286c..a204730d1d 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -736,6 +736,12 @@ XVI8GER4SPP     111011 ... -- ..... ..... 01100011 ..-  @XX3_at xa=%xx_xa
 XVI16GER2S      111011 ... -- ..... ..... 00101011 ..-  @XX3_at xa=%xx_xa
 XVI16GER2SPP    111011 ... -- ..... ..... 00101010 ..-  @XX3_at xa=%xx_xa
 
+XVF16GER2       111011 ... -- ..... ..... 00010011 ..-  @XX3_at xa=%xx_xa
+XVF16GER2PP     111011 ... -- ..... ..... 00010010 ..-  @XX3_at xa=%xx_xa
+XVF16GER2PN     111011 ... -- ..... ..... 10010010 ..-  @XX3_at xa=%xx_xa
+XVF16GER2NP     111011 ... -- ..... ..... 01010010 ..-  @XX3_at xa=%xx_xa
+XVF16GER2NN     111011 ... -- ..... ..... 11010010 ..-  @XX3_at xa=%xx_xa
+
 XVF32GER        111011 ... -- ..... ..... 00011011 ..-  @XX3_at xa=%xx_xa
 XVF32GERPP      111011 ... -- ..... ..... 00011010 ..-  @XX3_at xa=%xx_xa
 XVF32GERPN      111011 ... -- ..... ..... 10011010 ..-  @XX3_at xa=%xx_xa
diff --git a/target/ppc/translate/vsx-impl.c.inc b/target/ppc/translate/vsx-impl.c.inc
index b1fb0f31f3..9285e27159 100644
--- a/target/ppc/translate/vsx-impl.c.inc
+++ b/target/ppc/translate/vsx-impl.c.inc
@@ -2917,6 +2917,12 @@ TRANS64(PMXVI16GER2SPP, do_ger_MMIRR_XX3, GER_SPP, gen_helper_XVI16GER2)
 #define GER_PN ger_pack_flags_xvf(true, false, true)
 #define GER_NN ger_pack_flags_xvf(true, true, true)
 
+TRANS(XVF16GER2, do_ger_XX3, GER_NOP, gen_helper_XVF16GER2)
+TRANS(XVF16GER2PP, do_ger_XX3, GER_PP, gen_helper_XVF16GER2)
+TRANS(XVF16GER2PN, do_ger_XX3, GER_PN, gen_helper_XVF16GER2)
+TRANS(XVF16GER2NP, do_ger_XX3, GER_NP, gen_helper_XVF16GER2)
+TRANS(XVF16GER2NN, do_ger_XX3, GER_NN, gen_helper_XVF16GER2)
+
 TRANS(XVF32GER, do_ger_XX3, GER_NOP, gen_helper_XVF32GER)
 TRANS(XVF32GERPP, do_ger_XX3, GER_PP, gen_helper_XVF32GER)
 TRANS(XVF32GERPN, do_ger_XX3, GER_PN, gen_helper_XVF32GER)
-- 
2.31.1




* [RFC PATCH 6/7] target/ppc: Implemented pmxvf*ger*
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
                   ` (4 preceding siblings ...)
  2022-04-26 12:50 ` [RFC PATCH 5/7] target/ppc: Implemented xvf16ger* Lucas Mateus Castro(alqotel)
@ 2022-04-26 12:50 ` Lucas Mateus Castro(alqotel)
  2022-04-27  0:33   ` Richard Henderson
  2022-04-26 12:50 ` [RFC PATCH 7/7] target/ppc: Implemented [pm]xvbf16ger2* Lucas Mateus Castro(alqotel)
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, Greg Kurz,
	open list:All patches CC here, Lucas Mateus Castro (alqotel),
	Cédric Le Goater, David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Implement the following PowerISA v3.1 instructions:
pmxvf16ger2:   Prefixed Masked VSX Vector 16-bit Floating-Point GER
(rank-2 update)
pmxvf16ger2nn: Prefixed Masked VSX Vector 16-bit Floating-Point GER
(rank-2 update) Negative multiply, Negative accumulate
pmxvf16ger2np: Prefixed Masked VSX Vector 16-bit Floating-Point GER
(rank-2 update) Negative multiply, Positive accumulate
pmxvf16ger2pn: Prefixed Masked VSX Vector 16-bit Floating-Point GER
(rank-2 update) Positive multiply, Negative accumulate
pmxvf16ger2pp: Prefixed Masked VSX Vector 16-bit Floating-Point GER
(rank-2 update) Positive multiply, Positive accumulate
pmxvf32ger:    Prefixed Masked VSX Vector 32-bit Floating-Point GER
(rank-1 update)
pmxvf32gernn:  Prefixed Masked VSX Vector 32-bit Floating-Point GER
(rank-1 update) Negative multiply, Negative accumulate
pmxvf32gernp:  Prefixed Masked VSX Vector 32-bit Floating-Point GER
(rank-1 update) Negative multiply, Positive accumulate
pmxvf32gerpn:  Prefixed Masked VSX Vector 32-bit Floating-Point GER
(rank-1 update) Positive multiply, Negative accumulate
pmxvf32gerpp:  Prefixed Masked VSX Vector 32-bit Floating-Point GER
(rank-1 update) Positive multiply, Positive accumulate
pmxvf64ger:    Prefixed Masked VSX Vector 64-bit Floating-Point GER
(rank-1 update)
pmxvf64gernn:  Prefixed Masked VSX Vector 64-bit Floating-Point GER
(rank-1 update) Negative multiply, Negative accumulate
pmxvf64gernp:  Prefixed Masked VSX Vector 64-bit Floating-Point GER
(rank-1 update) Negative multiply, Positive accumulate
pmxvf64gerpn:  Prefixed Masked VSX Vector 64-bit Floating-Point GER
(rank-1 update) Positive multiply, Negative accumulate
pmxvf64gerpp:  Prefixed Masked VSX Vector 64-bit Floating-Point GER
(rank-1 update) Positive multiply, Positive accumulate
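
As a rough sketch of what the xmsk/ymsk fields do in the masked forms, the
non-accumulating 32-bit case could be modelled as below. It follows the
convention of the helpers earlier in this series, where the most significant
mask bit selects index 0 and elements whose row or column is masked off are
written as zero. masked_ger_f32() is a hypothetical name used only for
illustration; rounding modes and exception flags are ignored.

```c
#include <stdint.h>

/*
 * Illustrative only: masked_ger_f32() is a hypothetical name, not part of
 * the patch. Rank-1 update (one product per element) with row and column
 * masks, non-accumulating form.
 */
static void masked_ger_f32(float acc[4][4], const float a[4], const float b[4],
                           uint8_t xmsk, uint8_t ymsk)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            /* bit 3 of each mask selects index 0, bit 0 selects index 3 */
            int row_on = xmsk & (8 >> i);
            int col_on = ymsk & (8 >> j);
            /* masked-off elements are zeroed, not preserved */
            acc[i][j] = (row_on && col_on) ? a[i] * b[j] : 0.0f;
        }
    }
}
```

With xmsk = 0x8 and ymsk = 0x3, only acc[0][2] and acc[0][3] receive a product
in this model; every other element becomes zero.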

Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
---
 target/ppc/insn64.decode            | 39 +++++++++++++++++++++++++++++
 target/ppc/translate/vsx-impl.c.inc | 33 ++++++++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/target/ppc/insn64.decode b/target/ppc/insn64.decode
index 18915f1977..bc5e4dfe1a 100644
--- a/target/ppc/insn64.decode
+++ b/target/ppc/insn64.decode
@@ -73,10 +73,16 @@
 %xx3_xa         2:1 16:5
 %xx3_xb         1:1 11:5
 %xx3_at         23:3 !function=times_4
+%xx3_xa_pair    2:1 17:4 !function=times_2
 @MMIRR_XX3      ...... .. .... .. . . ........ xmsk:4 ymsk:4  \
                 ...... ... .. ..... ..... ........ ...  \
                 &MMIRR_XX3 xa=%xx3_xa xb=%xx3_xb xt=%xx3_at
 
+&MMIRR_XX3_NO_P xa xb xt xmsk ymsk
+@MMIRR_XX3_NO_P ...... .. .... .. . . ........ xmsk:4 .... \
+                ...... ... .. ..... ..... ........ ... \
+                &MMIRR_XX3_NO_P xb=%xx3_xb xt=%xx3_at
+
 ### Fixed-Point Load Instructions
 
 PLBZ            000001 10 0--.-- .................. \
@@ -145,6 +151,39 @@ PMXVI16GER2S    000001 11 1001 -- - - pmsk:2 ------ ........       \
 PMXVI16GER2SPP  000001 11 1001 -- - - pmsk:2 ------ ........       \
                 111011 ... -- ..... ..... 00101010 ..-  @MMIRR_XX3
 
+PMXVF16GER2     000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 00010011 ..-  @MMIRR_XX3
+PMXVF16GER2PP   000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 00010010 ..-  @MMIRR_XX3
+PMXVF16GER2PN   000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 10010010 ..-  @MMIRR_XX3
+PMXVF16GER2NP   000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 01010010 ..-  @MMIRR_XX3
+PMXVF16GER2NN   000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 11010010 ..-  @MMIRR_XX3
+
+PMXVF32GER      000001 11 1001 -- - - -------- .... ymsk:4 \
+                111011 ... -- ..... ..... 00011011 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa
+PMXVF32GERPP    000001 11 1001 -- - - -------- .... ymsk:4 \
+                111011 ... -- ..... ..... 00011010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa
+PMXVF32GERPN    000001 11 1001 -- - - -------- .... ymsk:4 \
+                111011 ... -- ..... ..... 10011010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa
+PMXVF32GERNP    000001 11 1001 -- - - -------- .... ymsk:4 \
+                111011 ... -- ..... ..... 01011010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa
+PMXVF32GERNN    000001 11 1001 -- - - -------- .... ymsk:4 \
+                111011 ... -- ..... ..... 11011010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa
+
+PMXVF64GER      000001 11 1001 -- - - -------- .... ymsk:2 -- \
+                111011 ... -- ....0 ..... 00111011 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa_pair
+PMXVF64GERPP    000001 11 1001 -- - - -------- .... ymsk:2 -- \
+                111011 ... -- ....0 ..... 00111010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa_pair
+PMXVF64GERPN    000001 11 1001 -- - - -------- .... ymsk:2 -- \
+                111011 ... -- ....0 ..... 10111010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa_pair
+PMXVF64GERNP    000001 11 1001 -- - - -------- .... ymsk:2 -- \
+                111011 ... -- ....0 ..... 01111010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa_pair
+PMXVF64GERNN    000001 11 1001 -- - - -------- .... ymsk:2 -- \
+                111011 ... -- ....0 ..... 11111010 ..-  @MMIRR_XX3_NO_P xa=%xx3_xa_pair
+
 ### Prefixed No-operation Instruction
 
 @PNOP           000001 11 0000-- 000000000000000000     \
diff --git a/target/ppc/translate/vsx-impl.c.inc b/target/ppc/translate/vsx-impl.c.inc
index 9285e27159..06f5c1220d 100644
--- a/target/ppc/translate/vsx-impl.c.inc
+++ b/target/ppc/translate/vsx-impl.c.inc
@@ -2867,6 +2867,22 @@ static bool do_ger_MMIRR_XX3(DisasContext *ctx, arg_MMIRR_XX3 *a, uint32_t op,
     return true;
 
 }
+
+static bool do_ger_MMIRR_XX3_NO_PMSK(DisasContext *ctx, arg_MMIRR_XX3_NO_P *a,
+                                     int op_flag, void (*helper)(TCGv_env,
+                                     TCGv_i32, TCGv_i32, TCGv_i32,
+                                     TCGv_i32, TCGv_i32))
+{
+    arg_MMIRR_XX3 x;
+    x.xa = a->xa;
+    x.xb = a->xb;
+    x.xt = a->xt;
+    x.pmsk = 0x1;
+    x.ymsk = a->ymsk;
+    x.xmsk = a->xmsk;
+    return do_ger_MMIRR_XX3(ctx, &x, op_flag, helper);
+}
+
 static bool do_ger_XX3(DisasContext *ctx, arg_XX3 *a, uint32_t op_flags,
                        void (*helper)(TCGv_env, TCGv_i32, TCGv_i32,
                                       TCGv_i32, TCGv_i32, TCGv_i32))
@@ -2935,6 +2951,23 @@ TRANS(XVF64GERPN, do_ger_XX3, GER_PN, gen_helper_XVF64GER)
 TRANS(XVF64GERNP, do_ger_XX3, GER_NP, gen_helper_XVF64GER)
 TRANS(XVF64GERNN, do_ger_XX3, GER_NN, gen_helper_XVF64GER)
 
+TRANS64(PMXVF16GER2, do_ger_MMIRR_XX3, GER_NOP, gen_helper_XVF16GER2)
+TRANS64(PMXVF16GER2PP, do_ger_MMIRR_XX3, GER_PP, gen_helper_XVF16GER2)
+TRANS64(PMXVF16GER2PN, do_ger_MMIRR_XX3, GER_PN, gen_helper_XVF16GER2)
+TRANS64(PMXVF16GER2NP, do_ger_MMIRR_XX3, GER_NP, gen_helper_XVF16GER2)
+TRANS64(PMXVF16GER2NN, do_ger_MMIRR_XX3, GER_NN, gen_helper_XVF16GER2)
+
+TRANS64(PMXVF32GER, do_ger_MMIRR_XX3_NO_PMSK, GER_NOP, gen_helper_XVF32GER)
+TRANS64(PMXVF32GERPP, do_ger_MMIRR_XX3_NO_PMSK, GER_PP, gen_helper_XVF32GER)
+TRANS64(PMXVF32GERPN, do_ger_MMIRR_XX3_NO_PMSK, GER_PN, gen_helper_XVF32GER)
+TRANS64(PMXVF32GERNP, do_ger_MMIRR_XX3_NO_PMSK, GER_NP, gen_helper_XVF32GER)
+TRANS64(PMXVF32GERNN, do_ger_MMIRR_XX3_NO_PMSK, GER_NN, gen_helper_XVF32GER)
+
+TRANS64(PMXVF64GER, do_ger_MMIRR_XX3_NO_PMSK, GER_NOP, gen_helper_XVF64GER)
+TRANS64(PMXVF64GERPP, do_ger_MMIRR_XX3_NO_PMSK, GER_PP, gen_helper_XVF64GER)
+TRANS64(PMXVF64GERPN, do_ger_MMIRR_XX3_NO_PMSK, GER_PN, gen_helper_XVF64GER)
+TRANS64(PMXVF64GERNP, do_ger_MMIRR_XX3_NO_PMSK, GER_NP, gen_helper_XVF64GER)
+TRANS64(PMXVF64GERNN, do_ger_MMIRR_XX3_NO_PMSK, GER_NN, gen_helper_XVF64GER)
 
 #undef GER_NOP
 #undef GER_PP
-- 
2.31.1




* [RFC PATCH 7/7] target/ppc: Implemented [pm]xvbf16ger2*
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
                   ` (5 preceding siblings ...)
  2022-04-26 12:50 ` [RFC PATCH 6/7] target/ppc: Implemented pmxvf*ger* Lucas Mateus Castro(alqotel)
@ 2022-04-26 12:50 ` Lucas Mateus Castro(alqotel)
  2022-04-27  6:21 ` [RFC PATCH 0/7] VSX MMA Implementation Joel Stanley
  2022-04-28 14:05 ` Lucas Mateus Martins Araujo e Castro
  8 siblings, 0 replies; 21+ messages in thread
From: Lucas Mateus Castro(alqotel) @ 2022-04-26 12:50 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, Greg Kurz,
	open list:All patches CC here, Lucas Mateus Castro (alqotel),
	Cédric Le Goater, David Gibson

From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>

Implement the following PowerISA v3.1 instructions:
xvbf16ger2:   VSX Vector bfloat16 GER (rank-2 update)
xvbf16ger2nn: VSX Vector bfloat16 GER (rank-2 update) Negative multiply,
Negative accumulate
xvbf16ger2np: VSX Vector bfloat16 GER (rank-2 update) Negative multiply,
Positive accumulate
xvbf16ger2pn: VSX Vector bfloat16 GER (rank-2 update) Positive multiply,
Negative accumulate
xvbf16ger2pp: VSX Vector bfloat16 GER (rank-2 update) Positive multiply,
Positive accumulate
pmxvbf16ger2:   Prefixed Masked VSX Vector bfloat16 GER (rank-2 update)
pmxvbf16ger2nn: Prefixed Masked VSX Vector bfloat16 GER (rank-2 update)
Negative multiply, Negative accumulate
pmxvbf16ger2np: Prefixed Masked VSX Vector bfloat16 GER (rank-2 update)
Negative multiply, Positive accumulate
pmxvbf16ger2pn: Prefixed Masked VSX Vector bfloat16 GER (rank-2 update)
Positive multiply, Negative accumulate
pmxvbf16ger2pp: Prefixed Masked VSX Vector bfloat16 GER (rank-2 update)
Positive multiply, Positive accumulate

Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
---
There's a discrepancy between this implementation and mambo/the
hardware: implementing it with float32_mul followed by float32_muladd
gives the wrong sign for some zero or infinite results, while doing the
multiplication and the muladd on FloatParts64 gives a different result
whenever the first multiplication would have underflowed had it been
rounded to 32 bits. I have not been able to solve this.
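
The underflow half of that discrepancy can be illustrated outside of QEMU with
plain C. The sketch below does not reproduce what mambo or the hardware
does; it only shows that rounding the first product to 32 bits before the
add can give a different final value than rounding once at the end. The
function names are hypothetical, and it assumes IEEE 754 binary32/binary64
with subnormals enabled, round-to-nearest-even, float expressions evaluated
in float precision, and no -ffast-math.

```c
/*
 * Illustrative only; both function names are hypothetical and this is not
 * the softfloat code path used by the patch.
 */

/* Round the first product to 32 bits, then add the second product. */
static float sum_round_first(float a, float b, float c, float d)
{
    volatile float p = a * b;   /* rounded to binary32 here */
    double s = (double)p + (double)c * (double)d;
    return (float)s;            /* second rounding to binary32 */
}

/* Keep both products wide and round to 32 bits only once, at the end. */
static float sum_round_once(float a, float b, float c, float d)
{
    double s = (double)a * (double)b + (double)c * (double)d;
    return (float)s;
}
```

For a = 0.75, b = 2^-149, c = 0.5, d = 2^-149, the first product is not
representable in binary32 and rounds up to 2^-149, so sum_round_first()
yields 2^-148 while sum_round_once() yields 2^-149.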
---
 target/ppc/cpu.h                    |  3 +++
 target/ppc/fpu_helper.c             |  1 +
 target/ppc/helper.h                 |  1 +
 target/ppc/insn32.decode            |  6 ++++++
 target/ppc/insn64.decode            | 11 +++++++++++
 target/ppc/translate/vsx-impl.c.inc | 12 ++++++++++++
 6 files changed, 34 insertions(+)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 91167f8cc0..10780adf65 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -225,6 +225,7 @@ typedef union _ppc_vsr_t {
     int16_t s16[8];
     int32_t s32[4];
     int64_t s64[2];
+    bfloat16 bf16[8];
     float16 f16[8];
     float32 f32[4];
     float64 f64[2];
@@ -2653,6 +2654,7 @@ static inline bool lsw_reg_in_range(int start, int nregs, int rx)
 #define VsrSW(i) s32[i]
 #define VsrD(i) u64[i]
 #define VsrSD(i) s64[i]
+#define VsrBF(i) bf16[i]
 #define VsrHF(i) f16[i]
 #define VsrSF(i) f32[i]
 #define VsrDF(i) f64[i]
@@ -2665,6 +2667,7 @@ static inline bool lsw_reg_in_range(int start, int nregs, int rx)
 #define VsrSW(i) s32[3 - (i)]
 #define VsrD(i) u64[1 - (i)]
 #define VsrSD(i) s64[1 - (i)]
+#define VsrBF(i) bf16[7 - (i)]
 #define VsrHF(i) f16[7 - (i)]
 #define VsrSF(i) f32[3 - (i)]
 #define VsrDF(i) f64[1 - (i)]
diff --git a/target/ppc/fpu_helper.c b/target/ppc/fpu_helper.c
index c3aead642a..9acba0f804 100644
--- a/target/ppc/fpu_helper.c
+++ b/target/ppc/fpu_helper.c
@@ -3583,6 +3583,7 @@ static inline bool ger_neg_acc_flag(uint32_t flag)
         do_float_check_status(env, GETPC());                            \
     }
 
+VSXGER16(helper_XVBF16GER2, bfloat16, BF)
 VSXGER16(helper_XVF16GER2, float16, HF)
 VSXGER(helper_XVF32GER, float32, SF)
 VSXGER(helper_XVF64GER, float64, DF)
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index cc59a3b71d..68748ecc03 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -540,6 +540,7 @@ DEF_HELPER_5(XXBLENDVD, void, vsr, vsr, vsr, vsr, i32)
 DEF_HELPER_6(XVI4GER8, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVI8GER4, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVI16GER2, void, env, i32, i32, i32, i32, i32)
+DEF_HELPER_6(XVBF16GER2, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVF16GER2, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVF32GER, void, env, i32, i32, i32, i32, i32)
 DEF_HELPER_6(XVF64GER, void, env, i32, i32, i32, i32, i32)
diff --git a/target/ppc/insn32.decode b/target/ppc/insn32.decode
index a204730d1d..fff6e406f0 100644
--- a/target/ppc/insn32.decode
+++ b/target/ppc/insn32.decode
@@ -736,6 +736,12 @@ XVI8GER4SPP     111011 ... -- ..... ..... 01100011 ..-  @XX3_at xa=%xx_xa
 XVI16GER2S      111011 ... -- ..... ..... 00101011 ..-  @XX3_at xa=%xx_xa
 XVI16GER2SPP    111011 ... -- ..... ..... 00101010 ..-  @XX3_at xa=%xx_xa
 
+XVBF16GER2      111011 ... -- ..... ..... 00110011 ..-  @XX3_at xa=%xx_xa
+XVBF16GER2PP    111011 ... -- ..... ..... 00110010 ..-  @XX3_at xa=%xx_xa
+XVBF16GER2PN    111011 ... -- ..... ..... 10110010 ..-  @XX3_at xa=%xx_xa
+XVBF16GER2NP    111011 ... -- ..... ..... 01110010 ..-  @XX3_at xa=%xx_xa
+XVBF16GER2NN    111011 ... -- ..... ..... 11110010 ..-  @XX3_at xa=%xx_xa
+
 XVF16GER2       111011 ... -- ..... ..... 00010011 ..-  @XX3_at xa=%xx_xa
 XVF16GER2PP     111011 ... -- ..... ..... 00010010 ..-  @XX3_at xa=%xx_xa
 XVF16GER2PN     111011 ... -- ..... ..... 10010010 ..-  @XX3_at xa=%xx_xa
diff --git a/target/ppc/insn64.decode b/target/ppc/insn64.decode
index bc5e4dfe1a..4cd6219ad5 100644
--- a/target/ppc/insn64.decode
+++ b/target/ppc/insn64.decode
@@ -151,6 +151,17 @@ PMXVI16GER2S    000001 11 1001 -- - - pmsk:2 ------ ........       \
 PMXVI16GER2SPP  000001 11 1001 -- - - pmsk:2 ------ ........       \
                 111011 ... -- ..... ..... 00101010 ..-  @MMIRR_XX3
 
+PMXVBF16GER2    000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 00110011 ..-  @MMIRR_XX3
+PMXVBF16GER2PP  000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 00110010 ..-  @MMIRR_XX3
+PMXVBF16GER2PN  000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 10110010 ..-  @MMIRR_XX3
+PMXVBF16GER2NP  000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 01110010 ..-  @MMIRR_XX3
+PMXVBF16GER2NN  000001 11 1001 -- - - pmsk:2 ------ ........ \
+                111011 ... -- ..... ..... 11110010 ..-  @MMIRR_XX3
+
 PMXVF16GER2     000001 11 1001 -- - - pmsk:2 ------ ........ \
                 111011 ... -- ..... ..... 00010011 ..-  @MMIRR_XX3
 PMXVF16GER2PP   000001 11 1001 -- - - pmsk:2 ------ ........ \
diff --git a/target/ppc/translate/vsx-impl.c.inc b/target/ppc/translate/vsx-impl.c.inc
index 06f5c1220d..bb5e6f0693 100644
--- a/target/ppc/translate/vsx-impl.c.inc
+++ b/target/ppc/translate/vsx-impl.c.inc
@@ -2933,6 +2933,12 @@ TRANS64(PMXVI16GER2SPP, do_ger_MMIRR_XX3, GER_SPP, gen_helper_XVI16GER2)
 #define GER_PN ger_pack_flags_xvf(true, false, true)
 #define GER_NN ger_pack_flags_xvf(true, true, true)
 
+TRANS(XVBF16GER2, do_ger_XX3, GER_NOP, gen_helper_XVBF16GER2)
+TRANS(XVBF16GER2PP, do_ger_XX3, GER_PP, gen_helper_XVBF16GER2)
+TRANS(XVBF16GER2PN, do_ger_XX3, GER_PN, gen_helper_XVBF16GER2)
+TRANS(XVBF16GER2NP, do_ger_XX3, GER_NP, gen_helper_XVBF16GER2)
+TRANS(XVBF16GER2NN, do_ger_XX3, GER_NN, gen_helper_XVBF16GER2)
+
 TRANS(XVF16GER2, do_ger_XX3, GER_NOP, gen_helper_XVF16GER2)
 TRANS(XVF16GER2PP, do_ger_XX3, GER_PP, gen_helper_XVF16GER2)
 TRANS(XVF16GER2PN, do_ger_XX3, GER_PN, gen_helper_XVF16GER2)
@@ -2957,6 +2963,12 @@ TRANS64(PMXVF16GER2PN, do_ger_MMIRR_XX3, GER_PN, gen_helper_XVF16GER2)
 TRANS64(PMXVF16GER2NP, do_ger_MMIRR_XX3, GER_NP, gen_helper_XVF16GER2)
 TRANS64(PMXVF16GER2NN, do_ger_MMIRR_XX3, GER_NN, gen_helper_XVF16GER2)
 
+TRANS64(PMXVBF16GER2, do_ger_MMIRR_XX3, GER_NOP, gen_helper_XVBF16GER2)
+TRANS64(PMXVBF16GER2PP, do_ger_MMIRR_XX3, GER_PP, gen_helper_XVBF16GER2)
+TRANS64(PMXVBF16GER2PN, do_ger_MMIRR_XX3, GER_PN, gen_helper_XVBF16GER2)
+TRANS64(PMXVBF16GER2NP, do_ger_MMIRR_XX3, GER_NP, gen_helper_XVBF16GER2)
+TRANS64(PMXVBF16GER2NN, do_ger_MMIRR_XX3, GER_NN, gen_helper_XVBF16GER2)
+
 TRANS64(PMXVF32GER, do_ger_MMIRR_XX3_NO_PMSK, GER_NOP, gen_helper_XVF32GER)
 TRANS64(PMXVF32GERPP, do_ger_MMIRR_XX3_NO_PMSK, GER_PP, gen_helper_XVF32GER)
 TRANS64(PMXVF32GERPN, do_ger_MMIRR_XX3_NO_PMSK, GER_PN, gen_helper_XVF32GER)
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz
  2022-04-26 12:50 ` [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz Lucas Mateus Castro(alqotel)
@ 2022-04-26 22:59   ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2022-04-26 22:59 UTC (permalink / raw)
  To: Lucas Mateus Castro(alqotel), qemu-ppc
  Cc: open list:All patches CC here, Greg Kurz,
	Daniel Henrique Barboza, Cédric Le Goater, David Gibson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
> From: "Lucas Mateus Castro (alqotel)"<lucas.araujo@eldorado.org.br>
> 
> Implement the following PowerISA v3.1 instructions:
> xxmfacc: VSX Move From Accumulator
> xxmtacc: VSX Move To Accumulator
> xxsetaccz: VSX Set Accumulator to Zero
> 
> The PowerISA 3.1 mentions that for the current version of the
> architecture, "the hardware implementation provides the effect of ACC[i]
> and VSRs 4*i to 4*i + 3 logically containing the same data" and "The
> Accumulators introduce no new logical state at this time" (page 501).
> For now it seems unnecessary to create new structures, so this patch
> just uses ACC[i] as VSRs 4*i to 4*i+3 and therefore move to and from
> accumulators are no-ops.
> 
> Signed-off-by: Lucas Mateus Castro (alqotel)<lucas.araujo@eldorado.org.br>
> ---
>   target/ppc/insn32.decode            |  9 ++++++++
>   target/ppc/translate/vsx-impl.c.inc | 36 +++++++++++++++++++++++++++++
>   2 files changed, 45 insertions(+)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


> +    TCGv_i64 zero = tcg_constant_i64(0);
> +    for (i = 0; i < 4; i++) {
> +        set_cpu_vsr(a->ra * 4 + i, zero, false);
> +        set_cpu_vsr(a->ra * 4 + i, zero, true);
> +    }

or

   tcg_gen_gvec_dup_imm(MO_64, acc_full_offset(a->ra), 64, 64, 0);
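
A minimal standalone sketch of why one 64-byte fill covers the same state as
the eight set_cpu_vsr() calls, assuming ACC[i] simply overlays VSRs 4*i to
4*i + 3. The types vsr_model and cpu_model and the functions
acc_full_offset_model() and acc_zero_model() are hypothetical stand-ins, not
QEMU code; memset() plays the role the gvec dup would have in TCG.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified register file: 64 VSRs of 128 bits each. */
typedef struct {
    uint64_t u64[2];
} vsr_model;

typedef struct {
    vsr_model vsr[64];
} cpu_model;

/* Byte offset of the first VSR backing accumulator acc_nr (VSR 4 * acc_nr). */
static size_t acc_full_offset_model(int acc_nr)
{
    assert(acc_nr >= 0 && acc_nr < 8);
    return offsetof(cpu_model, vsr) + (size_t)acc_nr * 4 * sizeof(vsr_model);
}

/* Zero all four VSRs of one accumulator with a single 64-byte fill. */
static void acc_zero_model(cpu_model *cpu, int acc_nr)
{
    memset((uint8_t *)cpu + acc_full_offset_model(acc_nr), 0,
           4 * sizeof(vsr_model));
}
```

In this model, zeroing accumulator 2 clears VSRs 8 to 11 and leaves the
neighbouring registers untouched.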


r~


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions
  2022-04-26 12:50 ` [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions Lucas Mateus Castro(alqotel)
@ 2022-04-26 23:40   ` Richard Henderson
  2022-04-27 20:24     ` Lucas Mateus Martins Araujo e Castro
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2022-04-26 23:40 UTC (permalink / raw)
  To: Lucas Mateus Castro(alqotel), qemu-ppc
  Cc: open list:All patches CC here, Greg Kurz,
	Daniel Henrique Barboza, Cédric Le Goater, David Gibson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
> +%xx_at          23:3 !function=times_4
> +@XX3_at         ...... ... .. ..... ..... ........ ...          &XX3 xt=%xx_at xb=%xx_xb

Hmm.  Depends, I suppose on whether you want acc[0-7] or vsr[0-28]

> +/*
> + * Packed VSX Integer GER Flags
> + * 00 - no accumulation no saturation
> + * 01 - accumulate but no saturation
> + * 10 - no accumulation but with saturation
> + * 11 - accumulate with saturation
> + */
> +static inline bool get_sat(uint32_t flags)
> +{
> +    return flags & 0x2;
> +}
> +
> +static inline bool get_acc(uint32_t flags)
> +{
> +    return flags & 0x1;
> +}

Better to have separate helpers for these?  They'd be immediate operands to the function 
replacing XVIGER (see below) and thus optimize well.

> +#define GET_VsrN(a, i) (extract32(a->VsrB((i) / 2), (i) % 2 ? 4 : 0, 4))
> +#define GET_VsrB(a, i) a->VsrB(i)
> +#define GET_VsrH(a, i) a->VsrH(i)
> +
> +#define GET_VsrSN(a, i) (sextract32(a->VsrSB((i) / 2), (i) % 2 ? 4 : 0, 4))
> +#define GET_VsrSB(a, i) a->VsrSB(i)
> +#define GET_VsrSH(a, i) a->VsrSH(i)

These can be made into functions of the form

     typedef int32_t xviger_extract(ppc_vsr_t *a, int i);


> +#define XVIGER(NAME, RANK, EL)                                                 \
> +    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,                    \
> +              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)            \
> +    {                                                                          \
> +        ppc_vsr_t *a = cpu_vsr_ptr(env, a_r), *b = cpu_vsr_ptr(env, b_r), *at; \
> +        bool sat = get_sat(packed_flags), acc = get_acc(packed_flags);         \
> +        uint8_t pmsk = ger_get_pmsk(mask), xmsk = ger_get_xmsk(mask),          \
> +                ymsk = ger_get_ymsk(mask);                                     \
> +        uint8_t pmsk_bit, xmsk_bit, ymsk_bit;                                  \
> +        int64_t psum;                                                          \
> +        int32_t va, vb;                                                        \
> +        int i, j, k;                                                           \
> +        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {           \
> +            at = cpu_vsr_ptr(env, at_r + i);                                   \
> +            for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {       \
> +                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {                  \
> +                    psum = 0;                                                  \
> +                    for (k = 0, pmsk_bit = 1 << (RANK - 1); k < RANK;          \
> +                         k++, pmsk_bit >>= 1) {                                \
> +                        if (pmsk_bit & pmsk) {                                 \
> +                            va = (int32_t)GET_VsrS##EL(a, RANK * i + k);       \
> +                            vb = (int32_t) ((RANK == 4) ?                      \
> +                                                GET_Vsr##EL(b, RANK * j + k) : \
> +                                                GET_VsrS##EL(b, RANK * j + k));\
> +                            psum += va * vb;                                   \
> +                        }                                                      \
> +                    }                                                          \
> +                    if (acc) {                                                 \
> +                        psum += at->VsrSW(j);                                  \
> +                    }                                                          \
> +                    if (sat && psum > INT32_MAX) {                             \
> +                        set_vscr_sat(env);                                     \
> +                        at->VsrSW(j) = INT32_MAX;                              \
> +                    } else if (sat && psum < INT32_MIN) {                      \
> +                        set_vscr_sat(env);                                     \
> +                        at->VsrSW(j) = INT32_MIN;                              \
> +                    } else {                                                   \
> +                        at->VsrSW(j) = (int32_t) psum;                         \
> +                    }                                                          \
> +                } else {                                                       \
> +                    at->VsrSW(j) = 0;                                          \
> +                }                                                              \
> +            }                                                                  \
> +        }                                                                      \
> +    }

... which means that this monster can be a function instead of a non-debuggable macro.
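
A minimal standalone sketch of the per-cell computation written as a plain
function with extraction callbacks. Assumptions: vec128 is a hypothetical
stand-in for ppc_vsr_t that ignores host endianness, get_s8()/get_u8() and
ger_cell() are illustrative names, and the saturation flag is returned
through a pointer instead of calling set_vscr_sat().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ppc_vsr_t: 16 raw bytes, element 0 first. */
typedef struct {
    uint8_t b[16];
} vec128;

/* Callback type in the spirit of the suggested xviger_extract typedef. */
typedef int32_t ger_extract(const vec128 *v, int i);

static int32_t get_s8(const vec128 *v, int i) { return (int8_t)v->b[i]; }
static int32_t get_u8(const vec128 *v, int i) { return v->b[i]; }

/*
 * Compute one accumulator cell: the sum of up to 'rank' products,
 * optionally added to the old cell value and optionally saturated.
 * pmsk is consumed most-significant bit first, as in the quoted macro.
 */
static int32_t ger_cell(const vec128 *a, const vec128 *b, int i, int j,
                        int rank, uint8_t pmsk,
                        ger_extract *ext_a, ger_extract *ext_b,
                        bool acc, bool sat, int32_t old, bool *saturated)
{
    int64_t psum = 0;
    int k;

    assert(rank > 0 && rank * (i + 1) <= 16 && rank * (j + 1) <= 16);
    for (k = 0; k < rank; k++) {
        if (pmsk & (1u << (rank - 1 - k))) {
            psum += (int64_t)ext_a(a, rank * i + k) * ext_b(b, rank * j + k);
        }
    }
    if (acc) {
        psum += old;
    }
    if (sat && psum > INT32_MAX) {
        *saturated = true;
        return INT32_MAX;
    }
    if (sat && psum < INT32_MIN) {
        *saturated = true;
        return INT32_MIN;
    }
    return (int32_t)psum;
}
```

Because acc and sat arrive as ordinary arguments, a thin wrapper per
instruction variant can pass them as constants and let the compiler fold the
branches away.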

> diff --git a/target/ppc/internal.h b/target/ppc/internal.h
> index 8094e0b033..a994d98238 100644
> --- a/target/ppc/internal.h
> +++ b/target/ppc/internal.h
> @@ -291,4 +291,32 @@ G_NORETURN void ppc_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
>                                               uintptr_t retaddr);
>   #endif
>   
> +/*
> + * Auxiliary functions to pack/unpack masks for GER instructions.
> + *
> + * Packed format:
> + *  Bits 0-3: xmsk
> + *  Bits 4-7: ymsk
> + *  Bits 8-15: pmsk
> + */
> +static inline uint8_t ger_get_xmsk(uint32_t packed_masks)
> +{
> +    return packed_masks & 0xF;
> +}
> +
> +static inline uint8_t ger_get_ymsk(uint32_t packed_masks)
> +{
> +    return (packed_masks >> 4) & 0xF;
> +}
> +
> +static inline uint8_t ger_get_pmsk(uint32_t packed_masks)
> +{
> +    return (packed_masks >> 8) & 0xFF;
> +}
> +
> +static inline int ger_pack_masks(int pmsk, int ymsk, int xmsk)
> +{
> +    return (pmsk & 0xFF) << 8 | (ymsk & 0xF) << 4 | (xmsk & 0xF);
> +}

Use hw/registerfields.h.  C.f. PREDDESC in target/arm/internals.h.
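
A minimal standalone sketch of what the mask packing could look like in that
style. Assumptions: the FIELD, FIELD_EX32 and FIELD_DP32 macros below are
simplified local stand-ins for the ones in hw/registerfields.h (so that the
example compiles on its own), and the GER_MSK register name is hypothetical.
The field layout is the one from the quoted comment.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for QEMU's extract32() and deposit32(). */
static inline uint32_t extract32(uint32_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 32 - start);
    return (value >> start) & (~0U >> (32 - length));
}

static inline uint32_t deposit32(uint32_t value, int start, int length,
                                 uint32_t fieldval)
{
    uint32_t mask;

    assert(start >= 0 && length > 0 && length <= 32 - start);
    mask = (~0U >> (32 - length)) << start;
    return (value & ~mask) | ((fieldval << start) & mask);
}

/* Simplified stand-ins for the registerfields.h macros. */
#define FIELD(reg, field, shift, length)                 \
    enum { R_##reg##_##field##_SHIFT = (shift),          \
           R_##reg##_##field##_LENGTH = (length) };
#define FIELD_EX32(storage, reg, field)                  \
    extract32((storage), R_##reg##_##field##_SHIFT,      \
              R_##reg##_##field##_LENGTH)
#define FIELD_DP32(storage, reg, field, val)             \
    deposit32((storage), R_##reg##_##field##_SHIFT,      \
              R_##reg##_##field##_LENGTH, (val))

/* Same layout as the quoted comment: xmsk 0-3, ymsk 4-7, pmsk 8-15. */
FIELD(GER_MSK, XMSK, 0, 4)
FIELD(GER_MSK, YMSK, 4, 4)
FIELD(GER_MSK, PMSK, 8, 8)

static inline uint32_t ger_pack_masks(int pmsk, int ymsk, int xmsk)
{
    uint32_t msk = 0;

    msk = FIELD_DP32(msk, GER_MSK, XMSK, xmsk);
    msk = FIELD_DP32(msk, GER_MSK, YMSK, ymsk);
    msk = FIELD_DP32(msk, GER_MSK, PMSK, pmsk);
    return msk;
}
```

With the layout declared once, the three ger_get_*msk() accessors reduce to
FIELD_EX32() calls and the shift/mask constants are no longer repeated by
hand.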

> +static bool do_ger_XX3(DisasContext *ctx, arg_XX3 *a, uint32_t op,
> +                             void (*helper)(TCGv_env, TCGv_i32, TCGv_i32,
> +                                            TCGv_i32, TCGv_i32, TCGv_i32))
> +{
> +    uint32_t mask;
> +    REQUIRE_INSNS_FLAGS2(ctx, ISA310);
> +    REQUIRE_VSX(ctx);
> +    if (unlikely((a->xa / 4 == a->xt / 4) || (a->xb / 4 == a->xt / 4))) {
> +        gen_invalid(ctx);
> +        return true;
> +    }
> +
> +    mask = 0xFFFFFFFF;
> +    helper(cpu_env, tcg_constant_i32(a->xa), tcg_constant_i32(a->xb),
> +           tcg_constant_i32(a->xt), tcg_constant_i32(mask),
> +           tcg_constant_i32(op));
> +    return true;
> +}

Why are you passing register numbers instead of pointers, like everywhere else?


r~


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/7] target/ppc: Implemented xvf*ger*
  2022-04-26 12:50 ` [RFC PATCH 4/7] target/ppc: Implemented xvf*ger* Lucas Mateus Castro(alqotel)
@ 2022-04-27  0:09   ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2022-04-27  0:09 UTC (permalink / raw)
  To: Lucas Mateus Castro(alqotel), qemu-ppc
  Cc: open list:All patches CC here, Greg Kurz,
	Daniel Henrique Barboza, Cédric Le Goater, David Gibson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
> +#define VSXGER(NAME, TYPE, EL)                                          \
> +    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,             \
> +              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)     \
> +    {                                                                   \
> +        ppc_vsr_t *a, *b, *at;                                          \
> +        TYPE aux_acc, va, vb;                                           \
> +        int i, j, xmsk_bit, ymsk_bit, op_flags;                         \
> +        uint8_t xmsk = mask & 0x0F;                                     \
> +        uint8_t ymsk = (mask >> 4) & 0x0F;                              \
> +        int ymax = MIN(4, 128 / (sizeof(TYPE) * 8));                    \
> +        b = cpu_vsr_ptr(env, b_r);                                      \
> +        float_status *excp_ptr = &env->fp_status;                       \
> +        bool acc = ger_acc_flag(packed_flags);                          \
> +        bool neg_acc = ger_neg_acc_flag(packed_flags);                  \
> +        bool neg_mul = ger_neg_mul_flag(packed_flags);                  \
> +        helper_reset_fpstatus(env);                                     \
> +        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {    \
> +            a = cpu_vsr_ptr(env, a_r + i / ymax);                       \
> +            at = cpu_vsr_ptr(env, at_r + i);                            \
> +            for (j = 0, ymsk_bit = 1 << (ymax - 1); j < ymax;           \
> +                 j++, ymsk_bit >>= 1) {                                 \
> +                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {           \
> +                    op_flags = (neg_acc ^ neg_mul) ?                    \
> +                                          float_muladd_negate_c : 0;    \
> +                    op_flags |= (neg_mul) ?                             \
> +                                     float_muladd_negate_result : 0;    \

There's no need to compute op_flags in the inner loop.
Indeed, probably better to compute it in translation.
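
A minimal standalone sketch of computing the flags once per instruction
instead of once per element. Assumptions: MULADD_NEGATE_C and
MULADD_NEGATE_RESULT are hypothetical local constants that only mirror the
roles of float_muladd_negate_c and float_muladd_negate_result, and
muladd_model() uses plain double arithmetic as a stand-in for
float32_muladd(), so it models the sign handling only, not rounding or
exception flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real softfloat values may differ. */
enum {
    MULADD_NEGATE_C      = 1,
    MULADD_NEGATE_RESULT = 2,
};

/* Same selection logic as the quoted macro, evaluated once. */
static int ger_muladd_flags(bool neg_mul, bool neg_acc)
{
    int flags = 0;

    if (neg_mul != neg_acc) {
        flags |= MULADD_NEGATE_C;
    }
    if (neg_mul) {
        flags |= MULADD_NEGATE_RESULT;
    }
    return flags;
}

/* Sign-only model of a flagged multiply-add. */
static double muladd_model(double a, double b, double c, int flags)
{
    double r;

    assert((flags & ~(MULADD_NEGATE_C | MULADD_NEGATE_RESULT)) == 0);
    if (flags & MULADD_NEGATE_C) {
        c = -c;
    }
    r = a * b + c;
    return (flags & MULADD_NEGATE_RESULT) ? -r : r;
}
```

For a = 2, b = 3, c = 10 the four variants give 16 (PP), -4 (PN), 4 (NP)
and -16 (NN) in this model, matching a*b + c, a*b - c, -(a*b) + c and
-(a*b) - c.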

This macro is trickier than the integer to turn into a function, however,

> +                    va = a->Vsr##EL(i % ymax);                          \
> +                    vb = b->Vsr##EL(j);                                 \
> +                    aux_acc = at->Vsr##EL(j);                           \
> +                    if (acc) {                                          \
> +                        at->Vsr##EL(j) = TYPE##_muladd(va, vb, aux_acc, \
> +                                                       op_flags,        \
> +                                                       excp_ptr);       \
> +                    } else {                                            \
> +                        at->Vsr##EL(j) = TYPE##_mul(va, vb, excp_ptr);  \
> +                    }                                                   \
> +                } else {                                                \
> +                    at->Vsr##EL(j) = 0;                                 \
> +                }                                                       \

static void vsxger_zero_f(ppc_vsr_t *a, int j)
{
     a->VsrSF(j) = float32_zero;
}

static void vsxger_mul_f(ppc_vsr_t *d, ppc_vsr_t *a, ppc_vsr_t *b,
                          int i, int j, int flags, float_status *s)
{
     float32 af = a->VsrSF(i);
     float32 bf = b->VsrSF(j);
     d->VsrSF(j) = float32_mul(af, bf, s);
}

static void vsxger_mac_f(ppc_vsr_t *d, ppc_vsr_t *a, ppc_vsr_t *b,
                          int i, int j, int flags, float_status *s)
{
     float32 af = a->VsrSF(i);
     float32 bf = b->VsrSF(j);
     float32 cf = d->VsrSF(j);
     d->VsrSF(j) = float32_muladd(af, bf, cf, flags, s);
}

is probably a good place to start for callbacks.


r~


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 5/7] target/ppc: Implemented xvf16ger*
  2022-04-26 12:50 ` [RFC PATCH 5/7] target/ppc: Implemented xvf16ger* Lucas Mateus Castro(alqotel)
@ 2022-04-27  0:26   ` Richard Henderson
  2022-04-27 21:11     ` Lucas Mateus Martins Araujo e Castro
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2022-04-27  0:26 UTC (permalink / raw)
  To: Lucas Mateus Castro(alqotel), qemu-ppc
  Cc: Peter Maydell, Daniel Henrique Barboza, Greg Kurz,
	open list:All patches CC here, Cédric Le Goater,
	Alex Bennée, Aurelien Jarno, David Gibson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
> +#define VSXGER16(NAME, ORIG_T, OR_EL)                                   \
> +    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,             \
> +              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)     \
> +    {                                                                   \
> +        ppc_vsr_t *at;                                                  \
> +        float32 psum, aux_acc, va, vb, vc, vd;                          \
> +        int i, j, xmsk_bit, ymsk_bit;                                   \
> +        uint8_t xmsk = mask & 0x0F;                                     \
> +        uint8_t ymsk = (mask >> 4) & 0x0F;                              \
> +        uint8_t pmsk = (mask >> 8) & 0x3;                               \
> +        ppc_vsr_t *b = cpu_vsr_ptr(env, b_r);                           \
> +        ppc_vsr_t *a = cpu_vsr_ptr(env, a_r);                           \
> +        float_status *excp_ptr = &env->fp_status;                       \
> +        bool acc = ger_acc_flag(packed_flags);                          \
> +        bool neg_acc = ger_neg_acc_flag(packed_flags);                  \
> +        bool neg_mul = ger_neg_mul_flag(packed_flags);                  \
> +        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {    \
> +            at = cpu_vsr_ptr(env, at_r + i);                            \
> +            for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {\
> +                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {           \
> +                    va = !(pmsk & 2) ? float32_zero :                   \
> +                                       GET_VSR(Vsr##OR_EL, a,           \
> +                                               2 * i, ORIG_T, float32); \
> +                    vb = !(pmsk & 2) ? float32_zero :                   \
> +                                       GET_VSR(Vsr##OR_EL, b,           \
> +                                               2 * j, ORIG_T, float32); \
> +                    vc = !(pmsk & 1) ? float32_zero :                   \
> +                                       GET_VSR(Vsr##OR_EL, a,           \
> +                                            2 * i + 1, ORIG_T, float32);\
> +                    vd = !(pmsk & 1) ? float32_zero :                   \
> +                                       GET_VSR(Vsr##OR_EL, b,           \
> +                                            2 * j + 1, ORIG_T, float32);\
> +                    psum = float32_mul(va, vb, excp_ptr);               \
> +                    psum = float32_muladd(vc, vd, psum, 0, excp_ptr);   \

This isn't correct -- the intermediate 'prod' (the first multiply) is not rounded.  I 
think the correct way to implement this (barring new softfloat functions) is to compute 
the intermediate product as float64 with float_round_to_odd, then float64r32_muladd into 
the correct rounding mode to finish.

> +                    if (acc) {                                          \
> +                        if (neg_mul) {                                  \
> +                            psum = float32_neg(psum);                   \
> +                        }                                               \
> +                        if (neg_acc) {                                  \
> +                            aux_acc = float32_neg(at->VsrSF(j));        \
> +                        } else {                                        \
> +                            aux_acc = at->VsrSF(j);                     \
> +                        }                                               \
> +                        at->VsrSF(j) = float32_add(psum, aux_acc,       \
> +                                                   excp_ptr);           \

This one, thankfully, uses the rounded intermediate result 'msum', so is ok.

Please do convert this from a macro.  Given that float16 and bfloat16 are addressed the 
same, I think the only callback you need is the conversion from float16_to_float64.  Drop 
the bf16 accessor to ppc_vsr_t.


r~


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 6/7] target/ppc: Implemented pmxvf*ger*
  2022-04-26 12:50 ` [RFC PATCH 6/7] target/ppc: Implemented pmxvf*ger* Lucas Mateus Castro(alqotel)
@ 2022-04-27  0:33   ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2022-04-27  0:33 UTC (permalink / raw)
  To: Lucas Mateus Castro(alqotel), qemu-ppc
  Cc: open list:All patches CC here, Greg Kurz,
	Daniel Henrique Barboza, Cédric Le Goater, David Gibson

On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
> +&MMIRR_XX3_NO_P xa xb xt xmsk ymsk

Don't create this...

> +@MMIRR_XX3_NO_P ...... .. .... .. . . ........ xmsk:4 .... \
> +                ...... ... .. ..... ..... ........ ... \
> +                &MMIRR_XX3_NO_P xb=%xx3_xb xt=%xx3_at

just set pmsk=1 here instead...

> +static bool do_ger_MMIRR_XX3_NO_PMSK(DisasContext *ctx, arg_MMIRR_XX3_NO_P *a,
> +                                     int op_flag, void (*helper)(TCGv_env,
> +                                     TCGv_i32, TCGv_i32, TCGv_i32,
> +                                     TCGv_i32, TCGv_i32))
> +{
> +    arg_MMIRR_XX3 x;
> +    x.xa = a->xa;
> +    x.xb = a->xb;
> +    x.xt = a->xt;
> +    x.pmsk = 0x1;
> +    x.ymsk = a->ymsk;
> +    x.xmsk = a->xmsk;
> +    return do_ger_MMIRR_XX3(ctx, &x, op_flag, helper);
> +}

so you can drop this.


r~


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/7] VSX MMA Implementation
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
                   ` (6 preceding siblings ...)
  2022-04-26 12:50 ` [RFC PATCH 7/7] target/ppc: Implemented [pm]xvbf16ger2* Lucas Mateus Castro(alqotel)
@ 2022-04-27  6:21 ` Joel Stanley
  2022-04-27  7:10   ` Cédric Le Goater
  2022-04-28 14:05 ` Lucas Mateus Martins Araujo e Castro
  8 siblings, 1 reply; 21+ messages in thread
From: Joel Stanley @ 2022-04-27  6:21 UTC (permalink / raw)
  To: Lucas Mateus Castro(alqotel)
  Cc: Daniel Henrique Barboza, Richard Henderson, QEMU Developers,
	Greg Kurz, qemu-ppc, Alex Bennée, David Gibson

On Tue, 26 Apr 2022 at 12:51, Lucas Mateus Castro(alqotel)
<lucas.araujo@eldorado.org.br> wrote:
>
> From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>
>
> This patch series is an RFC of the Matrix-Multiply Assist (MMA)
> instructions implementation from the PowerISA 3.1
>
> These and the VDIV/VMOD implementation are the last new PowerISA 3.1
> instructions left to be implemented.
>
> Thanks
> Lucas Mateus Castro (alqotel) (7):
>   target/ppc: Implement xxm[tf]acc and xxsetaccz
>   target/ppc: Implemented xvi*ger* instructions
>   target/ppc: Implemented pmxvi*ger* instructions
>   target/ppc: Implemented xvf*ger*
>   target/ppc: Implemented xvf16ger*
>   target/ppc: Implemented pmxvf*ger*
>   target/ppc: Implemented [pm]xvbf16ger2*

I have a small test case for the MMA instructions that Alistair wrote
a while back[1]. It passes when run with these patches applied
(previously it would sigill).

$ qemu-ppc64le -cpu power10  -L ~/ppc64le/ ./test -m
Smoke test MMA
MMA[0] = 1 (Correct)
MMA[1] = 2 (Correct)
MMA[2] = 3 (Correct)
MMA[3] = 4 (Correct)
MMA[4] = 2 (Correct)
MMA[5] = 4 (Correct)
MMA[6] = 6 (Correct)
MMA[7] = 8 (Correct)
MMA[8] = 3 (Correct)
MMA[9] = 6 (Correct)
MMA[10] = 9 (Correct)
MMA[11] = 12 (Correct)
MMA[12] = 4 (Correct)
MMA[13] = 8 (Correct)
MMA[14] = 12 (Correct)
MMA[15] = 16 (Correct)

[1] https://github.com/shenki/p10_tests


>
>  include/fpu/softfloat.h             |   9 ++
>  target/ppc/cpu.h                    |  15 +++
>  target/ppc/fpu_helper.c             | 130 ++++++++++++++++++
>  target/ppc/helper.h                 |   7 +
>  target/ppc/insn32.decode            |  49 +++++++
>  target/ppc/insn64.decode            |  80 +++++++++++
>  target/ppc/int_helper.c             |  85 ++++++++++++
>  target/ppc/internal.h               |  28 ++++
>  target/ppc/translate/vsx-impl.c.inc | 200 ++++++++++++++++++++++++++++
>  9 files changed, 603 insertions(+)
>
> --
> 2.31.1
>
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/7] VSX MMA Implementation
  2022-04-27  6:21 ` [RFC PATCH 0/7] VSX MMA Implementation Joel Stanley
@ 2022-04-27  7:10   ` Cédric Le Goater
  2022-05-05  6:06     ` Joel Stanley
  0 siblings, 1 reply; 21+ messages in thread
From: Cédric Le Goater @ 2022-04-27  7:10 UTC (permalink / raw)
  To: Joel Stanley, Lucas Mateus Castro(alqotel)
  Cc: Leandro Lupori, Daniel Henrique Barboza, Richard Henderson,
	QEMU Developers, Greg Kurz, qemu-ppc, Matheus Ferst,
	David Gibson

Hello,

On 4/27/22 08:21, Joel Stanley wrote:
> On Tue, 26 Apr 2022 at 12:51, Lucas Mateus Castro(alqotel)
> <lucas.araujo@eldorado.org.br> wrote:
>>
>> From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>
>>
>> This patch series is an RFC of the Matrix-Multiply Assist (MMA)
>> instructions implementation from the PowerISA 3.1
>>
>> These and the VDIV/VMOD implementation are the last new PowerISA 3.1
>> instructions left to be implemented.
>>
>> Thanks
>> Lucas Mateus Castro (alqotel) (7):
>>    target/ppc: Implement xxm[tf]acc and xxsetaccz
>>    target/ppc: Implemented xvi*ger* instructions
>>    target/ppc: Implemented pmxvi*ger* instructions
>>    target/ppc: Implemented xvf*ger*
>>    target/ppc: Implemented xvf16ger*
>>    target/ppc: Implemented pmxvf*ger*
>>    target/ppc: Implemented [pm]xvbf16ger2*
> 
> I have a small test case for the MMA instructions that Alistair wrote
> a while back[1]. It passes when run with these patches applied
> (previously it would sigill).

Could we have your Tested-by then ?


> 
> $ qemu-ppc64le -cpu power10  -L ~/ppc64le/ ./test -m
> Smoke test MMA
> MMA[0] = 1 (Correct)
> MMA[1] = 2 (Correct)
> MMA[2] = 3 (Correct)
> MMA[3] = 4 (Correct)
> MMA[4] = 2 (Correct)
> MMA[5] = 4 (Correct)
> MMA[6] = 6 (Correct)
> MMA[7] = 8 (Correct)
> MMA[8] = 3 (Correct)
> MMA[9] = 6 (Correct)
> MMA[10] = 9 (Correct)
> MMA[11] = 12 (Correct)
> MMA[12] = 4 (Correct)
> MMA[13] = 8 (Correct)
> MMA[14] = 12 (Correct)
> MMA[15] = 16 (Correct)
> 
> [1] https://github.com/shenki/p10_tests

Looks like a good candidate for tests/tcg/ppc64le/. Adding Matheus and Leandro.

Thanks,

C.



> 
> 
>>
>>   include/fpu/softfloat.h             |   9 ++
>>   target/ppc/cpu.h                    |  15 +++
>>   target/ppc/fpu_helper.c             | 130 ++++++++++++++++++
>>   target/ppc/helper.h                 |   7 +
>>   target/ppc/insn32.decode            |  49 +++++++
>>   target/ppc/insn64.decode            |  80 +++++++++++
>>   target/ppc/int_helper.c             |  85 ++++++++++++
>>   target/ppc/internal.h               |  28 ++++
>>   target/ppc/translate/vsx-impl.c.inc | 200 ++++++++++++++++++++++++++++
>>   9 files changed, 603 insertions(+)
>>
>> --
>> 2.31.1
>>
>>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions
  2022-04-26 23:40   ` Richard Henderson
@ 2022-04-27 20:24     ` Lucas Mateus Martins Araujo e Castro
  2022-04-27 22:28       ` Richard Henderson
  0 siblings, 1 reply; 21+ messages in thread
From: Lucas Mateus Martins Araujo e Castro @ 2022-04-27 20:24 UTC (permalink / raw)
  To: Richard Henderson, qemu-ppc
  Cc: open list:All patches CC here, Greg Kurz,
	Daniel Henrique Barboza, Cédric Le Goater, David Gibson



On 26/04/2022 20:40, Richard Henderson wrote:
>
> On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
>> +%xx_at          23:3 !function=times_4
>> +@XX3_at         ...... ... .. ..... ..... ........ ...          &XX3 xt=%xx_at xb=%xx_xb
>
> Hmm.  Depends, I suppose on whether you want acc[0-7] or vsr[0-28]
I mostly used the VSR functions here, but since I'll change patch 1 to 
your suggestion (which will require creating acc_full_offset), I'll also 
add a few accumulator-specific functions.
>
>> +/*
>> + * Packed VSX Integer GER Flags
>> + * 00 - no accumulation no saturation
>> + * 01 - accumulate but no saturation
>> + * 10 - no accumulation but with saturation
>> + * 11 - accumulate with saturation
>> + */
>> +static inline bool get_sat(uint32_t flags)
>> +{
>> +    return flags & 0x2;
>> +}
>> +
>> +static inline bool get_acc(uint32_t flags)
>> +{
>> +    return flags & 0x1;
>> +}
>
> Better to have separate helpers for these?  They'd be immediate 
> operands to the function
> replacing XVIGER (see below) and thus optimize well.
Do you mean different functions or a function that receives packed_flags 
along with the callback functions?
>
>> +#define GET_VsrN(a, i) (extract32(a->VsrB((i) / 2), (i) % 2 ? 4 : 0, 4))
>> +#define GET_VsrB(a, i) a->VsrB(i)
>> +#define GET_VsrH(a, i) a->VsrH(i)
>> +
>> +#define GET_VsrSN(a, i) (sextract32(a->VsrSB((i) / 2), (i) % 2 ? 4 : 0, 4))
>> +#define GET_VsrSB(a, i) a->VsrSB(i)
>> +#define GET_VsrSH(a, i) a->VsrSH(i)
>
> These can be made into functions of the form
>
>     typedef int32_t xviger_extract(ppc_vsr_t *a, int i);
>
In this case it'd be necessary to receive 2 xviger_extract functions, 
since XVI8GER4* multiplies one value as signed and the other as unsigned 
(the other integer GER instructions treat both as signed).

An alternative would be to isolate the innermost loop into a different 
function, like:

     typedef int64_t do_ger(int32_t a, int32_t b, int32_t at, int32_t pmsk);

     static int64_t ger_rank4(int32_t a, int32_t b, int32_t at, int32_t mask)
     {
         int64_t psum = 0, i;
         for (i = 0; i < 4; i++, mask >>= 1) {
             if (mask & 1) {
                 /* widen first: int32_t * uint32_t would multiply as unsigned */
                 psum += (int64_t)sextract32(a, i * 8, 8) *
                         (int64_t)extract32(b, i * 8, 8);
             }
         }
         return psum;
     }

That way we could avoid having 'rank' as a parameter, what do you think?
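
A minimal standalone sketch of the rank-4 inner loop, assuming local
stand-ins for extract32()/sextract32() (the *_model names are hypothetical)
and dropping the unused 'at' argument. One detail worth watching in a helper
like this: in C, a product of an int32_t and a uint32_t is evaluated as
unsigned, so a = -1, b = 1 would contribute 4294967295 rather than -1 unless
both operands are widened to int64_t first.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for extract32(): unsigned bit field [start, start + length). */
static uint32_t extract32_model(uint32_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 32 - start);
    return (value >> start) & (~0u >> (32 - length));
}

/* Stand-in for sextract32(): the same field, sign-extended. */
static int32_t sextract32_model(uint32_t value, int start, int length)
{
    uint32_t field, sign;

    assert(length < 32);
    field = extract32_model(value, start, length);
    sign = 1u << (length - 1);
    return (int32_t)(field ^ sign) - (int32_t)sign;
}

/* Sum of signed(a byte) * unsigned(b byte) over the bytes enabled in mask. */
static int64_t ger_rank4_model(uint32_t a, uint32_t b, uint32_t mask)
{
    int64_t psum = 0;
    int i;

    for (i = 0; i < 4; i++, mask >>= 1) {
        if (mask & 1) {
            psum += (int64_t)sextract32_model(a, i * 8, 8) *
                    (int64_t)extract32_model(b, i * 8, 8);
        }
    }
    return psum;
}
```

Note that this sketch, like the proposal above, consumes the mask from the
least-significant bit, whereas the XVIGER macro walks pmsk from the
most-significant bit; whichever order is chosen has to match the element
order used when the words are loaded.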

>
>
>> diff --git a/target/ppc/internal.h b/target/ppc/internal.h
>> index 8094e0b033..a994d98238 100644
>> --- a/target/ppc/internal.h
>> +++ b/target/ppc/internal.h
>> @@ -291,4 +291,32 @@ G_NORETURN void ppc_cpu_do_unaligned_access(CPUState *cs, vaddr addr,
>>                                               uintptr_t retaddr);
>>   #endif
>>
>> +/*
>> + * Auxiliary functions to pack/unpack masks for GER instructions.
>> + *
>> + * Packed format:
>> + *  Bits 0-3: xmsk
>> + *  Bits 4-7: ymsk
>> + *  Bits 8-15: pmsk
>> + */
>> +static inline uint8_t ger_get_xmsk(uint32_t packed_masks)
>> +{
>> +    return packed_masks & 0xF;
>> +}
>> +
>> +static inline uint8_t ger_get_ymsk(uint32_t packed_masks)
>> +{
>> +    return (packed_masks >> 4) & 0xF;
>> +}
>> +
>> +static inline uint8_t ger_get_pmsk(uint32_t packed_masks)
>> +{
>> +    return (packed_masks >> 8) & 0xFF;
>> +}
>> +
>> +static inline int ger_pack_masks(int pmsk, int ymsk, int xmsk)
>> +{
>> +    return (pmsk & 0xFF) << 8 | (ymsk & 0xF) << 4 | (xmsk & 0xF);
>> +}
>
> Use hw/registerfields.h.  C.f. PREDDESC in target/arm/internals.h.
Ok, will do
>
>> +static bool do_ger_XX3(DisasContext *ctx, arg_XX3 *a, uint32_t op,
>> +                             void (*helper)(TCGv_env, TCGv_i32, TCGv_i32,
>> +                                            TCGv_i32, TCGv_i32, TCGv_i32))
>> +{
>> +    uint32_t mask;
>> +    REQUIRE_INSNS_FLAGS2(ctx, ISA310);
>> +    REQUIRE_VSX(ctx);
>> +    if (unlikely((a->xa / 4 == a->xt / 4) || (a->xb / 4 == a->xt / 4))) {
>> +        gen_invalid(ctx);
>> +        return true;
>> +    }
>> +
>> +    mask = 0xFFFFFFFF;
>> +    helper(cpu_env, tcg_constant_i32(a->xa), tcg_constant_i32(a->xb),
>> +           tcg_constant_i32(a->xt), tcg_constant_i32(mask),
>> +           tcg_constant_i32(op));
>> +    return true;
>> +}
>
> Why are you passing register numbers instead of pointers, like 
> everywhere else?
Because here a register number does not map to just 1 register: the ACC 
uses 4, and XVF64GER* needs both XA and XA+1. VSR is an array, so I 
could do ppc_vsr_ptr+1, but I thought it was better not to access memory 
I was not given a pointer to. So I passed XA, which lets the helper 
request cpu_vsr_ptr(env, xa) and cpu_vsr_ptr(env, xa + 1).
>
>
> r~
-- 
Lucas Mateus M. Araujo e Castro
Instituto de Pesquisas ELDORADO 
<https://www.eldorado.org.br/?utm_campaign=assinatura_de_e-mail&utm_medium=email&utm_source=RD+Station>
Departamento Computação Embarcada
Analista de Software Trainee
Aviso Legal - Disclaimer <https://www.eldorado.org.br/disclaimer.html>


* Re: [RFC PATCH 5/7] target/ppc: Implemented xvf16ger*
  2022-04-27  0:26   ` Richard Henderson
@ 2022-04-27 21:11     ` Lucas Mateus Martins Araujo e Castro
  2022-04-27 22:30       ` Richard Henderson
  0 siblings, 1 reply; 21+ messages in thread
From: Lucas Mateus Martins Araujo e Castro @ 2022-04-27 21:11 UTC (permalink / raw)
  To: Richard Henderson, qemu-ppc
  Cc: Peter Maydell, Daniel Henrique Barboza, Greg Kurz,
	open list:All patches CC here, Cédric Le Goater,
	Alex Bennée, Aurelien Jarno, David Gibson



On 26/04/2022 21:26, Richard Henderson wrote:
> On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
>> +#define VSXGER16(NAME, ORIG_T, OR_EL)                                   \
>> +    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,             \
>> +              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)     \
>> +    {                                                                   \
>> +        ppc_vsr_t *at;                                                  \
>> +        float32 psum, aux_acc, va, vb, vc, vd;                          \
>> +        int i, j, xmsk_bit, ymsk_bit;                                   \
>> +        uint8_t xmsk = mask & 0x0F;                                     \
>> +        uint8_t ymsk = (mask >> 4) & 0x0F;                              \
>> +        uint8_t pmsk = (mask >> 8) & 0x3;                               \
>> +        ppc_vsr_t *b = cpu_vsr_ptr(env, b_r);                           \
>> +        ppc_vsr_t *a = cpu_vsr_ptr(env, a_r);                           \
>> +        float_status *excp_ptr = &env->fp_status;                       \
>> +        bool acc = ger_acc_flag(packed_flags);                          \
>> +        bool neg_acc = ger_neg_acc_flag(packed_flags);                  \
>> +        bool neg_mul = ger_neg_mul_flag(packed_flags);                  \
>> +        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {    \
>> +            at = cpu_vsr_ptr(env, at_r + i);                            \
>> +            for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {\
>> +                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {           \
>> +                    va = !(pmsk & 2) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, a,           \
>> +                                               2 * i, ORIG_T, float32); \
>> +                    vb = !(pmsk & 2) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, b,           \
>> +                                               2 * j, ORIG_T, float32); \
>> +                    vc = !(pmsk & 1) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, a,           \
>> +                                            2 * i + 1, ORIG_T, float32);\
>> +                    vd = !(pmsk & 1) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, b,           \
>> +                                            2 * j + 1, ORIG_T, float32);\
>> +                    psum = float32_mul(va, vb, excp_ptr);               \
>> +                    psum = float32_muladd(vc, vd, psum, 0, excp_ptr);   \
>
> This isn't correct -- the intermediate 'prod' (the first multiply) is not rounded.  I
> think the correct way to implement this (barring new softfloat functions) is to compute
> the intermediate product as float64 with float_round_to_odd, then float64r32_muladd into
> the correct rounding mode to finish.
Although the pseudocode does not mention it, the instruction description 
says:

- Let prod be the single-precision product of src10 and src20

which I understand to mean that the result of the first multiplication is 
stored in a float32.

But xvbf16ger2* is different (and I think this is why the last patch 
produces the wrong sign in some 0 and inf results). Its description says:

- Let prod be the product of src10 and src20, having infinite precision 
and unbounded exponent range.
- Let psum be the sum of the product, src11 multiplied by src21, and prod, 
having infinite precision and unbounded exponent range.
- Let r1 be the value psum with its significand rounded to 24-bit 
precision using the rounding mode specified by RN, but retaining 
unbounded exponent range (i.e., cannot overflow or underflow).

>
>> +                    if (acc) {                                          \
>> +                        if (neg_mul) {                                  \
>> +                            psum = float32_neg(psum);                   \
>> +                        }                                               \
>> +                        if (neg_acc) {                                  \
>> +                            aux_acc = float32_neg(at->VsrSF(j));        \
>> +                        } else {                                        \
>> +                            aux_acc = at->VsrSF(j);                     \
>> +                        }                                               \
>> +                        at->VsrSF(j) = float32_add(psum, aux_acc,       \
>> +                                                   excp_ptr);           \
>
> This one, thankfully, uses the rounded intermediate result 'msum', so 
> is ok.
Yes, this one is the easier one to deal with. The description of 
xvf16ger2* specifies that msum and the result are rounded to 
single-precision, and the description of xvbf16ger2 specifies that r1 is 
'rounded to a 24-bit significand precision and 8-bit exponent range (i.e., 
single-precision)'.
>
> Please do convert this from a macro.  Given that float16 and bfloat16 are addressed the
> same, I think the only callback you need is the conversion from float16_to_float64.  Drop
> the bf16 accessor to ppc_vsr_t.
>
Will do, although I'm considering instead of the callback being the 
conversion, maybe have it be a 4 float multiplication
     typedef float32 mul_4float(float16, float16, float16, float16);
Since float16 and bfloat16 are addressed the same, any thoughts?
>
> r~

* Re: [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions
  2022-04-27 20:24     ` Lucas Mateus Martins Araujo e Castro
@ 2022-04-27 22:28       ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2022-04-27 22:28 UTC (permalink / raw)
  To: Lucas Mateus Martins Araujo e Castro, qemu-ppc
  Cc: open list:All patches CC here, Greg Kurz,
	Daniel Henrique Barboza, Cédric Le Goater, David Gibson

On 4/27/22 13:24, Lucas Mateus Martins Araujo e Castro wrote:
> 
> On 26/04/2022 20:40, Richard Henderson wrote:
>>
>> On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
>>> +%xx_at          23:3 !function=times_4
>>> +@XX3_at         ...... ... .. ..... ..... ........ ... &XX3 xt=%xx_at xb=%xx_xb
>>
>> Hmm.  Depends, I suppose on whether you want acc[0-7] or vsr[0-28]
> I mostly used VSR function here, but since I'll change the patch 1 to your suggestion 
> (which will require creating acc_full_offset) I'll make a few changes to create some 
> functions for the accumulator
>>
>>> +/*
>>> + * Packed VSX Integer GER Flags
>>> + * 00 - no accumulation no saturation
>>> + * 01 - accumulate but no saturation
>>> + * 10 - no accumulation but with saturation
>>> + * 11 - accumulate with saturation
>>> + */
>>> +static inline bool get_sat(uint32_t flags)
>>> +{
>>> +    return flags & 0x2;
>>> +}
>>> +
>>> +static inline bool get_acc(uint32_t flags)
>>> +{
>>> +    return flags & 0x1;
>>> +}
>>
>> Better to have separate helpers for these?  They'd be immediate operands to the function
>> replacing XVIGER (see below) and thus optimize well.
> Do you mean different functions or a function that receives packed_flags along with the 
> callback functions?

I mean separate helper entry points, which use a common function that receives these as 
separate boolean arguments, along with the callbacks.  Use QEMU_FLATTEN on the helper 
entry points to ensure that everything is inlined and the constant args are optimized.

> In this case it'd be necessary to receive 2 xviger_extract functions since XVI8GER4* 
> multiply one value as signed and the other as unsigned (and other integer GER treat both 
> as signed).

Certainly.
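
A minimal sketch of that shape, reduced to a single 32-bit word per operand so that it stands alone. All sk_* names are hypothetical, SK_FLATTEN is a local stand-in that assumes QEMU_FLATTEN expands to the GCC/Clang flatten attribute, and the real helpers would take env and vector pointers instead:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in; assumes QEMU_FLATTEN is __attribute__((flatten)). */
#define SK_FLATTEN __attribute__((flatten))

typedef int64_t sk_extract_fn(uint32_t word, int idx);

/* Byte idx of word, zero-extended. */
static inline int64_t sk_ext_u8(uint32_t word, int idx)
{
    return (word >> (idx * 8)) & 0xFF;
}

/* Byte idx of word, sign-extended. */
static inline int64_t sk_ext_s8(uint32_t word, int idx)
{
    int64_t v = sk_ext_u8(word, idx);
    return v >= 128 ? v - 256 : v;
}

/* Common body: acc/sat and both extractors are constants at every call site. */
static inline int32_t sk_ger8_common(uint32_t a, uint32_t b, int32_t at,
                                     bool acc, bool sat,
                                     sk_extract_fn *ext_a,
                                     sk_extract_fn *ext_b)
{
    int64_t sum = acc ? at : 0;

    for (int i = 0; i < 4; i++) {
        sum += ext_a(a, i) * ext_b(b, i);
    }
    if (sat) {
        if (sum > INT32_MAX) {
            sum = INT32_MAX;
        } else if (sum < INT32_MIN) {
            sum = INT32_MIN;
        }
    }
    /* Without saturation the result wraps modulo 2^32 (GCC/Clang behaviour). */
    return (int32_t)(uint32_t)sum;
}

/* Separate entry points: signed bytes from a, unsigned bytes from b. */
SK_FLATTEN int32_t sk_ger8(uint32_t a, uint32_t b, int32_t at)
{
    return sk_ger8_common(a, b, at, false, false, sk_ext_s8, sk_ext_u8);
}

SK_FLATTEN int32_t sk_ger8_acc(uint32_t a, uint32_t b, int32_t at)
{
    return sk_ger8_common(a, b, at, true, false, sk_ext_s8, sk_ext_u8);
}

SK_FLATTEN int32_t sk_ger8_acc_sat(uint32_t a, uint32_t b, int32_t at)
{
    return sk_ger8_common(a, b, at, true, true, sk_ext_s8, sk_ext_u8);
}
```

Because each entry point passes literal booleans and fixed extractors, the compiler can drop the unused branches and inline the callbacks, so no packed flags need to be decoded at run time.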

> 
> An alternative would be to isolate the innermost loop into a different function, like:
> 
>      typedef int64_t do_ger(int32_t a, int32_t b, int32_t at, int32_t pmsk);
> 
>      static int64_t ger_rank4(int32_t a, int32_t b, int32_t at, int32_t mask)
>      {
>          int64_t psum = 0, i;
>          for (i = 0; i < 4; i++, mask >>= 1) {
>              if (mask & 1) {
>                  psum += (sextract32(a, i * 8, 8)) * (extract32(b, i * 8, 8));
>             }
>          }
>          return psum;
>      }
> 
> That way we could avoid having 'rank' as a parameter, what do you think?

Reasonable.  I certainly like extracting uint32_t from the vector generically and not 
having to pass that on further.

>> Why are you passing register numbers instead of pointers, like everywhere else?
> Because here we are not working only with 1 register per register number, the ACC uses 4 
> and the XVF64GER* needs to use XA and XA+1, and while VSR is an array so I could do 
> ppc_vsr_ptr+1 I thought it was better not to access memory I was not given a pointer to, 
> so I passed XA so I can request cpu_vsr_ptr(env, xa) and cpu_vsr_ptr(env, xa + 1)

I think using cpu_vsr_ptr is the mistake.

It might be clarifying to define a ppc_acc_t, if only as a typedef of ppc_vsr_t.  The 
acc_full_offset function will compute the offset for this pointer and, importantly, will 
be the place to modify if and when the architecture changes to allow or require separate 
storage for the ACC registers.
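
As an illustration only, a self-contained sketch of that idea. The sk_* types are stand-ins for ppc_vsr_t and CPUPPCState, the signature of the offset function is an assumption, and the layout assumes accumulator n currently overlays VSRs 4n to 4n+3:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ppc_vsr_t: one 128-bit register. */
typedef union {
    uint8_t  u8[16];
    uint32_t u32[4];
    uint64_t u64[2];
} sk_vsr_t;

/* An accumulator is addressed as four consecutive 128-bit rows. */
typedef sk_vsr_t sk_acc_t;

/* Stand-in for the part of the CPU state that holds the 64 VSRs. */
typedef struct {
    sk_vsr_t vsr[64];
} sk_cpu_state;

/*
 * Byte offset of accumulator acc (0..7) inside the CPU state.  This is
 * the single place to change if the ACCs ever get separate storage.
 */
static inline size_t sk_acc_full_offset(int acc)
{
    return offsetof(sk_cpu_state, vsr) + (size_t)acc * 4 * sizeof(sk_vsr_t);
}

/* Pointer to row 0 of accumulator acc; rows 1..3 follow it. */
static inline sk_acc_t *sk_acc_ptr(sk_cpu_state *env, int acc)
{
    return (sk_acc_t *)((char *)env + sk_acc_full_offset(acc));
}
```

A helper that receives such a pointer can index rows 0 to 3 directly and never needs the register number.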


r~



* Re: [RFC PATCH 5/7] target/ppc: Implemented xvf16ger*
  2022-04-27 21:11     ` Lucas Mateus Martins Araujo e Castro
@ 2022-04-27 22:30       ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2022-04-27 22:30 UTC (permalink / raw)
  To: Lucas Mateus Martins Araujo e Castro, qemu-ppc
  Cc: Peter Maydell, Daniel Henrique Barboza, Greg Kurz,
	open list:All patches CC here, Cédric Le Goater,
	Alex Bennée, Aurelien Jarno, David Gibson

On 4/27/22 14:11, Lucas Mateus Martins Araujo e Castro wrote:
>> Please do convert this from a macro.  Given that float16 and bfloat16 are addressed the
>> same, I think the only callback you need is the conversion from float16_to_float64.  Drop
>> the bf16 accessor to ppc_vsr_t.
>>
> Will do, although I'm considering instead of the callback being the conversion, maybe have 
> it be a 4 float multiplication
>      typedef float32 mul_4float(float16, float16, float16, float16);
> Since float16 and bfloat16 are addressed the same, any thoughts?

The multiplication would be identical for the two types -- only the conversion is different.
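
A rough sketch of that split in plain C, with hand-written converters instead of the softfloat ones. All names are hypothetical, float is assumed to be IEEE binary32, and results are kept in double only to keep the example short:

```c
#include <stdint.h>
#include <string.h>

/* Conversion callback: raw 16-bit pattern to double. */
typedef double sk_conv_fn(uint16_t bits);

static float sk_bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

/* bfloat16 is the upper half of a binary32 pattern. */
static double sk_bf16_to_f64(uint16_t h)
{
    return sk_bits_to_float((uint32_t)h << 16);
}

/* IEEE binary16: 1 sign bit, 5 exponent bits (bias 15), 10 fraction bits. */
static double sk_f16_to_f64(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp = (h >> 10) & 0x1F;
    uint32_t frac = h & 0x3FF;

    if (exp == 0x1F) {                      /* infinity or NaN */
        return sk_bits_to_float(sign | 0x7F800000u | (frac << 13));
    }
    if (exp == 0) {                         /* zero or subnormal */
        float mag = (float)frac * 0x1p-24f;
        return sign ? -mag : mag;
    }
    /* Normal number: rebias the exponent from 15 to 127. */
    return sk_bits_to_float(sign | ((exp + 112) << 23) | (frac << 13));
}

/* Shared arithmetic: only the converter differs between the two formats. */
static double sk_dot2(sk_conv_fn *conv, uint16_t a0, uint16_t b0,
                      uint16_t a1, uint16_t b1)
{
    return conv(a0) * conv(b0) + conv(a1) * conv(b1);
}
```

For example, sk_dot2(sk_f16_to_f64, 0x3C00, 0x4000, 0x4000, 0x4000) and sk_dot2(sk_bf16_to_f64, 0x3F80, 0x4000, 0x4000, 0x4000) both compute 1 * 2 + 2 * 2 through the same code path.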


r~



* Re: [RFC PATCH 0/7] VSX MMA Implementation
  2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
                   ` (7 preceding siblings ...)
  2022-04-27  6:21 ` [RFC PATCH 0/7] VSX MMA Implementation Joel Stanley
@ 2022-04-28 14:05 ` Lucas Mateus Martins Araujo e Castro
  8 siblings, 0 replies; 21+ messages in thread
From: Lucas Mateus Martins Araujo e Castro @ 2022-04-28 14:05 UTC (permalink / raw)
  To: qemu-ppc
  Cc: Daniel Henrique Barboza, richard.henderson, Greg Kurz,
	qemu-devel, victor.colombo, Alex Bennée, David Gibson


Something I forgot to mention in the cover letter: the XVFGER 
instructions accumulate the exception status and only at the end set the 
FPSCR and take a Program interrupt on a trap-enabled exception. However, 
with the exception functions as currently set up in 
target/ppc/fpu_helper.c, a call to set an FPSCR bit could raise an 
exception before all bits have been set.

Victor (CCing him) is working on a patch series to fix the FPSCR.FI bit 
that will reorganize do_float_check_status, which would solve the 
aforementioned problem, so for now I sent the series without trying to 
solve it.
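
To illustrate the ordering issue, a toy model in plain C. The flag names and the sk_fpscr struct are made up for this sketch and do not correspond to the real FPSCR layout or to the functions in fpu_helper.c:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical exception flags, one bit each. */
enum {
    SK_INVALID  = 1u << 0,
    SK_OVERFLOW = 1u << 1,
    SK_INEXACT  = 1u << 2,
};

struct sk_fpscr {
    uint32_t status;   /* sticky exception bits */
    uint32_t enable;   /* trap-enable bits, same layout as status */
};

/*
 * Eager commit: each flag is written on its own and a trap-enabled flag
 * interrupts immediately, so the flags after it are never recorded.
 * Returns true if a trap was taken.
 */
static bool sk_commit_eager(struct sk_fpscr *f, const uint32_t *flags, int n)
{
    for (int i = 0; i < n; i++) {
        f->status |= flags[i];
        if (flags[i] & f->enable) {
            return true;
        }
    }
    return false;
}

/*
 * Deferred commit: gather the flags of every element operation first,
 * write them all, and only then decide whether to trap.
 */
static bool sk_commit_deferred(struct sk_fpscr *f, const uint32_t *flags,
                               int n)
{
    uint32_t acc = 0;

    for (int i = 0; i < n; i++) {
        acc |= flags[i];
    }
    f->status |= acc;
    return (acc & f->enable) != 0;
}
```

With the invalid exception trap-enabled and the flags invalid then overflow, the eager version traps with only the invalid bit recorded, while the deferred version records both bits before trapping.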

In v2 I'll remember to mention this in the cover letter.

On 26/04/2022 09:50, Lucas Mateus Castro(alqotel) wrote:
> From: "Lucas Mateus Castro (alqotel)"<lucas.araujo@eldorado.org.br>
>
> This patch series is an RFC of the Matrix-Multiply Assist (MMA)
> instructions implementation from the PowerISA 3.1
>
> These and the VDIV/VMOD implementation are the last new PowerISA 3.1
> instructions left to be implemented.
>
> Thanks
> Lucas Mateus Castro (alqotel) (7):
>    target/ppc: Implement xxm[tf]acc and xxsetaccz
>    target/ppc: Implemented xvi*ger* instructions
>    target/ppc: Implemented pmxvi*ger* instructions
>    target/ppc: Implemented xvf*ger*
>    target/ppc: Implemented xvf16ger*
>    target/ppc: Implemented pmxvf*ger*
>    target/ppc: Implemented [pm]xvbf16ger2*
>
>   include/fpu/softfloat.h             |   9 ++
>   target/ppc/cpu.h                    |  15 +++
>   target/ppc/fpu_helper.c             | 130 ++++++++++++++++++
>   target/ppc/helper.h                 |   7 +
>   target/ppc/insn32.decode            |  49 +++++++
>   target/ppc/insn64.decode            |  80 +++++++++++
>   target/ppc/int_helper.c             |  85 ++++++++++++
>   target/ppc/internal.h               |  28 ++++
>   target/ppc/translate/vsx-impl.c.inc | 200 ++++++++++++++++++++++++++++
>   9 files changed, 603 insertions(+)
>

* Re: [RFC PATCH 0/7] VSX MMA Implementation
  2022-04-27  7:10   ` Cédric Le Goater
@ 2022-05-05  6:06     ` Joel Stanley
  0 siblings, 0 replies; 21+ messages in thread
From: Joel Stanley @ 2022-05-05  6:06 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Lucas Mateus Castro(alqotel),
	qemu-ppc, Daniel Henrique Barboza, Richard Henderson,
	QEMU Developers, Greg Kurz, David Gibson, Leandro Lupori,
	Matheus Ferst

On Wed, 27 Apr 2022 at 07:10, Cédric Le Goater <clg@kaod.org> wrote:
>
> Hello,
>
> On 4/27/22 08:21, Joel Stanley wrote:
> > On Tue, 26 Apr 2022 at 12:51, Lucas Mateus Castro(alqotel)
> > <lucas.araujo@eldorado.org.br> wrote:
> >>
> >> From: "Lucas Mateus Castro (alqotel)" <lucas.araujo@eldorado.org.br>
> >>
> >> This patch series is an RFC of the Matrix-Multiply Assist (MMA)
> >> instructions implementation from the PowerISA 3.1
> >>
> >> These and the VDIV/VMOD implementation are the last new PowerISA 3.1
> >> instructions left to be implemented.
> >>
> >> Thanks
> >> Lucas Mateus Castro (alqotel) (7):
> >>    target/ppc: Implement xxm[tf]acc and xxsetaccz
> >>    target/ppc: Implemented xvi*ger* instructions
> >>    target/ppc: Implemented pmxvi*ger* instructions
> >>    target/ppc: Implemented xvf*ger*
> >>    target/ppc: Implemented xvf16ger*
> >>    target/ppc: Implemented pmxvf*ger*
> >>    target/ppc: Implemented [pm]xvbf16ger2*
> >
> > I have a small test case for the MMA instructions that Alistair wrote
> > a while back[1]. It passes when run with these patches applied
> > (previously it would sigill).
>
> Could we have your Tested-by then ?

Sure! I was going to re-test v2, but it doesn't hurt to mention it for
this version.

Tested-by: Joel Stanley <joel@jms.id.au>

>
>
> >
> > $ qemu-ppc64le -cpu power10  -L ~/ppc64le/ ./test -m
> > Smoke test MMA
> > MMA[0] = 1 (Correct)
> > MMA[1] = 2 (Correct)
> > MMA[2] = 3 (Correct)
> > MMA[3] = 4 (Correct)
> > MMA[4] = 2 (Correct)
> > MMA[5] = 4 (Correct)
> > MMA[6] = 6 (Correct)
> > MMA[7] = 8 (Correct)
> > MMA[8] = 3 (Correct)
> > MMA[9] = 6 (Correct)
> > MMA[10] = 9 (Correct)
> > MMA[11] = 12 (Correct)
> > MMA[12] = 4 (Correct)
> > MMA[13] = 8 (Correct)
> > MMA[14] = 12 (Correct)
> > MMA[15] = 16 (Correct)
> >
> > [1] https://github.com/shenki/p10_tests
>
> Looks like a good candidate for tests/tcg/ppc64le/. Adding Matheus and Leandro.
>
> Thanks,
>
> C.
>
>
>
> >
> >
> >>
> >>   include/fpu/softfloat.h             |   9 ++
> >>   target/ppc/cpu.h                    |  15 +++
> >>   target/ppc/fpu_helper.c             | 130 ++++++++++++++++++
> >>   target/ppc/helper.h                 |   7 +
> >>   target/ppc/insn32.decode            |  49 +++++++
> >>   target/ppc/insn64.decode            |  80 +++++++++++
> >>   target/ppc/int_helper.c             |  85 ++++++++++++
> >>   target/ppc/internal.h               |  28 ++++
> >>   target/ppc/translate/vsx-impl.c.inc | 200 ++++++++++++++++++++++++++++
> >>   9 files changed, 603 insertions(+)
> >>
> >> --
> >> 2.31.1
> >>
> >>
>


Thread overview: 21+ messages
2022-04-26 12:50 [RFC PATCH 0/7] VSX MMA Implementation Lucas Mateus Castro(alqotel)
2022-04-26 12:50 ` [RFC PATCH 1/7] target/ppc: Implement xxm[tf]acc and xxsetaccz Lucas Mateus Castro(alqotel)
2022-04-26 22:59   ` Richard Henderson
2022-04-26 12:50 ` [RFC PATCH 2/7] target/ppc: Implemented xvi*ger* instructions Lucas Mateus Castro(alqotel)
2022-04-26 23:40   ` Richard Henderson
2022-04-27 20:24     ` Lucas Mateus Martins Araujo e Castro
2022-04-27 22:28       ` Richard Henderson
2022-04-26 12:50 ` [RFC PATCH 3/7] target/ppc: Implemented pmxvi*ger* instructions Lucas Mateus Castro(alqotel)
2022-04-26 12:50 ` [RFC PATCH 4/7] target/ppc: Implemented xvf*ger* Lucas Mateus Castro(alqotel)
2022-04-27  0:09   ` Richard Henderson
2022-04-26 12:50 ` [RFC PATCH 5/7] target/ppc: Implemented xvf16ger* Lucas Mateus Castro(alqotel)
2022-04-27  0:26   ` Richard Henderson
2022-04-27 21:11     ` Lucas Mateus Martins Araujo e Castro
2022-04-27 22:30       ` Richard Henderson
2022-04-26 12:50 ` [RFC PATCH 6/7] target/ppc: Implemented pmxvf*ger* Lucas Mateus Castro(alqotel)
2022-04-27  0:33   ` Richard Henderson
2022-04-26 12:50 ` [RFC PATCH 7/7] target/ppc: Implemented [pm]xvbf16ger2* Lucas Mateus Castro(alqotel)
2022-04-27  6:21 ` [RFC PATCH 0/7] VSX MMA Implementation Joel Stanley
2022-04-27  7:10   ` Cédric Le Goater
2022-05-05  6:06     ` Joel Stanley
2022-04-28 14:05 ` Lucas Mateus Martins Araujo e Castro
