On 26/04/2022 21:26, Richard Henderson wrote:
> On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote:
>> +#define VSXGER16(NAME, ORIG_T, OR_EL)                                   \
>> +    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t b_r,             \
>> +              uint32_t  at_r, uint32_t mask, uint32_t packed_flags)     \
>> +    {                                                                   \
>> +        ppc_vsr_t *at;                                                  \
>> +        float32 psum, aux_acc, va, vb, vc, vd;                          \
>> +        int i, j, xmsk_bit, ymsk_bit;                                   \
>> +        uint8_t xmsk = mask & 0x0F;                                     \
>> +        uint8_t ymsk = (mask >> 4) & 0x0F;                              \
>> +        uint8_t pmsk = (mask >> 8) & 0x3;                               \
>> +        ppc_vsr_t *b = cpu_vsr_ptr(env, b_r);                           \
>> +        ppc_vsr_t *a = cpu_vsr_ptr(env, a_r);                           \
>> +        float_status *excp_ptr = &env->fp_status;                       \
>> +        bool acc = ger_acc_flag(packed_flags);                          \
>> +        bool neg_acc = ger_neg_acc_flag(packed_flags);                  \
>> +        bool neg_mul = ger_neg_mul_flag(packed_flags);                  \
>> +        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) {    \
>> +            at = cpu_vsr_ptr(env, at_r + i);                            \
>> +            for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= 1) {\
>> +                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) {           \
>> +                    va = !(pmsk & 2) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, a,           \
>> +                                               2 * i, ORIG_T, float32); \
>> +                    vb = !(pmsk & 2) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, b,           \
>> +                                               2 * j, ORIG_T, float32); \
>> +                    vc = !(pmsk & 1) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, a,           \
>> +                                            2 * i + 1, ORIG_T, float32);\
>> +                    vd = !(pmsk & 1) ? float32_zero :                   \
>> +                                       GET_VSR(Vsr##OR_EL, b,           \
>> +                                            2 * j + 1, ORIG_T, float32);\
>> +                    psum = float32_mul(va, vb, excp_ptr);               \
>> +                    psum = float32_muladd(vc, vd, psum, 0, excp_ptr);   \
>
> This isn't correct -- the intermediate 'prod' (the first multiply) is
> not rounded.  I think the correct way to implement this (barring new
> softfloat functions) is to compute the intermediate product as float64
> with float_round_to_odd, then float64r32_muladd into the correct
> rounding mode to finish.
While it is not mentioned in the pseudocode, the instruction description says:

- Let prod be the single-precision product of src10 and src20

I understand this to mean that the result of the first multiplication is
stored in a float32.

But for xvbf16ger2* it is different (and I think this is why the last
patch produces the wrong sign in some 0 and inf results). There the
description says:

- Let prod be the product of src10 and src20, having infinite precision 
and unbounded exponent range.
- Let psum be the sum of the product, src11 multiplied by src21, and 
prod, having infinite precision and unbounded exponent range.
- Let r1 be the value psum with its significand rounded to 24-bit 
precision using the rounding mode specified by RN, but retaining 
unbounded exponent range (i.e., cannot overflow or underflow).

>
>> +                    if (acc) {                                          \
>> +                        if (neg_mul) {                                  \
>> +                            psum = float32_neg(psum);                   \
>> +                        }                                               \
>> +                        if (neg_acc) {                                  \
>> +                            aux_acc = float32_neg(at->VsrSF(j));        \
>> +                        } else {                                        \
>> +                            aux_acc = at->VsrSF(j);                     \
>> +                        }                                               \
>> +                        at->VsrSF(j) = float32_add(psum, aux_acc,       \
>> +                                                   excp_ptr);           \
>
> This one, thankfully, uses the rounded intermediate result 'msum', so 
> is ok.
Yes, this one is the easier one to deal with. The description of
xvf16ger2* specifies that msum and the result are rounded to
single-precision, and the description of xvbf16ger2* specifies that r1
is 'rounded to a 24-bit significand precision and 8-bit exponent range
(i.e., single-precision)'.
>
> Please do convert this from a macro.  Given that float16 and bfloat16
> are addressed the same, I think the only callback you need is the
> conversion from float16_to_float64.  Drop the bf16 accessor to
> ppc_vsr_t.
>
Will do. Instead of making the callback the conversion, though, I'm
considering making it a 4-float multiplication:
     typedef float32 mul_4float(float16, float16, float16, float16);
since float16 and bfloat16 are addressed the same. Any thoughts?
>
> r~
-- 
Lucas Mateus M. Araujo e Castro
Instituto de Pesquisas ELDORADO 
<https://www.eldorado.org.br/?utm_campaign=assinatura_de_e-mail&utm_medium=email&utm_source=RD+Station>
Embedded Computing Department
Trainee Software Analyst