On 26/04/2022 21:26, Richard Henderson wrote: > On 4/26/22 05:50, Lucas Mateus Castro(alqotel) wrote: >> +#define VSXGER16(NAME, ORIG_T, >> OR_EL)                                   \ >> +    void NAME(CPUPPCState *env, uint32_t a_r, uint32_t >> b_r,             \ >> +              uint32_t  at_r, uint32_t mask, uint32_t >> packed_flags)     \ >> + { \ >> +        ppc_vsr_t >> *at;                                                  \ >> +        float32 psum, aux_acc, va, vb, vc, >> vd;                          \ >> +        int i, j, xmsk_bit, >> ymsk_bit;                                   \ >> +        uint8_t xmsk = mask & >> 0x0F;                                     \ >> +        uint8_t ymsk = (mask >> 4) & >> 0x0F;                              \ >> +        uint8_t pmsk = (mask >> 8) & >> 0x3;                               \ >> +        ppc_vsr_t *b = cpu_vsr_ptr(env, >> b_r);                           \ >> +        ppc_vsr_t *a = cpu_vsr_ptr(env, >> a_r);                           \ >> +        float_status *excp_ptr = >> &env->fp_status;                       \ >> +        bool acc = >> ger_acc_flag(packed_flags);                          \ >> +        bool neg_acc = >> ger_neg_acc_flag(packed_flags);                  \ >> +        bool neg_mul = >> ger_neg_mul_flag(packed_flags);                  \ >> +        for (i = 0, xmsk_bit = 1 << 3; i < 4; i++, xmsk_bit >>= 1) >> {    \ >> +            at = cpu_vsr_ptr(env, at_r + >> i);                            \ >> +            for (j = 0, ymsk_bit = 1 << 3; j < 4; j++, ymsk_bit >>= >> 1) {\ >> +                if ((xmsk_bit & xmsk) && (ymsk_bit & ymsk)) >> {           \ >> +                    va = !(pmsk & 2) ? float32_zero >> :                   \ >> +                                       GET_VSR(Vsr##OR_EL, >> a,           \ >> +                                               2 * i, ORIG_T, >> float32); \ >> +                    vb = !(pmsk & 2) ? float32_zero >> :                   \ >> +                                       GET_VSR(Vsr##OR_EL, >> b,           \ >> +                                               2 * j, ORIG_T, >> float32); \ >> +                    vc = !(pmsk & 1) ? float32_zero >> :                   \ >> +                                       GET_VSR(Vsr##OR_EL, >> a,           \ >> +                                            2 * i + 1, ORIG_T, >> float32);\ >> +                    vd = !(pmsk & 1) ? float32_zero >> :                   \ >> +                                       GET_VSR(Vsr##OR_EL, >> b,           \ >> +                                            2 * j + 1, ORIG_T, >> float32);\ >> +                    psum = float32_mul(va, vb, >> excp_ptr);               \ >> +                    psum = float32_muladd(vc, vd, psum, 0, >> excp_ptr);   \ > > This isn't correct -- the intermediate 'prod' (the first multiply) is > not rounded.  I > think the correct way to implement this (barring new softfloat > functions) is to compute > the intermediate product as float64 with float_round_to_odd, then > float64r32_muladd into > the correct rounding mode to finish. While not mentioned in the pseudocode the instruction description says: - Let prod be the single-precision product of src10 and src20 Which I understand as the result of the first multiplication being stored in a float32 But in xvbf16ger2* it's different (and I think this is the reason the last patch is resulting in the wrong signal in some 0 and inf results), the description says: - Let prod be the product of src10 and src20, having infinite precision and unbounded exponent range. - Let psum be the sum of the product, src11 multiplied by src21, and prod, having infinite precision and unbounded exponent range. - Let r1 be the value psum with its significand rounded to 24-bit precision using the rounding mode specified by RN, but retaining unbounded exponent range (i.e., cannot overflow or underflow). > >> +                    if (acc) >> {                                          \ >> +                        if (neg_mul) >> {                                  \ >> +                            psum = >> float32_neg(psum);                   \ >> + }                                               \ >> +                        if (neg_acc) >> {                                  \ >> +                            aux_acc = >> float32_neg(at->VsrSF(j));        \ >> +                        } else >> {                                        \ >> +                            aux_acc = >> at->VsrSF(j);                     \ >> + }                                               \ >> +                        at->VsrSF(j) = float32_add(psum, >> aux_acc,       \ >> + excp_ptr);           \ > > This one, thankfully, uses the rounded intermediate result 'msum', so > is ok. Yes this one is the easier one to deal with, in the description for the xvf16ger2* it specifies that msum and the result is rounded to single-precision and in the description for the xvbf16ger2 it specifies that r1 is 'rounded to a 24-bit significand precision and 8-bit exponent range (i.e., single-precision)' > > Please do convert this from a macro.  Given that float16 and bfloat16 > are addressed the > same, I think the only callback you need is the conversion from > float16_to_float64.  Drop > the bf16 accessor to ppc_vsr_t. > Will do, although I'm considering instead of the callback being the conversion, maybe have it be a 4 float multiplication     typedef float32 mul_4float(float16, float16, float16, float16); Since float16 and bfloat16 are addressed the same, any thoughts? > > r~ -- Lucas Mateus M. Araujo e Castro Instituto de Pesquisas ELDORADO Departamento Computação Embarcada Analista de Software Trainee Aviso Legal - Disclaimer