On Fri, May 1, 2020 at 10:18 PM Richard Henderson <richard.henderson@linaro.org> wrote:

> On 5/1/20 6:10 AM, Alex Bennée wrote:
> >
> > 罗勇刚(Yonggang Luo) writes:
> >
> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan wrote:
> >>
> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> >>>> That's what I suggested.
> >>>> We keep a cache of floating-point operations:
> >>>> typedef struct FpRecord {
> >>>>     uint8_t op;
> >>>>     float32 A;
> >>>>     float32 B;
> >>>> } FpRecord;
> >>>> FpRecord fp_cache[1024];
> >>>> int fp_cache_length;
> >>>> uint32_t fp_exceptions;
> >>>>
> >>>> 1. For each new fp operation we push it to the fp_cache.
> >>>> 2. Once we read fp_exceptions, we re-compute it by re-running
> >>>>    the FpRecord sequence, and clear fp_cache_length.
> >>>
> >>> Why do you need to store more than the last fp op? The cumulative bits can
> >>> be tracked like it's done for other targets by not clearing fp_status; then
> >>> you can read it from there. Only the non-sticky FI bit needs to be
> >>> computed, but that's only determined by the last op, so it's enough to
> >>> remember that and run that with softfloat (or even hardfloat after
> >>> clearing status, but softfloat may be faster for this) to get the bits for
> >>> the last op when status is read.
> >>>
> >> Yes, storing only the last fp op is also an option. Do you mean that we
> >> store the last fp op and calculate it when necessary? I am thinking about
> >> a general fp optimization method that suits all targets.
> >
> > I think that's getting a little ahead of yourself. Let's prove the
> > technique is valuable for PPC (given it has the most to gain). We can
> > always generalise later if it's worthwhile.
>
> Indeed.
>
> > Rather than creating a new structure I would suggest creating 3 new tcg
> > globals (op, inA, inB) and re-factor the front-end code so each FP op
> > loaded the TCG globals.
> > The TCG optimizer should pick up aliased loads
> > and automatically eliminate the dead ones. We might need some new
> > machinery for the TCG to avoid spilling the values over potentially
> > faulting loads/stores but that is likely a phase 2 problem.
>
> There's no point in new tcg globals.
>
> Every fp operation can raise an exception, and therefore every fp operation
> will flush tcg globals to memory. Therefore there is no optimization to be
> done at the tcg opcode level.
>
> However, every fp operation calls a helper function, and the quickest thing to
> do is store the inputs to env->(op, inA, inB, inC) in the helper before
> performing the operation.
>
> > Next you will want to find places that care about the per-op bits of
> > cpu_fpscr and call a helper with the new globals to re-run the
> > computation and feed the values in.
>
> Before we even get to this deferred fp operation thing, there are several giant
> improvements to ppc emulation that can be made:
>
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate function
> call. I suspect this is worth a decent speedup by itself.

Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
Can you give me a more detailed example? I still don't get the idea.

> Step 2 is to notice when all fp exceptions are masked, so that no exception can
> be raised, and set a tb_flags bit. This is the default fp environment that
> libc enables and therefore extremely common.
>
> Currently, ppc has 3 helpers called per fp operation. If step 1 is handled
> correctly, then we're down to 2 fp helpers per fp operation. If no exceptions
> need raising, then we can perform the entire operation with a single
> function call.
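
A minimal sketch of one possible reading of Step 1, not QEMU code: the
per-operation flags are folded into the accumulated bits and cleared inside
the same helper that performs the operation, so exception_flags is back to 0
when the helper returns and no separate reset call is needed before the next
one. Every name here (ToyEnv, sticky, toy_div, toy_helper_div) is a
hypothetical stand-in, the "operation" is an integer division that only
mimics how softfloat ORs flags into its status, and raising enabled
exceptions is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; these are not QEMU's definitions. */
enum {
    TOY_FLAG_INEXACT   = 1u << 0,
    TOY_FLAG_DIVBYZERO = 1u << 1,
};

typedef struct {
    uint8_t exception_flags;    /* per-operation flags, 0 in the steady state */
} ToyFloatStatus;

typedef struct {
    ToyFloatStatus fp_status;
    uint32_t sticky;            /* stand-in for the accumulated FPSCR bits */
} ToyEnv;

/* Stand-in for an arithmetic routine: it ORs flags into the status. */
static uint32_t toy_div(uint32_t a, uint32_t b, ToyFloatStatus *s)
{
    if (b == 0) {
        s->exception_flags |= TOY_FLAG_DIVBYZERO;
        return 0;
    }
    if (a % b != 0) {
        s->exception_flags |= TOY_FLAG_INEXACT;
    }
    return a / b;
}

/*
 * Single helper per operation: relies on exception_flags == 0 on entry and
 * re-establishes that before returning, so no reset helper is called first.
 */
static uint32_t toy_helper_div(ToyEnv *env, uint32_t a, uint32_t b)
{
    uint32_t result;
    uint8_t flags;

    assert(env->fp_status.exception_flags == 0);    /* steady-state invariant */

    result = toy_div(a, b, &env->fp_status);
    flags = env->fp_status.exception_flags;
    if (flags != 0) {                               /* uncommon path */
        env->sticky |= flags;                       /* accumulate */
        env->fp_status.exception_flags = 0;         /* restore the invariant */
    }
    return result;
}
```

With this shape, an operation that raises nothing costs one test of
exception_flags after the arithmetic, and the clearing work is only paid when
a flag was actually set.
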
>
> We would require a parallel set of fp helpers that (1) performs the operation
> and (2) does any post-processing of the exception bits straight away, but (3)
> without raising any exceptions. Sort of like helper_fadd +
> do_float_check_status, but less. IIRC the only real extra work is categorizing
> invalid exceptions. We could even plausibly extend softfloat to do that while
> it is recording the invalid exception.
>
> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute inexact
> from the inverse hardfloat operation. This would let us relax the restriction
> of only using hardfloat when we already have an accrued inexact
> exception.
>
> Only after all of these are done is it worth experimenting with caching the
> last fp operation.
>
>
> r~

--
Yours sincerely,
罗勇刚 (Yonggang Luo)