* Re: R: About hardfloat in ppc
2020-04-29 11:57 ` Alex Bennée
@ 2020-04-29 12:33 ` 罗勇刚(Yonggang Luo)
2020-04-29 13:38 ` Alex Bennée
2020-04-29 14:31 ` R: " Dino Papararo
2020-04-29 23:12 ` R: " 罗勇刚(Yonggang Luo)
2 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-29 12:33 UTC (permalink / raw)
To: Alex Bennée
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Dino Papararo
On Wed, Apr 29, 2020 at 7:57 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Dino Papararo <skizzato73@msn.com> writes:
>
> > Hello,
> > regarding the handling of PPC FPU exceptions and hardfloat support, we
> > could consider a different approach for different instructions.
> > Not all FPU instructions care about the inexact or exception bits: if I
> > take a simple fadd f0,f1,f2, the sum f1+f2 is copied into register f0
> > and no one checks the inexact or exception bits raised in the FPSCR
> > register.
> > If instead I take fadd. f0,f1,f2, the dot after the instruction means I
> > want to take the inexact or exception bits into account.
> > So I could use hardfloat for the first case and softfloat for the second.
> > Could this be a quick way to start implementing hardfloat for PPC?
>
> While it may be true that normal software practice is not to read the
> exception registers for every operation we can't base our emulation on
> that. We must always be able to re-create the state of the exception
> registers whenever they may be read by the program. There are 3 cases
> this may happen:
>
> - a direct read of the inexact register
> - checking the sigcontext of a synchronous exception (e.g. fault)
> - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)
>
> Given the way the translator works we can simplify the asynchronous case
> because we know they are only ever delivered at the start of translated
> blocks. We must have a fully rectified system state at the end of every
> block. So let's consider some cases:
>
> fpOpA
> clear flags
> fpOpB
> clear flags
> fpOpC
> read flags
>
So we only need to honour the clear-flags operation that precedes the last
fp op executed before the flags are read? Then the key point is to find
every read-flags op and, for each one, the latest clear-flags op that comes
before the last fp instruction preceding that read. Can this be expressed
in TCG ops?
>
> Assuming we know the fpOps can't generate exceptions we can know that
> only fpOpC will ever generate a user visible floating point flags so we
> can indeed use hardfloat for fpOpA and fpOpB. However if we see the
> pattern:
>
> fpOpA
> ld/st
>
What does ld/st mean? Loads and stores of floating-point values?
> clear flags
> fpOpB
> read flags
>
> we must have the fully rectified version of the flags because the ld/st
> may fault. However it's not guaranteed it will fault so we could defer
> the flag calculation for fpOpA until such time as we need it. The
> easiest way would be to save the values going into the operation and
> then re-run it in softfloat when required (hopefully never ;-).
>
> A lot will depend on the behaviour of the architecture. For example:
>
> fpOpA
> fpOpB
> read flags
>
> whether or not we need to be able to calculate the flags for fpOpA will
> depend on if fpOpB completely resets the flags visible or if the result
> is additive.
>
> So in short I think there may be scope for using hardfloat but it will
> require the front-end to know when it is safe to skip flag calculation
> in particular cases. We might even need support within TCG for saving
> (and marking) temporaries over potentially faulting boundaries so these
> lazy evaluations can be done. We can certainly add an fp-status-less
> set of primitives to softfloat which can use the hardfloat path when we
> know we are using normal numbers.
>
> >
> > Some documentation here:
> > http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html
> >
> > Regards,
> > Dino Papararo
> >
> > -----Original Message-----
> > From: Qemu-devel <qemu-devel-bounces+skizzato73=msn.com@nongnu.org> On
> > behalf of Alex Bennée
> > Sent: Tuesday, 28 April 2020 10:37
> > To: luoyonggang@gmail.com
> > Cc: qemu-ppc@nongnu.org; qemu-devel@nongnu.org
> > Subject: Re: About hardfloat in ppc
> >
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> I am confused about why we can use hard-float only when inexact is
> >> already set.
> >
> > The inexact behaviour of the host hardware may be different from the
> > guest architecture we are trying to emulate and the host hardware may
> > not be configurable to emulate the guest mode.
> >
> > Have a look in softfloat.c and see all the places where
> > float_flag_inexact is set. Can you convince yourself that the host
> > hardware will do the same?
> >
> >> And PPC always clears the inexact flag before calling the soft-float
> >> functions, so we cannot optimize it with hard-float.
> >> I need some resources about the inexact flag and why it is always
> >> cleared in the PPC FP simulation.
> >
> > Because that is the behaviour of the PPC floating point unit. The
> > inexact flag will represent the last operation done.
> >
> >> I am looking at two possible solutions:
> >> 1. do not clear the inexact flag in the PPC simulation
> >> 2. even when inexact is cleared, still use an alternative hard-float
> >>    path.
> >>
> >> But I am a beginner for now and have no clue about all these things.
> >
> > Well you'll need to learn about floating point because these are rather
> > fundamental aspects of its behaviour. In the old days QEMU used to use
> > the host floating point processor with its template-based translation.
> > However this led to lots of weird bugs because the floating point
> > answers under QEMU were different from the target it was trying to
> > emulate. It was for this reason softfloat was introduced. The hardfloat
> > optimisation can only be done when we are confident that we will get the
> > exact same answer as the target we are trying to emulate - a "faster but
> > incorrect" mode is just going to cause confusion as discussed in the
> > previous thread. Have you read that yet?
> >
> >>
> >> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée <alex.bennee@linaro.org> wrote:
> >>
> >>>
> >>> BALATON Zoltan <balaton@eik.bme.hu> writes:
> >>>
> >>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
> >>> >> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >>> >>> Because the ppc fpu-helpers always clear float_flag_inexact, is
> >>> >>> it possible to optimize the performance when float_flag_inexact
> >>> >>> is cleared?
> >>> >>
> >>> >> There was some discussion about this in the last thread about
> >>> >> enabling hardfloat for PPC. See the thread:
> >>> >>
> >>> >> Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> >> Date: Tue, 18 Feb 2020 18:10:16 +0100
> >>> >> Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>
> >>> >
> >>> > I've answered this already with link to that thread here:
> >>> >
> >>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
> >>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
> >>> > : From: BALATON Zoltan <balaton@eik.bme.hu>
> >>> > : To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
> >>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
> >>> > :     qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
> >>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> > :
> >>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> >>> > :> Is this stable now? I'd like to see hardfloat landed :)
> >>> > :
> >>> > : If you want to see hardfloat for PPC then you should read the
> >>> > : replies to this patch which can be found here:
> >>> > :
> >>> > : http://patchwork.ozlabs.org/patch/1240235/
> >>> > :
> >>> > : to understand what's needed then try to implement the solution
> >>> > : with FP exceptions cached in a global that maybe could work. I
> >>> > : won't be able to do that as said here:
> >>> > :
> >>> > : https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
> >>> > :
> >>> > : because I don't have time to learn all the details needed. I think
> >>> > : others are in the same situation so unless somebody puts in the
> >>> > : necessary effort this won't change.
> >>> >
> >>> > Which also had a proposed solution to the problem that you could
> >>> > try to implement, in particular see this message:
> >>> >
> >>> >
> >>> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124
> >>> >
> >>> > and Richard's reply immediately below that. In short, to optimise
> >>> > FPU emulation we would either find a way to compute the inexact flag
> >>> > quickly without reading the FPU status (this may not be possible)
> >>> > or somehow get the status from the FPU, but the obvious way of
> >>> > clearing the flags and reading them after each operation is too
> >>> > slow. So maybe using exceptions and only clearing when there's
> >>> > actually a change could be faster.
> >>> >
> >>> > As to how to use exceptions see this message in the above thread:
> >>> >
> >>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html
> >>> >
> >>> > But that only shows how to hook in an exception handler; what it
> >>> > does still needs to be implemented, then tested and benchmarked.
> >>> >
> >>> > I still don't know where are the extensive PPC floating point tests
> >>> > to use for checking results though as that was never answered.
> >>>
> >>> Specifically for PPC we don't have them. We use the softfloat test
> >>> cases to exercise our softfloat/hardfloat code as part of "make
> >>> check-softfloat". You can also re-build fp-bench for each guest
> >>> target to measure raw throughput.
> >>>
> >>> >> However in short the problem is if the float_flag_inexact is clear
> >>> >> you must use softfloat so you can properly calculate the inexact
> >>> >> status. We can't take advantage of the inexact stickiness without
> >>> >> losing the fidelity of the calculation.
> >>> >
> >>> > I still don't get why can't we use hardware via exception handler
> >>> > to detect flags for us and why do we only use hardfloat in some
> >>> > corner cases. If reading the status is too costly then we could
> >>> > mirror it in a global which is set by an FP exception handler.
> >>> > Shouldn't that be faster? Is there a reason that can't work?
> >>>
> >>> It would work but it would be slow. Almost every FP operation sets
> >>> the inexact flag so it would generate an exception and exceptions
> >>> take time to process.
> >>>
> >>> For the guests where we use hardfloat, operations with inexact
> >>> already latched are not a corner case - they are the common case,
> >>> which is why it helps.
> >>>
> >>> >
> >>> > Regards,
> >>> > BALATON Zoltan
> >>>
> >>>
> >>> --
> >>> Alex Bennée
> >>>
>
>
> --
> Alex Bennée
>
--
Yours sincerely,
罗勇刚 (Yonggang Luo)
* Re: R: About hardfloat in ppc
2020-04-29 12:33 ` 罗勇刚(Yonggang Luo)
@ 2020-04-29 13:38 ` Alex Bennée
0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-04-29 13:38 UTC (permalink / raw)
To: luoyonggang
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Dino Papararo
罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> On Wed, Apr 29, 2020 at 7:57 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>>
>> Dino Papararo <skizzato73@msn.com> writes:
>>
>> > Hello,
>> > regarding the handling of PPC FPU exceptions and hardfloat support, we
>> > could consider a different approach for different instructions.
>> > Not all FPU instructions care about the inexact or exception bits: if I
>> > take a simple fadd f0,f1,f2, the sum f1+f2 is copied into register f0
>> > and no one checks the inexact or exception bits raised in the FPSCR
>> > register.
>> > If instead I take fadd. f0,f1,f2, the dot after the instruction means I
>> > want to take the inexact or exception bits into account.
>> > So I could use hardfloat for the first case and softfloat for the second.
>> > Could this be a quick way to start implementing hardfloat for PPC?
>>
>> While it may be true that normal software practice is not to read the
>> exception registers for every operation we can't base our emulation on
>> that. We must always be able to re-create the state of the exception
>> registers whenever they may be read by the program. There are 3 cases
>> this may happen:
>>
>> - a direct read of the inexact register
>> - checking the sigcontext of a synchronous exception (e.g. fault)
>> - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)
>>
>> Given the way the translator works we can simplify the asynchronous case
>> because we know they are only ever delivered at the start of translated
>> blocks. We must have a fully rectified system state at the end of every
>> block. So let's consider some cases:
>>
>> fpOpA
>> clear flags
>> fpOpB
>> clear flags
>> fpOpC
>> read flags
>>
> So we only need to honour the clear-flags operation that precedes the last
> fp op executed before the flags are read? Then the key point is to find
> every read-flags op and, for each one, the latest clear-flags op that comes
> before the last fp instruction preceding that read. Can this be expressed
> in TCG ops?
In the simple case where the flags cannot be read within a chain of
operations, this could all be handled in the front end by using a
different set of helpers (or maybe tweaking the helper to handle a NULL
fpst?) when it knows the values won't be needed.
The trouble is scanning forward far enough to know this is the case, as
the decoders currently work by dealing with one instruction at a time.
There are some cases where we use tcg_last_op() to save the location of
an operation and then tcg_set_insn_param() to update a parameter after
the fact. You could save the location of every fpOp with tcg_last_op()
and then go through each one, updating the parameters to the helper to
indicate whether you care about calculating the flags or not.
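The scan-then-patch idea above can be illustrated outside QEMU. What follows is a minimal, self-contained C model and not QEMU code: the Op type, the op kinds, mark_needed_flags() and check_example_patterns() are hypothetical names invented for this sketch. It assumes the fp ops themselves cannot trap, that flags accumulate until an explicit clear, and that the state must be exact at the end of a block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical op kinds for one translated block; not QEMU's real TCG ops. */
typedef enum { OP_FP, OP_CLEAR_FLAGS, OP_READ_FLAGS, OP_LDST } OpKind;

typedef struct {
    OpKind kind;
    bool need_flags;   /* only meaningful for OP_FP; patched by the scan */
} Op;

/*
 * Walk the block backwards. An fp op must compute its flags if something
 * that can observe them (a flags read, a potentially faulting ld/st, or
 * the end of the block) follows with no "clear flags" in between.
 */
static void mark_needed_flags(Op *ops, size_t n)
{
    bool observed = true;              /* state must be exact at block end */
    for (size_t i = n; i-- > 0; ) {
        switch (ops[i].kind) {
        case OP_READ_FLAGS:
        case OP_LDST:
            observed = true;
            break;
        case OP_CLEAR_FLAGS:
            observed = false;          /* earlier flags are wiped here */
            break;
        case OP_FP:
            ops[i].need_flags = observed;
            break;
        }
    }
}

/* fpOpA, clear, fpOpB, clear, fpOpC, read: only fpOpC needs its flags. */
static void check_example_patterns(void)
{
    Op p[] = { {OP_FP, true}, {OP_CLEAR_FLAGS, false}, {OP_FP, true},
               {OP_CLEAR_FLAGS, false}, {OP_FP, false}, {OP_READ_FLAGS, false} };
    mark_needed_flags(p, 6);
    assert(!p[0].need_flags && !p[2].need_flags && p[4].need_flags);
}
```

In a real front end the need_flags field would correspond to a helper parameter rewritten after the fact, which is the step the message suggests doing with tcg_set_insn_param(); how that parameter is encoded is left open here.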
>> Assuming we know the fpOps can't generate exceptions we can know that
>> only fpOpC will ever generate a user visible floating point flags so we
>> can indeed use hardfloat for fpOpA and fpOpB. However if we see the
>> pattern:
>>
>> fpOpA
>> ld/st
>>
> What does ld/st mean? Loads and stores of floating-point values?
Generally any load or store to memory has the potential to fault
regardless of what it is actually storing. There may be other
potentially faulting instructions as well - it will depend on your
architecture.
--
Alex Bennée
* R: R: About hardfloat in ppc
2020-04-29 11:57 ` Alex Bennée
2020-04-29 12:33 ` 罗勇刚(Yonggang Luo)
@ 2020-04-29 14:31 ` Dino Papararo
2020-04-29 14:49 ` Peter Maydell
2020-04-29 18:25 ` R: " Alex Bennée
2020-04-29 23:12 ` R: " 罗勇刚(Yonggang Luo)
2 siblings, 2 replies; 40+ messages in thread
From: Dino Papararo @ 2020-04-29 14:31 UTC (permalink / raw)
To: Alex Bennée
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, luoyonggang,
qemu-ppc, Howard Spoelstra
Hi Alex,
maybe some pseudocode can better show what I mean 😊
if (ppc_fpu_instruction == USE_FPSCR)  /* the instruction has a dot '.', so FPSCR will be updated and we need to take care of it */
    soft_decode(ppc_fpu_instruction);
else  /* the instruction has no dot '.', so FPSCR will never be updated and we don't need to care about it -> max speed */
    hard_decode(ppc_fpu_instruction);
In PPC assembly, every instruction that needs to take care of the inexact flag and/or exception flags is executed before the instructions that test them; look at the following exception handling example:
fadd. f0,f1,f2 # f1 + f2 = f0. CR1 contains except.summary
bta 4,error # if bit 0 of CR1 is set, go to error
# bit 0 is set if any exception occurs
. # if clear, continue operation
.
.
error:
mcrfs 2,1 # copy FPSCR bits 4-7 to CR field 2
# now CR1 and CR2 (bits 6 through 10)
# contain all exception bits from FPSCR
bta 6,invalid # CR bit 6 signals invalid
bta 7,overflow # CR bit 7 signals overflow
bta 8,underflow # CR bit 8 signals underflow
bta 9,divbyzero # CR bit 9 signals divide-by-zero
bta 10,inexact # CR bit 10 signals inexact
invalid:
mcrfs 2,2 # copy FPSCR bits 8-11 to CR field 2
mcrfs 3,3 # copy FPSCR bits 12-15 to CR field 3
mcrfs 4,5 # copy FPSCR bits 20-23 to CR field 4
# invalid bits are now CR bits 11-16 and bit 23
# now do exception handling based on which invalid bit
# is set
overflow:
# do exception handling for overflow exception
underflow:
# do exception handling for underflow exception
divbyzero:
# do exception handling for the divide-by-zero exception
inexact:
# do exception handling for the inexact exception
In this way you can know as soon as possible whether you can go with hardfloat or not.
I leave it to you TCG experts to judge how it works and how to implement it; I'm only trying to explain a possible fast way to go (if at all possible) 😊
The large majority of software doesn't check for exceptions at all, and if I really want to pursue maximum precision I'll go for a software multiprecision library like GMP or MPFR.
So hardfloat 'should' be the first choice, and only if an instruction requires a precision/error check should it be processed in softfloat.
I hope I have added some new ideas to the discussion, thanks a lot Alex!
Dino
* Re: R: About hardfloat in ppc
2020-04-29 14:31 ` R: " Dino Papararo
@ 2020-04-29 14:49 ` Peter Maydell
2020-04-29 18:25 ` R: " Alex Bennée
1 sibling, 0 replies; 40+ messages in thread
From: Peter Maydell @ 2020-04-29 14:49 UTC (permalink / raw)
To: Dino Papararo
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, luoyonggang,
qemu-ppc, Howard Spoelstra, Alex Bennée
On Wed, 29 Apr 2020 at 15:33, Dino Papararo <skizzato73@msn.com> wrote:
>
> Hi Alex,
> maybe some pseudocode can better show what I mean
>
> if (ppc_fpu_instruction == USE_FPSCR)  /* the instruction has a dot '.', so FPSCR will be updated and we need to take care of it */
>     soft_decode(ppc_fpu_instruction);
> else  /* the instruction has no dot '.', so FPSCR will never be updated and we don't need to care about it -> max speed */
>     hard_decode(ppc_fpu_instruction);
My understanding was that the '.' indicates whether
the instruction updates CR1 (the condition register),
which is separate from whether it updates FPSCR
flags. So all insns update FPSCR flags; insns with
a '.' additionally update CR state which can be
tested by a following branch insn. (I'm not a PPC
expert but that's what my reading of the ISA spec is.)
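To make the distinction concrete, here is a small C sketch of that reading of the ISA. It is a toy model, not QEMU code: ToyCpu, toy_retire_fp() and the helper names are hypothetical, and the two instruction words are hand-assembled encodings of fadd f0,f1,f2 and fadd. f0,f1,f2 (primary opcode 63, extended opcode 21) that should be checked against the ISA manual before being relied on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rc is the least significant bit of a 32-bit PowerPC instruction word. */
static bool insn_has_rc(uint32_t insn)
{
    return (insn & 1u) != 0;
}

/* The primary opcode lives in the top six bits. */
static unsigned insn_primary_opcode(uint32_t insn)
{
    return insn >> 26;
}

/* Toy CPU state: the FPSCR plus condition register field 1. */
typedef struct {
    uint32_t fpscr;
    uint8_t cr1;
} ToyCpu;

/*
 * Every FP arithmetic insn updates FPSCR. Rc=1 only adds a copy of
 * FPSCR[0:3] (FX, FEX, VX, OX, the four most significant bits) into CR1.
 */
static void toy_retire_fp(ToyCpu *cpu, uint32_t insn, uint32_t new_fpscr)
{
    cpu->fpscr = new_fpscr;                      /* with or without '.' */
    if (insn_has_rc(insn)) {
        cpu->cr1 = (uint8_t)(new_fpscr >> 28);   /* FX FEX VX OX */
    }
}

/* Hand-assembled words: fadd f0,f1,f2 and fadd. f0,f1,f2 (assumed). */
static void check_rc_model(void)
{
    ToyCpu cpu = { 0, 0 };
    toy_retire_fp(&cpu, 0xFC01102Au, 0x82000000u);   /* FX and XX set */
    assert(cpu.fpscr == 0x82000000u && cpu.cr1 == 0);
    toy_retire_fp(&cpu, 0xFC01102Bu, 0x82000000u);
    assert(cpu.cr1 == 0x8);
}
```

The point of the model is that skipping flag computation based on the dot alone would leave fpscr stale for the undotted form, which a later read of the FPSCR could still observe.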
> In PPC assembly, every instruction that needs to take care of the inexact flag and/or exception flags is executed before the instructions that test them; look at the following exception handling example:
>
> fadd. f0,f1,f2 # f1 + f2 = f0. CR1 contains except.summary
> bta 4,error # if bit 0 of CR1 is set, go to error
> # bit 0 is set if any exception occurs
> . # if clear, continue operation
> .
> .
> error:
> mcrfs 2,1 # copy FPSCR bits 4-7 to CR field 2
> # now CR1 and CR2 (bits 6 through 10)
> # contain all exception bits from FPSCR
This may be a common pattern, but the architecture doesn't
require it. You could equally do
fadd f0,f1,f2 # insn which sets fpscr bits
mffs 30 # copy whole fpscr to a floating-point register (f30)
# now do stuff based on that value
So unless you can tell for certain that nothing in
the future guest execution can read the relevant FPSCR bits
before they're overwritten, you have to generate them
correctly; or be able to re-generate them later, if
you want to get fancy (you could imagine a scheme
similar to how we handle CPU condition flags on
some guests, where instead of calculating them every
time we make a note of what the operation that should
have set them was, so that at the point where the
guest actually does read the fpscr or do something
else that demands the real flag value we can recreate
them, in this case by repeating the fp operation via
softfloat. Getting that working would be a non-trivial
project, though.)
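The lazy scheme described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not how QEMU does or would do it: LazyFlags, lazy_mul() and lazy_read_inexact() are hypothetical names, only fp32 multiplication and only the inexact flag are modelled, inputs are assumed finite and non-NaN, and the deferred recomputation uses an exact double-precision product as a stand-in for re-running the operation through softfloat.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Hypothetical record of the last fp32 multiply whose flags were skipped. */
typedef struct {
    bool pending;      /* true while the flag has not been recomputed */
    float a, b;        /* saved inputs of the deferred operation */
    bool inexact;      /* valid only when pending is false */
} LazyFlags;

/* Fast path: host multiply, remember the inputs instead of the flags. */
static float lazy_mul(LazyFlags *lf, float a, float b)
{
    lf->pending = true;
    lf->a = a;
    lf->b = b;
    return a * b;
}

/* Slow path, run only when the guest actually observes the flag. */
static bool lazy_read_inexact(LazyFlags *lf)
{
    if (lf->pending) {
        /* Two 24-bit significands multiply exactly within double's 53 bits. */
        double exact = (double)lf->a * (double)lf->b;
        bool overflow = exact > FLT_MAX || exact < -FLT_MAX;
        /* Inexact if the fp32 result overflows or rounding changed the value. */
        lf->inexact = overflow || (double)(float)exact != exact;
        lf->pending = false;
    }
    return lf->inexact;
}

static void check_lazy_flags(void)
{
    LazyFlags lf = { false, 0.0f, 0.0f, false };
    float r = lazy_mul(&lf, 1.5f, 2.0f);         /* 3.0 is representable */
    assert(r == 3.0f && lf.pending);
    assert(!lazy_read_inexact(&lf) && !lf.pending);
    lazy_mul(&lf, 0.1f, 0.1f);                   /* product needs rounding */
    assert(lazy_read_inexact(&lf));
}
```

A real implementation would also have to save the rounding mode and cover every flag and operation, which is part of why the message calls this a non-trivial project.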
thanks
-- PMM
* Re: R: R: About hardfloat in ppc
2020-04-29 14:31 ` R: " Dino Papararo
2020-04-29 14:49 ` Peter Maydell
@ 2020-04-29 18:25 ` Alex Bennée
2020-04-30 0:20 ` 罗勇刚(Yonggang Luo)
1 sibling, 1 reply; 40+ messages in thread
From: Alex Bennée @ 2020-04-29 18:25 UTC (permalink / raw)
To: Dino Papararo
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, luoyonggang,
qemu-ppc, Howard Spoelstra
Dino Papararo <skizzato73@msn.com> writes:
> Hi Alex,
<snip>
>
> I leave it to you TCG experts to judge how it works and how to implement
> it; I'm only trying to explain a possible fast way to go (if at all
> possible) 😊
This is all a theoretical discussion unless someone cares enough to
improve the situation. While I have an interest in improving TCG
performance I'm afraid there are many easier wins to pursue before
tackling a target-specific hack with which I'm not familiar. No doubt
this thread will be referred to next time someone wants something done
about it.
> The large majority of software doesn't check for exceptions at all, and
> if I really want to pursue maximum precision I'll go for a software
> multiprecision library like GMP or MPFR.
However for QEMU we regard failure to correctly emulate the architecture
as a bug - we don't code to common software patterns because there is
plenty of software out there that doesn't follow them.
> So hardfloat 'should' be the first choice, and only if an instruction
> requires a precision/error check should it be processed in softfloat.
Sure but someone will have to do the work to support that.
--
Alex Bennée
* Re: R: R: About hardfloat in ppc
2020-04-29 18:25 ` R: " Alex Bennée
@ 2020-04-30 0:20 ` 罗勇刚(Yonggang Luo)
2020-04-30 2:18 ` Richard Henderson
0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-30 0:20 UTC (permalink / raw)
To: Alex Bennée
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Dino Papararo
[-- Attachment #1: Type: text/plain, Size: 1740 bytes --]
A question about hard-float, for the case where we don't want to read the
fp register.
For example, suppose we want to compute c = a + b in fp32:
first compute c = a + b in hard float,
then compute b1 = c - a in hard float;
if b1 != b at the bitwise level, then we set the inexact bit to 1, otherwise
we set the inexact bit to 0. Is this valid?
We could do the same for a * b, a - b, a / b.
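The check described above can be sketched in plain C as follows. This is only an illustration: `f32_bits` and `add_inexact_guess` are hypothetical names, not QEMU functions, and the sketch assumes the host rounds to nearest and evaluates `float` arithmetic in single precision. One caveat worth keeping in mind: undoing the addition is, as far as I can tell, only reliable when |a| >= |b| (the condition of the classic Fast2Sum trick); with the operands the other way round an inexact sum can go unnoticed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits so the comparison is bitwise, not numeric. */
static uint32_t f32_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}

/*
 * Hypothetical helper (not QEMU code): add a and b on the host and guess
 * the inexact flag by undoing the addition, as proposed above.
 * volatile keeps the compiler from folding or widening the arithmetic.
 */
static bool add_inexact_guess(float a, float b, float *result)
{
    volatile float c = a + b;    /* hard-float result */
    volatile float b1 = c - a;   /* try to recover b from the result */

    *result = c;
    return f32_bits(b1) != f32_bits(b);
}
```

With a = 1.0f and b = 0x1p-30f the sum rounds to 1.0f and the helper reports inexact, as hoped. With the operands swapped (a = 0x1p-30f, b = 1.0f) the sum is just as inexact, but c - a rounds back to 1.0f, so the helper reports exact. A real implementation would therefore need to order the operands by magnitude or use a different test.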
On Thu, Apr 30, 2020 at 2:25 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Dino Papararo <skizzato73@msn.com> writes:
>
> > Hi Alex,
> <snip>
> >
> > I leave to you TCG's experts how it works and how to implement it, I'm
> > only tryng to explain a possible fast way to go (if ever possible) 😊
>
> This is all a theoretical discussion unless someone cares enough to
> improve the situation. While I have an interest in improving TCG
> performance I'm afraid there are many more easier wins before tackling a
> target specific hack for which I'm not familiar. No doubt this thread
> will be referred to next time someone wants something done about it.
>
> > ..Large majority of software don't check for exceptions at all and if
> > I really want to pursue max precision I'll go for a software
> > multiprecision library like GMP or MPFR Libraries.
>
> However for QEMU we regard failure to correctly emulate the architecture
> as a bug - we don't code to common software patterns because there is
> plenty of software out there that doesn't follow it.
>
> > So the hardfloats 'should' be set as first choice and only if
> > instruction requires precision/error check process it in softfloats.
>
> Sure but someone will have to do the work to support that.
>
> --
> Alex Bennée
>
--
此致
礼
罗勇刚
Yours
sincerely,
Yonggang Luo
* Re: R: R: About hardfloat in ppc
2020-04-30 0:20 ` 罗勇刚(Yonggang Luo)
@ 2020-04-30 2:18 ` Richard Henderson
2020-04-30 7:26 ` 罗勇刚(Yonggang Luo)
2020-04-30 8:13 ` 罗勇刚(Yonggang Luo)
0 siblings, 2 replies; 40+ messages in thread
From: Richard Henderson @ 2020-04-30 2:18 UTC (permalink / raw)
To: luoyonggang, Alex Bennée
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Dino Papararo
On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
> Question, in hard-float, if we don't want to read the fp register.
> for example: If we wanna compute c = a + b in fp32
> if c = a + b In hard float
> and if b1 = c - a in hard float
> if b1 != b at bitwise level, the we se the inexat to 1, otherwsie
> we set inexat bit to 0? are this valid?
>
> we can also do it for a * b, a - b, a / b.
>
That does seem plausible, for all of the normal values for which we would apply
the hard-float optimization anyway. But we already check for the exceptional
cases:
    if (unlikely(f32_is_inf(ur))) {
        s->float_exception_flags |= float_flag_overflow;
    } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
        if (post == NULL || post(ua, ub)) {
            goto soft;
        }
    }
r~
* Re: R: R: About hardfloat in ppc
2020-04-30 2:18 ` Richard Henderson
@ 2020-04-30 7:26 ` 罗勇刚(Yonggang Luo)
2020-04-30 8:11 ` Alex Bennée
2020-04-30 8:13 ` 罗勇刚(Yonggang Luo)
1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-30 7:26 UTC (permalink / raw)
To: Richard Henderson
Cc: Dino Papararo, Mark Cave-Ayland, qemu-devel, Programmingkid,
qemu-ppc, Howard Spoelstra, Alex Bennée
[-- Attachment #1: Type: text/plain, Size: 1155 bytes --]
On Thu, Apr 30, 2020 at 10:18 AM Richard Henderson <
richard.henderson@linaro.org> wrote:
> On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
> > Question, in hard-float, if we don't want to read the fp register.
> > for example: If we wanna compute c = a + b in fp32
> > if c = a + b In hard float
> > and if b1 = c - a in hard float
> > if b1 != b at bitwise level, the we se the inexat to 1, otherwsie
> > we set inexat bit to 0? are this valid?
> >
> > we can also do it for a * b, a - b, a / b.
> >
>
> That does seem plausible, for all of the normal values for which we would
> apply
> the hard-float optimization anyway. But we already check for the
> exceptional
> cases:
>
> if (unlikely(f32_is_inf(ur))) {
> s->float_exception_flags |= float_flag_overflow;
> } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
> if (post == NULL || post(ua, ub)) {
> goto soft;
> }
> }
>
I mean removing all of these exceptional-case checks and detecting float
exceptions via the hard-float operation itself.
>
> r~
>
--
此致
礼
罗勇刚
Yours
sincerely,
Yonggang Luo
* Re: R: R: About hardfloat in ppc
2020-04-30 7:26 ` 罗勇刚(Yonggang Luo)
@ 2020-04-30 8:11 ` Alex Bennée
0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-04-30 8:11 UTC (permalink / raw)
To: luoyonggang
Cc: Richard Henderson, Mark Cave-Ayland, qemu-devel, Programmingkid,
qemu-ppc, Howard Spoelstra, Dino Papararo
罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> On Thu, Apr 30, 2020 at 10:18 AM Richard Henderson <
> richard.henderson@linaro.org> wrote:
>
>> On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
>> > Question, in hard-float, if we don't want to read the fp register.
>> > for example: If we wanna compute c = a + b in fp32
>> > if c = a + b In hard float
>> > and if b1 = c - a in hard float
>> > if b1 != b at bitwise level, the we se the inexat to 1, otherwsie
>> > we set inexat bit to 0? are this valid?
>> >
>> > we can also do it for a * b, a - b, a / b.
>> >
>>
>> That does seem plausible, for all of the normal values for which we would
>> apply
>> the hard-float optimization anyway. But we already check for the
>> exceptional
>> cases:
>>
>> if (unlikely(f32_is_inf(ur))) {
>> s->float_exception_flags |= float_flag_overflow;
>> } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
>> if (post == NULL || post(ua, ub)) {
>> goto soft;
>> }
>> }
>>
> I mean removing all of these exceptional-case checks and detecting float
> exceptions via the hard-float operation itself.
When this was originally done, it was found to be faster to test for the
float conditions in software (which is basically bitops) than to read
the FP exception register, which can be a high-latency operation.
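To illustrate what "basically bitops" means here, the following is a minimal sketch of classifying a single-precision result from its bit pattern alone, without touching any FP status register. The function names are made up for this example and the two conditions are merged into one predicate for brevity; in the snippet quoted above, infinity sets the overflow flag while tiny results fall back to the soft path.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copy out the IEEE 754 single-precision bit pattern of a float. */
static uint32_t f32_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}

/* +/-infinity: exponent all ones, fraction zero (sign bit masked off). */
static bool bits_is_inf(uint32_t u)
{
    return (u & 0x7fffffffu) == 0x7f800000u;
}

/* |x| <= FLT_MIN: zero, a subnormal, or the smallest normal number. */
static bool bits_is_tiny(uint32_t u)
{
    return (u & 0x7fffffffu) <= 0x00800000u;
}

/* Hypothetical predicate: does this host result need special handling? */
static bool result_needs_slow_path(float r)
{
    uint32_t u = f32_to_bits(r);

    return bits_is_inf(u) || bits_is_tiny(u);
}
```

Each test is a mask and an integer compare, which is why it can beat a read of the host status register on hosts where that read is slow.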
--
Alex Bennée
* Re: R: R: About hardfloat in ppc
2020-04-30 2:18 ` Richard Henderson
2020-04-30 7:26 ` 罗勇刚(Yonggang Luo)
@ 2020-04-30 8:13 ` 罗勇刚(Yonggang Luo)
2020-04-30 15:35 ` BALATON Zoltan
1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-30 8:13 UTC (permalink / raw)
To: Richard Henderson
Cc: Dino Papararo, Mark Cave-Ayland, qemu-devel, Programmingkid,
qemu-ppc, Howard Spoelstra, Alex Bennée
[-- Attachment #1: Type: text/plain, Size: 1912 bytes --]
I propose a new way of computing the float flags.
We keep a cache of floating-point computations:
typedef struct FpRecord {
    uint8_t op;
    float32 A;
    float32 B;
} FpRecord;
FpRecord fp_cache[1024];
int fp_cache_length;
uint32_t fp_exceptions;
1. For each new fp operation, we push it onto fp_cache.
2. When fp_exceptions is read, we re-compute fp_exceptions by
   re-running the recorded FpRecord sequence, and then reset
   fp_cache_length to 0.
3. When fp_exceptions is cleared, we set fp_cache_length to 0 and
   clear fp_exceptions.
4. When fp_cache is full, we re-compute fp_exceptions by re-running
   the recorded FpRecord sequence.
Would this be a general method for using hard-float?
The time consumed should be 2 * hard_float.
Considering that reads of fp_exceptions are rare, the amortized time
complexity would be 1 * hard_float.
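The four steps above could look roughly like the sketch below. It is a toy model under several assumptions, not a patch: it uses host `float` instead of QEMU's `float32`, supports only add and multiply, tracks only an illustrative inexact bit, and uses a small software check (`fp_soft_flags`) as a stand-in for re-running the operation through softfloat. All names other than `FpRecord` are invented for the example, and finite operands with round-to-nearest are assumed.

```c
#include <stdint.h>

#define FP_CACHE_SIZE   1024
#define FP_FLAG_INEXACT 1u             /* illustrative flag bit */

enum { FP_OP_ADD, FP_OP_MUL };         /* only two ops, to keep this short */

typedef struct FpRecord {
    uint8_t op;
    float a;
    float b;
} FpRecord;

typedef struct FpCache {
    FpRecord rec[FP_CACHE_SIZE];
    int length;
    uint32_t exceptions;
} FpCache;

/* Fast path: plain host arithmetic, no flag tracking at all. */
static float fp_host_op(uint8_t op, float a, float b)
{
    volatile float r = (op == FP_OP_ADD) ? a + b : a * b;
    return r;
}

/* Slow path standing in for softfloat: was the result of this op rounded? */
static uint32_t fp_soft_flags(uint8_t op, float a, float b)
{
    if (op == FP_OP_ADD) {
        /* Knuth's TwoSum: err is the exact rounding error of a + b. */
        volatile float s = a + b;
        volatile float bv = s - a;
        volatile float av = s - bv;
        volatile float eb = b - bv;
        volatile float ea = a - av;
        volatile float err = ea + eb;
        return err != 0.0f ? FP_FLAG_INEXACT : 0;
    }
    /* A product of two floats is exact in double, so compare against it. */
    volatile float r = a * b;
    return (double)r != (double)a * (double)b ? FP_FLAG_INEXACT : 0;
}

/* Steps 2 and 4: re-run the recorded ops, fold in their flags, empty cache. */
static void fp_cache_replay(FpCache *c)
{
    for (int i = 0; i < c->length; i++) {
        c->exceptions |= fp_soft_flags(c->rec[i].op, c->rec[i].a, c->rec[i].b);
    }
    c->length = 0;
}

/* Step 1: compute on the host and only record the operands. */
static float fp_cache_exec(FpCache *c, uint8_t op, float a, float b)
{
    if (c->length == FP_CACHE_SIZE) {
        fp_cache_replay(c);            /* step 4: cache full */
    }
    c->rec[c->length++] = (FpRecord){ op, a, b };
    return fp_host_op(op, a, b);
}

/* Step 2: a read of the flags forces the replay. */
static uint32_t fp_cache_read_exceptions(FpCache *c)
{
    fp_cache_replay(c);
    return c->exceptions;
}

/* Step 3: clearing the flags also discards the pending records. */
static void fp_cache_clear_exceptions(FpCache *c)
{
    c->length = 0;
    c->exceptions = 0;
}
```

Note that every operation pays for a store of three fields even if the flags are never read, so the real cost depends on how that store compares with the softfloat path it replaces.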
On Thu, Apr 30, 2020 at 10:18 AM Richard Henderson <
richard.henderson@linaro.org> wrote:
> On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
> > Question, in hard-float, if we don't want to read the fp register.
> > for example: If we wanna compute c = a + b in fp32
> > if c = a + b In hard float
> > and if b1 = c - a in hard float
> > if b1 != b at bitwise level, the we se the inexat to 1, otherwsie
> > we set inexat bit to 0? are this valid?
> >
> > we can also do it for a * b, a - b, a / b.
> >
>
> That does seem plausible, for all of the normal values for which we would
> apply
> the hard-float optimization anyway. But we already check for the
> exceptional
> cases:
>
> if (unlikely(f32_is_inf(ur))) {
> s->float_exception_flags |= float_flag_overflow;
> } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
> if (post == NULL || post(ua, ub)) {
> goto soft;
> }
> }
>
>
> r~
>
--
此致
礼
罗勇刚
Yours
sincerely,
Yonggang Luo
* Re: R: R: About hardfloat in ppc
2020-04-30 8:13 ` 罗勇刚(Yonggang Luo)
@ 2020-04-30 15:35 ` BALATON Zoltan
2020-04-30 16:34 ` R: " Dino Papararo
0 siblings, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-04-30 15:35 UTC (permalink / raw)
To: 罗勇刚(Yonggang Luo)
Cc: Alex Bennée, Richard Henderson, qemu-devel, Programmingkid,
qemu-ppc, Howard Spoelstra, Dino Papararo
[-- Attachment #1: Type: text/plain, Size: 2328 bytes --]
On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> I propose a new way of computing the float flags.
> We keep a cache of floating-point computations:
> typedef struct FpRecord {
>     uint8_t op;
>     float32 A;
>     float32 B;
> } FpRecord;
> FpRecord fp_cache[1024];
> int fp_cache_length;
> uint32_t fp_exceptions;
>
> 1. For each new fp operation, we push it onto fp_cache.
> 2. When fp_exceptions is read, we re-compute fp_exceptions by
>    re-running the recorded FpRecord sequence, and then reset
>    fp_cache_length to 0.
> 3. When fp_exceptions is cleared, we set fp_cache_length to 0 and
>    clear fp_exceptions.
> 4. When fp_cache is full, we re-compute fp_exceptions by re-running
>    the recorded FpRecord sequence.
>
> Would this be a general method for using hard-float?
> The time consumed should be 2 * hard_float.
> Considering that reads of fp_exceptions are rare, the amortized time
> complexity would be 1 * hard_float.
It's hard to guess what the hit rate of such a cache would be, and if it's
low then managing the cache is probably more expensive than running with
softfloat. So to evaluate any proposed patch we also need some benchmarks
that we can experiment with to tell whether the results are good;
otherwise we're just guessing. Are there some existing tests and
benchmarks that we can use? Alex mentioned fp-bench, I think, and for
evaluating the correctness of the FP implementation I've seen this other
conversation:
https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html
Is that something we can use for PPC as well to check the correctness?
So I think before implementing any potential solution that came up in this
brainstorming the first step would be to get and compile (or write if not
available) some tests and benchmarks:
1. tests of host behaviour for inexact, to compare across different archs
2. some FP tests that can be used to compare results between QEMU and a
   real CPU to check correctness of emulation (if these check for inexact
   differences then they could be used instead of 1.)
3. some benchmarks to evaluate QEMU performance (these could be the same
   as the FP tests or some real-world FP-heavy applications).
Then we can see if the proposed solution is faster and still correct.
Regards,
BALATON Zoltan
* R: R: R: About hardfloat in ppc
2020-04-30 15:35 ` BALATON Zoltan
@ 2020-04-30 16:34 ` Dino Papararo
2020-05-01 1:59 ` Programmingkid
0 siblings, 1 reply; 40+ messages in thread
From: Dino Papararo @ 2020-04-30 16:34 UTC (permalink / raw)
To: BALATON Zoltan, 罗勇刚(Yonggang Luo)
Cc: Richard Henderson, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Alex Bennée
Maybe the fastest way to implement hardfloat for ppc would be to run FP instructions in hardfloat by default, until some instruction requests the FPSCR register.
At that point we probably want to check for some exception, so QEMU could go back to the last FPU instruction executed and re-execute it in softfloat, this time taking care of the FPSCR flags, then continue in hardfloat until another instruction looks at the FPSCR register, and so on.
Dino
-----Original Message-----
From: BALATON Zoltan <balaton@eik.bme.hu>
Sent: Thursday, 30 April 2020 17:36
To: 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>; Dino Papararo <skizzato73@msn.com>; qemu-devel@nongnu.org; Programmingkid <programmingkidx@gmail.com>; qemu-ppc@nongnu.org; Howard Spoelstra <hsp.cat7@gmail.com>; Alex Bennée <alex.bennee@linaro.org>
Subject: Re: R: R: About hardfloat in ppc
On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> I propose a new way of computing the float flags.
> We keep a cache of floating-point computations:
> typedef struct FpRecord {
>     uint8_t op;
>     float32 A;
>     float32 B;
> } FpRecord;
> FpRecord fp_cache[1024];
> int fp_cache_length;
> uint32_t fp_exceptions;
>
> 1. For each new fp operation, we push it onto fp_cache.
> 2. When fp_exceptions is read, we re-compute fp_exceptions by
>    re-running the recorded FpRecord sequence, and then reset
>    fp_cache_length to 0.
> 3. When fp_exceptions is cleared, we set fp_cache_length to 0 and
>    clear fp_exceptions.
> 4. When fp_cache is full, we re-compute fp_exceptions by re-running
>    the recorded FpRecord sequence.
>
> Would this be a general method for using hard-float?
> The time consumed should be 2 * hard_float.
> Considering that reads of fp_exceptions are rare, the amortized time
> complexity would be 1 * hard_float.
It's hard to guess what the hit rate of such cache would be and if it's low then managing the cache is probably more expensive than running with softfloat. So to evaluate any proposed patch we also need some benchmarks which we can experiment with to tell if the results are good or not otherwise we're just guessing. Are there some existing tests and benchmarks that we can use? Alex mentioned fp-bench I think and to evaluate the correctness of the FP implementation I've seen this other
conversation:
https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html
Is that something we can use for PPC as well to check the correctness?
So I think before implementing any potential solution that came up in this brainstorming the first step would be to get and compile (or write if not available) some tests and benchmarks:
1. testing host behaviour for inexact and compare that for different archs
2. some FP tests that can be used to compare results with QEMU and real CPU to check correctness of emulation (if these check for inexact differences then could be used instead of 1.)
3. some benchmarks to evaluate QEMU performance (these could be same as FP tests or some real world FP heavy applications).
Then we can see if the proposed solution is faster and still correct.
Regards,
BALATON Zoltan
* Re: About hardfloat in ppc
2020-04-30 16:34 ` R: " Dino Papararo
@ 2020-05-01 1:59 ` Programmingkid
2020-05-01 2:21 ` 罗勇刚(Yonggang Luo)
0 siblings, 1 reply; 40+ messages in thread
From: Programmingkid @ 2020-05-01 1:59 UTC (permalink / raw)
To: Dino Papararo
Cc: Richard Henderson, qemu-devel,
"罗勇刚(Yonggang Luo)",
qemu-ppc, Howard Spoelstra, Alex Bennée
> On Apr 30, 2020, at 12:34 PM, Dino Papararo <skizzato73@msn.com> wrote:
>
> Maybe the fastest way to implement hardfloat for ppc would be to run FP instructions in hardfloat by default, until some instruction requests the FPSCR register.
> At that point we probably want to check for some exception, so QEMU could go back to the last FPU instruction executed and re-execute it in softfloat, this time taking care of the FPSCR flags, then continue in hardfloat until another instruction looks at the FPSCR register, and so on.
>
> Dino
That sounds like a good idea.
> -----Original Message-----
> From: BALATON Zoltan <balaton@eik.bme.hu>
> Sent: Thursday, 30 April 2020 17:36
> To: 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>; Dino Papararo <skizzato73@msn.com>; qemu-devel@nongnu.org; Programmingkid <programmingkidx@gmail.com>; qemu-ppc@nongnu.org; Howard Spoelstra <hsp.cat7@gmail.com>; Alex Bennée <alex.bennee@linaro.org>
> Subject: Re: R: R: About hardfloat in ppc
>
> On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
>> I propose a new way to computing the float flags, We preserve a float
>> computing cash typedef struct FpRecord { uint8_t op;
>> float32 A;
>> float32 B;
>> } FpRecord;
>> FpRecord fp_cache[1024];
>> int fp_cache_length;
>> uint32_t fp_exceptions;
>>
>> 1. For each new fp operation we push it to the fp_cache, 2. Once we
>> read the fp_exceptions , then we re-compute the fp_exceptions by
>> re-running the fp FpRecord sequence.
>> and clear fp_cache_length.
>> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0
>> and clear fp_exceptions.
>> 4. If the fp_cache are full, then we re-compute the fp_exceptions by
>> re-running the fp FpRecord sequence.
>>
>> Would this be a general method to use hard-float?
>> The consued time should be 2*hard_float.
>> Considerating read fp_exceptions are rare, then the amortized time
>> complexity would be 1 * hard_float.
>
> It's hard to guess what the hit rate of such cache would be and if it's low then managing the cache is probably more expensive than running with softfloat. So to evaluate any proposed patch we also need some benchmarks which we can experiment with to tell if the results are good or not otherwise we're just guessing. Are there some existing tests and benchmarks that we can use? Alex mentioned fp-bench I think and to evaluate the correctness of the FP implementation I've seen this other
> conversation:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
> https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html
>
> Is that something we can use for PPC as well to check the correctness?
>
> So I think before implementing any potential solution that came up in this brainstorming the first step would be to get and compile (or write if not
> available) some tests and benchmarks:
>
> 1. testing host behaviour for inexact and compare that for different archs 2. some FP tests that can be used to compare results with QEMU and real CPU to check correctness of emulation (if these check for inexact differences then could be used instead of 1.) 3. some benchmarks to evaluate QEMU performance (these could be same as FP tests or some real world FP heavy applications).
>
> Then we can see if the proposed solution is faster and still correct.
>
> Regards,
> BALATON Zoltan
* Re: About hardfloat in ppc
2020-05-01 1:59 ` Programmingkid
@ 2020-05-01 2:21 ` 罗勇刚(Yonggang Luo)
2020-05-01 11:58 ` BALATON Zoltan
0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 2:21 UTC (permalink / raw)
To: Programmingkid
Cc: Alex Bennée, Richard Henderson, qemu-devel, qemu-ppc,
Howard Spoelstra, Dino Papararo
[-- Attachment #1: Type: text/plain, Size: 4633 bytes --]
That's what I suggested.
We keep a cache of floating-point computations:
typedef struct FpRecord {
    uint8_t op;
    float32 A;
    float32 B;
} FpRecord;
FpRecord fp_cache[1024];
int fp_cache_length;
uint32_t fp_exceptions;
1. For each new fp operation, we push it onto fp_cache.
2. When fp_exceptions is read, we re-compute fp_exceptions by
   re-running the recorded FpRecord sequence, and then reset
   fp_cache_length to 0.
3. When fp_exceptions is cleared, we set fp_cache_length to 0 and
   clear fp_exceptions.
4. When fp_cache is full, we re-compute fp_exceptions by re-running
   the recorded FpRecord sequence.
Now the key point is how to track reads and writes of the FPSCR register.
The current code is:
    cpu_fpscr = tcg_global_mem_new(cpu_env,
                                   offsetof(CPUPPCState, fpscr), "fpscr");
On Fri, May 1, 2020 at 9:59 AM Programmingkid <programmingkidx@gmail.com>
wrote:
>
> > On Apr 30, 2020, at 12:34 PM, Dino Papararo <skizzato73@msn.com> wrote:
> >
> > Maybe the fastest way to implement hardfloats for ppc could be run them
> by default and until some fpu instruction request for FPSCR register.
> > At this time probably we want to check for some exception.. so QEMU
> could come back to last fpu instruction executed and re-execute it in
> softfloat taking care this time of FPSCR flags, then continue in hardfloats
> unitl another instruction looking for FPSCR register and so on..
> >
> > Dino
>
> That sounds like a good idea.
>
> > -----Original Message-----
> > From: BALATON Zoltan <balaton@eik.bme.hu>
> > Sent: Thursday, 30 April 2020 17:36
> > To: 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com>
> > Cc: Richard Henderson <richard.henderson@linaro.org>; Dino Papararo <
> skizzato73@msn.com>; qemu-devel@nongnu.org; Programmingkid <
> programmingkidx@gmail.com>; qemu-ppc@nongnu.org; Howard Spoelstra <
> hsp.cat7@gmail.com>; Alex Bennée <alex.bennee@linaro.org>
> > Subject: Re: R: R: About hardfloat in ppc
> >
> > On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> >> I propose a new way to computing the float flags, We preserve a float
> >> computing cash typedef struct FpRecord { uint8_t op;
> >> float32 A;
> >> float32 B;
> >> } FpRecord;
> >> FpRecord fp_cache[1024];
> >> int fp_cache_length;
> >> uint32_t fp_exceptions;
> >>
> >> 1. For each new fp operation we push it to the fp_cache, 2. Once we
> >> read the fp_exceptions , then we re-compute the fp_exceptions by
> >> re-running the fp FpRecord sequence.
> >> and clear fp_cache_length.
> >> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0
> >> and clear fp_exceptions.
> >> 4. If the fp_cache are full, then we re-compute the fp_exceptions by
> >> re-running the fp FpRecord sequence.
> >>
> >> Would this be a general method to use hard-float?
> >> The consued time should be 2*hard_float.
> >> Considerating read fp_exceptions are rare, then the amortized time
> >> complexity would be 1 * hard_float.
> >
> > It's hard to guess what the hit rate of such cache would be and if it's
> low then managing the cache is probably more expensive than running with
> softfloat. So to evaluate any proposed patch we also need some benchmarks
> which we can experiment with to tell if the results are good or not
> otherwise we're just guessing. Are there some existing tests and benchmarks
> that we can use? Alex mentioned fp-bench I think and to evaluate the
> correctness of the FP implementation I've seen this other
> > conversation:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
> > https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html
> >
> > Is that something we can use for PPC as well to check the correctness?
> >
> > So I think before implementing any potential solution that came up in
> this brainstorming the first step would be to get and compile (or write if
> not
> > available) some tests and benchmarks:
> >
> > 1. testing host behaviour for inexact and compare that for different
> archs 2. some FP tests that can be used to compare results with QEMU and
> real CPU to check correctness of emulation (if these check for inexact
> differences then could be used instead of 1.) 3. some benchmarks to
> evaluate QEMU performance (these could be same as FP tests or some real
> world FP heavy applications).
> >
> > Then we can see if the proposed solution is faster and still correct.
> >
> > Regards,
> > BALATON Zoltan
>
>
--
此致
礼
罗勇刚
Yours
sincerely,
Yonggang Luo
* Re: About hardfloat in ppc
2020-05-01 2:21 ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 11:58 ` BALATON Zoltan
2020-05-01 12:04 ` 罗勇刚(Yonggang Luo)
0 siblings, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-05-01 11:58 UTC (permalink / raw)
To: 罗勇刚(Yonggang Luo)
Cc: Dino Papararo, Richard Henderson, qemu-devel, Programmingkid,
qemu-ppc, Howard Spoelstra, Alex Bennée
[-- Attachment #1: Type: text/plain, Size: 1752 bytes --]
On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> That's what I suggested,
> We preserve a float computing cache
> typedef struct FpRecord {
> uint8_t op;
> float32 A;
> float32 B;
> } FpRecord;
> FpRecord fp_cache[1024];
> int fp_cache_length;
> uint32_t fp_exceptions;
>
> 1. For each new fp operation we push it to the fp_cache,
> 2. Once we read the fp_exceptions , then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.
> and clear fp_cache_length.
Why do you need to store more than the last fp op? The cumulative bits can
be tracked as is done for other targets, by not clearing fp_status; then
you can read them from there. Only the non-sticky FI bit needs to be
computed, but that is determined only by the last op, so it's enough to
remember that op and re-run it with softfloat (or even hardfloat after
clearing status, but softfloat may be faster for this) to get the bits for
the last op when the status is read.
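A rough model of the "remember only the last op" idea is sketched below. It makes several assumptions: the names are invented, it uses host `float` rather than QEMU's `float32`, it covers only add and multiply, and a small software check stands in for re-running the operation through softfloat. The sticky bits are assumed to be tracked elsewhere and are left out.

```c
#include <stdbool.h>
#include <stdint.h>

enum { FP_OP_ADD, FP_OP_MUL };    /* illustrative subset of operations */

/* Hypothetical per-CPU record of the most recent FP operation. */
typedef struct LastFpOp {
    uint8_t op;
    float a;
    float b;
    bool valid;
} LastFpOp;

/* Fast path: compute on the host and remember only this one operation. */
static float fp_exec_hard(LastFpOp *last, uint8_t op, float a, float b)
{
    volatile float r = (op == FP_OP_ADD) ? a + b : a * b;

    *last = (LastFpOp){ .op = op, .a = a, .b = b, .valid = true };
    return r;
}

/*
 * Slow path, run only when the status is read: was the last result
 * rounded? This stands in for re-running the op through softfloat and
 * assumes finite operands and round-to-nearest.
 */
static bool fp_last_op_inexact(const LastFpOp *last)
{
    float a = last->a, b = last->b;

    if (!last->valid) {
        return false;
    }
    if (last->op == FP_OP_ADD) {
        /* Knuth's TwoSum: err is the exact rounding error of a + b. */
        volatile float s = a + b;
        volatile float bv = s - a;
        volatile float av = s - bv;
        volatile float eb = b - bv;
        volatile float ea = a - av;
        volatile float err = ea + eb;
        return err != 0.0f;
    }
    /* A product of two floats is exact in double, so compare against it. */
    volatile float r = a * b;
    return (double)r != (double)a * (double)b;
}
```

Because only one record is kept, an exact operation that follows an inexact one makes `fp_last_op_inexact` return false again, which matches the non-sticky behaviour described for FI.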
> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
> clear fp_exceptions.
> 4. If the fp_cache are full, then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.
All this cache management and more than one element seems unnecessary to
me although I may be missing something.
> Now the keypoint is how to tracking the read and write of FPSCR register,
> The current code are
> cpu_fpscr = tcg_global_mem_new(cpu_env,
> offsetof(CPUPPCState, fpscr), "fpscr");
Maybe you could search for where the value is read, which should be the
places where we need to handle it, but changes may be needed to make a
clear API for this between target/ppc, TCG and softfloat, which likely
does not exist yet.
Regards,
BALATON Zoltan
* Re: About hardfloat in ppc
2020-05-01 11:58 ` BALATON Zoltan
@ 2020-05-01 12:04 ` 罗勇刚(Yonggang Luo)
2020-05-01 13:10 ` Alex Bennée
0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 12:04 UTC (permalink / raw)
To: BALATON Zoltan
Cc: Dino Papararo, Richard Henderson, qemu-devel, Programmingkid,
qemu-ppc, Howard Spoelstra, Alex Bennée
[-- Attachment #1: Type: text/plain, Size: 2240 bytes --]
On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> > That's what I suggested,
> > We preserve a float computing cache
> > typedef struct FpRecord {
> > uint8_t op;
> > float32 A;
> > float32 B;
> > } FpRecord;
> > FpRecord fp_cache[1024];
> > int fp_cache_length;
> > uint32_t fp_exceptions;
> >
> > 1. For each new fp operation we push it to the fp_cache,
> > 2. Once we read the fp_exceptions , then we re-compute
> > the fp_exceptions by re-running the fp FpRecord sequence.
> > and clear fp_cache_length.
>
> Why do you need to store more than the last fp op? The cumulative bits can
> be tracked like it's done for other targets by not clearing fp_status then
> you can read it from there. Only the non-sticky FI bit needs to be
> computed but that's only determined by the last op so it's enough to
> remember that and run that with softfloat (or even hardfloat after
> clearing status but softfloat may be faster for this) to get the bits for
> last op when status is read.
>
Yep, storing only the last fp op is also an option. Do you mean that we
store the last fp op and calculate it when necessary? I am thinking about
a general fp optimization method that suits all targets.
>
> > 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
> > clear fp_exceptions.
> > 4. If the fp_cache are full, then we re-compute
> > the fp_exceptions by re-running the fp FpRecord sequence.
>
> All this cache management and more than one element seems unnecessary to
> me although I may be missing something.
>
> > Now the keypoint is how to tracking the read and write of FPSCR register,
> > The current code are
> > cpu_fpscr = tcg_global_mem_new(cpu_env,
> > offsetof(CPUPPCState, fpscr), "fpscr");
>
> Maybe you could search where the value is read which should be the places
> where we need to handle it but changes may be needed to make a clear API
> for this between target/ppc, TCG and softfloat which likely does not
> exist yet.
>
> Regards,
> BALATON Zoltan
--
此致
礼
罗勇刚
Yours
sincerely,
Yonggang Luo
* Re: About hardfloat in ppc
2020-05-01 12:04 ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 13:10 ` Alex Bennée
2020-05-01 13:39 ` BALATON Zoltan
2020-05-01 14:18 ` Richard Henderson
0 siblings, 2 replies; 40+ messages in thread
From: Alex Bennée @ 2020-05-01 13:10 UTC (permalink / raw)
To: luoyonggang
Cc: Richard Henderson, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Dino Papararo
罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>
>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>> > That's what I suggested,
>> > We preserve a float computing cache
>> > typedef struct FpRecord {
>> > uint8_t op;
>> > float32 A;
>> > float32 B;
>> > } FpRecord;
>> > FpRecord fp_cache[1024];
>> > int fp_cache_length;
>> > uint32_t fp_exceptions;
>> >
>> > 1. For each new fp operation we push it to the fp_cache,
>> > 2. Once we read the fp_exceptions , then we re-compute
>> > the fp_exceptions by re-running the fp FpRecord sequence.
>> > and clear fp_cache_length.
>>
>> Why do you need to store more than the last fp op? The cumulative bits can
>> be tracked like it's done for other targets by not clearing fp_status then
>> you can read it from there. Only the non-sticky FI bit needs to be
>> computed but that's only determined by the last op so it's enough to
>> remember that and run that with softfloat (or even hardfloat after
>> clearing status but softfloat may be faster for this) to get the bits for
>> last op when status is read.
>>
> Yep, storing only the last fp op is also an option. Do you mean that we
> store the last fp op and calculate it when necessary? I am thinking about
> a general fp optimization method that suits all targets.
I think that's getting a little ahead of yourself. Let's prove the
technique is valuable for PPC (given it has the most to gain). We can
always generalise later if it's worthwhile.
Rather than creating a new structure, I would suggest creating 3 new TCG
globals (op, inA, inB) and refactoring the front-end code so that each FP
op loads the TCG globals. The TCG optimizer should pick up aliased loads
and automatically eliminate the dead ones. We might need some new
machinery in TCG to avoid spilling the values over potentially
faulting loads/stores, but that is likely a phase 2 problem.
Next you will want to find the places that care about the per-op bits of
cpu_fpscr and call a helper with the new globals to re-run the
computation and feed the values in.
That would give you a reasonable working prototype with which to start
measuring the overhead and whether it makes a difference.
>
>>
>> > 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
>> > clear fp_exceptions.
>> > 4. If the fp_cache are full, then we re-compute
>> > the fp_exceptions by re-running the fp FpRecord sequence.
>>
>> All this cache management and more than one element seems unnecessary to
>> me although I may be missing something.
>>
>> > Now the keypoint is how to tracking the read and write of FPSCR register,
>> > The current code are
>> > cpu_fpscr = tcg_global_mem_new(cpu_env,
>> > offsetof(CPUPPCState, fpscr), "fpscr");
>>
>> Maybe you could search where the value is read which should be the places
>> where we need to handle it but changes may be needed to make a clear API
>> for this between target/ppc, TCG and softfloat which likely does not
>> exist yet.
Once the per-op calculation is fixed in the PPC front-end I think the
only change needed is to remove the #if defined(TARGET_PPC) in
softfloat.c - it's only really there because it avoids the overhead of
checking flags which we always know to be clear in its case.
--
Alex Bennée
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: About hardfloat in ppc
2020-05-01 13:10 ` Alex Bennée
@ 2020-05-01 13:39 ` BALATON Zoltan
2020-05-01 14:01 ` Alex Bennée
2020-05-01 14:18 ` Richard Henderson
1 sibling, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-05-01 13:39 UTC (permalink / raw)
To: Alex Bennée
Cc: Richard Henderson, qemu-devel, Programmingkid, luoyonggang,
qemu-ppc, Howard Spoelstra, Dino Papararo
[-- Attachment #1: Type: text/plain, Size: 4967 bytes --]
On Fri, 1 May 2020, Alex Bennée wrote:
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>>>> That's what I suggested,
>>>> We preserve a float computing cache
>>>> typedef struct FpRecord {
>>>> uint8_t op;
>>>> float32 A;
>>>> float32 B;
>>>> } FpRecord;
>>>> FpRecord fp_cache[1024];
>>>> int fp_cache_length;
>>>> uint32_t fp_exceptions;
>>>>
>>>> 1. For each new fp operation we push it to the fp_cache,
>>>> 2. Once we read the fp_exceptions , then we re-compute
>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>> and clear fp_cache_length.
>>>
>>> Why do you need to store more than the last fp op? The cumulative bits can
>>> be tracked like it's done for other targets by not clearing fp_status then
>>> you can read it from there. Only the non-sticky FI bit needs to be
>>> computed but that's only determined by the last op so it's enough to
>>> remember that and run that with softfloat (or even hardfloat after
>>> clearing status but softfloat may be faster for this) to get the bits for
>>> last op when status is read.
>>>
>> Yeap, store only the last fp op is also an option. Do you means that store
>> the last fp op,
>> and calculate it when necessary? I am thinking about a general fp
>> optmize method that suite
>> for all target.
>
> I think that's getting a little ahead of yourself. Let's prove the
> technique is valuable for PPC (given it has the most to gain). We can
> always generalise later if it's worthwhile.
>
> Rather than creating a new structure I would suggest creating 3 new tcg
> globals (op, inA, inB) and re-factor the front-end code so each FP op
> loaded the TCG globals.
So basically, wherever we see helper_reset_fpstatus() in target/ppc we
would need to replace it with saving the op and args to globals? Or just
repurpose this helper to do that. It is called before every fp op but
not before the sub ops within vector ops. Is that correct? Probably it is,
as a vector op is a single op, but then how do we detect flag changes
caused by its sub ops? I think there might be some existing bugs here.
> The TCG optimizer should pick up aliased loads
> and automatically eliminate the dead ones. We might need some new
> machinery for the TCG to avoid spilling the values over potentially
> faulting loads/stores but that is likely a phase 2 problem.
I have no idea how to do this or even where to look. Some more detailed
explanation may be needed here.
> Next you will want to find places that care about the per-op bits of
> cpu_fpscr and call a helper with the new globals to re-run the
> computation and feed the values in.
So the code that cares about these bits is in the guest, thus we would need
to compute them when we detect the guest accessing them. Detecting when
individual bits are accessed might be difficult, so at first we could just
check whether the fpscr is read and recompute the FI bit before returning
the value. You previously said this might happen when the fpscr is read or
when generating exceptions, but I'm not sure where exactly these are done
for ppc. (I'd expect to have mffpscr but there seem to be various other ops
instead that access parts of fpscr, found in
target/ppc/fp-impl.inc.c:567, so this would need studying the PPC docs to
understand how the guest can access the FI bit of the fpscr reg.)
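A rough sketch of that first approach (recompute FI only when the whole
fpscr is read), under several assumptions: ToyFpState, toy_fmuls() and
toy_read_fpscr() are made-up names and not QEMU code, only a
single-precision multiply is recorded, and the FI/XX bit positions are what
I believe target/ppc uses and should be checked there. The inexact test
relies on the product of two floats being exact in a double:

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions (bit 0 = least significant); verify in target/ppc. */
#define TOY_FPSCR_FI (1u << 17)   /* non-sticky: last op was inexact */
#define TOY_FPSCR_XX (1u << 25)   /* sticky inexact */

/* Hypothetical state: the fpscr plus the inputs of the last multiply. */
typedef struct {
    uint32_t fpscr;
    bool have_last;
    float last_a, last_b;
} ToyFpState;

/* x - x is 0 for finite x and NaN for infinities and NaNs. */
static bool is_finite_f(float x) { return x - x == 0.0f; }

/* Fast path: remember the inputs, compute on the host FPU, touch no flags. */
static float toy_fmuls(ToyFpState *s, float a, float b)
{
    s->last_a = a;
    s->last_b = b;
    s->have_last = true;
    return a * b;
}

/* True if rounding a * b to single precision loses information. */
static bool fmuls_was_inexact(float a, float b)
{
    if (!is_finite_f(a) || !is_finite_f(b)) {
        return false;               /* Inf/NaN inputs never signal inexact */
    }
    /* 24 + 24 significand bits fit in a double's 53, so this is exact. */
    double exact = (double)a * (double)b;
    if (exact > FLT_MAX || exact < -FLT_MAX) {
        return true;                /* not representable as a finite float */
    }
    volatile float rounded = (float)exact;
    return (double)rounded != exact;
}

/* Slow path: only when the guest reads fpscr do we work out FI. */
static uint32_t toy_read_fpscr(ToyFpState *s)
{
    assert(s);
    if (s->have_last) {
        s->fpscr &= ~TOY_FPSCR_FI;
        if (fmuls_was_inexact(s->last_a, s->last_b)) {
            s->fpscr |= TOY_FPSCR_FI | TOY_FPSCR_XX;
        }
        s->have_last = false;       /* result is now folded into fpscr */
    }
    return s->fpscr;
}
```

Note that this only keeps XX correct for the op that was last before a
read; the cumulative bits for earlier ops would still have to come from
fp_status as described above.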
> That would give you a reasonable working prototype to start doing some
> measurements of overhead and if it makes a difference.
>
>>
>>>
>>>> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
>>>> clear fp_exceptions.
>>>> 4. If the fp_cache are full, then we re-compute
>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>
>>> All this cache management and more than one element seems unnecessary to
>>> me although I may be missing something.
>>>
>>>> Now the keypoint is how to tracking the read and write of FPSCR register,
>>>> The current code are
>>>> cpu_fpscr = tcg_global_mem_new(cpu_env,
>>>> offsetof(CPUPPCState, fpscr), "fpscr");
>>>
>>> Maybe you could search where the value is read which should be the places
>>> where we need to handle it but changes may be needed to make a clear API
>>> for this between target/ppc, TCG and softfloat which likely does not
>>> exist yet.
>
> Once the per-op calculation is fixed in the PPC front-end I think the
> only change needed is to remove the #if defined(TARGET_PPC) in
> softfloat.c - it's only really there because it avoids the overhead of
> checking flags which we always know to be clear in its case.
That's the theory but I've found that removing that define currently makes
general fp ops slower but vector ops faster so I think there may be some
bugs that would need to be found and fixed. So testing with some proper
test suite might be needed.
Regards,
BALATON Zoltan
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: About hardfloat in ppc
2020-05-01 13:39 ` BALATON Zoltan
@ 2020-05-01 14:01 ` Alex Bennée
0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-05-01 14:01 UTC (permalink / raw)
To: BALATON Zoltan
Cc: Richard Henderson, qemu-devel, Programmingkid, luoyonggang,
qemu-ppc, Howard Spoelstra, Dino Papararo
BALATON Zoltan <balaton@eik.bme.hu> writes:
> On Fri, 1 May 2020, Alex Bennée wrote:
>> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>>> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>>>>> That's what I suggested,
>>>>> We preserve a float computing cache
>>>>> typedef struct FpRecord {
>>>>> uint8_t op;
>>>>> float32 A;
>>>>> float32 B;
>>>>> } FpRecord;
>>>>> FpRecord fp_cache[1024];
>>>>> int fp_cache_length;
>>>>> uint32_t fp_exceptions;
>>>>>
>>>>> 1. For each new fp operation we push it to the fp_cache,
>>>>> 2. Once we read the fp_exceptions , then we re-compute
>>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>>> and clear fp_cache_length.
>>>>
>>>> Why do you need to store more than the last fp op? The cumulative bits can
>>>> be tracked like it's done for other targets by not clearing fp_status then
>>>> you can read it from there. Only the non-sticky FI bit needs to be
>>>> computed but that's only determined by the last op so it's enough to
>>>> remember that and run that with softfloat (or even hardfloat after
>>>> clearing status but softfloat may be faster for this) to get the bits for
>>>> last op when status is read.
>>>>
>>> Yeap, store only the last fp op is also an option. Do you means that store
>>> the last fp op,
>>> and calculate it when necessary? I am thinking about a general fp
>>> optmize method that suite
>>> for all target.
>>
>> I think that's getting a little ahead of yourself. Let's prove the
>> technique is valuable for PPC (given it has the most to gain). We can
>> always generalise later if it's worthwhile.
>>
>> Rather than creating a new structure I would suggest creating 3 new tcg
>> globals (op, inA, inB) and re-factor the front-end code so each FP op
>> loaded the TCG globals.
>
> So that's basically wherever you see helper_reset_fpstatus() in
> target/ppc we would need to replace it with saving op and args to
> globals? Or just repurpose this helper to do that. This is called
> before every fp op but not before sub ops within vector ops. Is that
> correct? Probably it is, as vector ops are a single op but how do we
> detect changes in flags by sub ops for those? These might have some
> existing bugs I think.
I'll defer to the PPC front end experts on this. I'm not familiar with
how it all goes together at all.
>
>> The TCG optimizer should pick up aliased loads
>> and automatically eliminate the dead ones. We might need some new
>> machinery for the TCG to avoid spilling the values over potentially
>> faulting loads/stores but that is likely a phase 2 problem.
>
> I have no idea how to do this or even where to look. Some more
> detailed explanation may be needed here.
Don't worry about it now. Let's worry about it when we see how often
faulting instructions are interleaved with fp ops.
>
>> Next you will want to find places that care about the per-op bits of
>> cpu_fpscr and call a helper with the new globals to re-run the
>> computation and feed the values in.
>
> So the code that cares about these bits are in guest thus we would
> need to compute it if we detect the guest accessing these. Detecting
> when the individual bits are accessed might be difficult so at first
> we could go for checking if the fpscr is read and recompute FI bit
> then before returning value. You previously said these might be when
> fpscr is read or when generating exceptions but not sure where exactly
> are these done for ppc. (I'd expect to have mffpscr but there seem to
> be different other ops instead accessing parts of fpscr which are
> found in target/ppc/fp-impl.inc.c:567 so this would need studying the
> PPC docs to understand how the guest can access the FI bit of fpscr
> reg.)
>
>> That would give you a reasonable working prototype to start doing some
>> measurements of overhead and if it makes a difference.
>>
>>>
>>>>
>>>>> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
>>>>> clear fp_exceptions.
>>>>> 4. If the fp_cache are full, then we re-compute
>>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>>
>>>> All this cache management and more than one element seems unnecessary to
>>>> me although I may be missing something.
>>>>
>>>>> Now the keypoint is how to tracking the read and write of FPSCR register,
>>>>> The current code are
>>>>> cpu_fpscr = tcg_global_mem_new(cpu_env,
>>>>> offsetof(CPUPPCState, fpscr), "fpscr");
>>>>
>>>> Maybe you could search where the value is read which should be the places
>>>> where we need to handle it but changes may be needed to make a clear API
>>>> for this between target/ppc, TCG and softfloat which likely does not
>>>> exist yet.
>>
>> Once the per-op calculation is fixed in the PPC front-end I think the
>> only change needed is to remove the #if defined(TARGET_PPC) in
>> softfloat.c - it's only really there because it avoids the overhead of
>> checking flags which we always know to be clear in its case.
>
> That's the theory but I've found that removing that define currently
> makes general fp ops slower but vector ops faster so I think there may
> be some bugs that would need to be found and fixed. So testing with
> some proper test suite might be needed.
You might want to do what Laurent did and hack up a testfloat with
"system" implementations:
https://github.com/vivier/m68k-testfloat/blob/master/testfloat/M68K-Linux-GCC/systfloat.c
It would be nice to plumb that sort of support into our existing
testfloat fork in the code base (tests/fp) but I suspect getting an
out-of-tree fork building and running first would be the quickest way
forward.
>
> Regards,
> BALATON Zoltan
--
Alex Bennée
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: About hardfloat in ppc
2020-05-01 13:10 ` Alex Bennée
2020-05-01 13:39 ` BALATON Zoltan
@ 2020-05-01 14:18 ` Richard Henderson
2020-05-01 16:25 ` 罗勇刚(Yonggang Luo)
2020-05-01 16:29 ` 罗勇刚(Yonggang Luo)
1 sibling, 2 replies; 40+ messages in thread
From: Richard Henderson @ 2020-05-01 14:18 UTC (permalink / raw)
To: Alex Bennée, luoyonggang
Cc: qemu-devel, Programmingkid, qemu-ppc, Howard Spoelstra, Dino Papararo
On 5/1/20 6:10 AM, Alex Bennée wrote:
>
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>
>> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>
>>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>>>> That's what I suggested,
>>>> We preserve a float computing cache
>>>> typedef struct FpRecord {
>>>> uint8_t op;
>>>> float32 A;
>>>> float32 B;
>>>> } FpRecord;
>>>> FpRecord fp_cache[1024];
>>>> int fp_cache_length;
>>>> uint32_t fp_exceptions;
>>>>
>>>> 1. For each new fp operation we push it to the fp_cache,
>>>> 2. Once we read the fp_exceptions , then we re-compute
>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>> and clear fp_cache_length.
>>>
>>> Why do you need to store more than the last fp op? The cumulative bits can
>>> be tracked like it's done for other targets by not clearing fp_status then
>>> you can read it from there. Only the non-sticky FI bit needs to be
>>> computed but that's only determined by the last op so it's enough to
>>> remember that and run that with softfloat (or even hardfloat after
>>> clearing status but softfloat may be faster for this) to get the bits for
>>> last op when status is read.
>>>
>> Yeap, store only the last fp op is also an option. Do you means that store
>> the last fp op,
>> and calculate it when necessary? I am thinking about a general fp
>> optmize method that suite
>> for all target.
>
> I think that's getting a little ahead of yourself. Let's prove the
> technique is valuable for PPC (given it has the most to gain). We can
> always generalise later if it's worthwhile.
Indeed.
> Rather than creating a new structure I would suggest creating 3 new tcg
> globals (op, inA, inB) and re-factor the front-end code so each FP op
> loaded the TCG globals. The TCG optimizer should pick up aliased loads
> and automatically eliminate the dead ones. We might need some new
> machinery for the TCG to avoid spilling the values over potentially
> faulting loads/stores but that is likely a phase 2 problem.
There's no point in new tcg globals.
Every fp operation can raise an exception, and therefore every fp operation
will flush tcg globals to memory. Therefore there is no optimization to be
done at the tcg opcode level.
However, every fp operation calls a helper function, and the quickest thing to
do is store the inputs to env->(op, inA, inB, inC) in the helper before
performing the operation.
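As a minimal sketch of what that could look like: ToyEnv stands in for
CPUPPCState, and LastFpOp, record_fp_op() and the toy_helper_* functions
are hypothetical names, not existing QEMU code. The helper does three plain
stores and then the arithmetic; how the FPSCR bits would later be derived
from a replay is deliberately left out.

```c
#include <assert.h>

/* Hypothetical op codes; not QEMU's actual definitions. */
typedef enum { FP_OP_NONE, FP_OP_ADD, FP_OP_MUL } FpOpKind;

/* The last fp operation and its inputs. */
typedef struct {
    FpOpKind op;
    double in_a, in_b;
    double in_c;   /* reserved for three-operand ops such as fmadd (not shown) */
} LastFpOp;

/* Stand-in for the CPU state that every helper already receives. */
typedef struct {
    LastFpOp last;
} ToyEnv;

/* Record the operation before executing it. */
static void record_fp_op(ToyEnv *env, FpOpKind op,
                         double a, double b, double c)
{
    env->last.op = op;
    env->last.in_a = a;
    env->last.in_b = b;
    env->last.in_c = c;
}

static double toy_helper_fadd(ToyEnv *env, double a, double b)
{
    record_fp_op(env, FP_OP_ADD, a, b, 0.0);
    return a + b;
}

static double toy_helper_fmul(ToyEnv *env, double a, double b)
{
    record_fp_op(env, FP_OP_MUL, a, b, 0.0);
    return a * b;
}

/* Re-run the recorded operation, e.g. when the status is finally read. */
static double replay_last_fp_op(const ToyEnv *env)
{
    const LastFpOp *r = &env->last;

    switch (r->op) {
    case FP_OP_ADD:
        return r->in_a + r->in_b;
    case FP_OP_MUL:
        return r->in_a * r->in_b;
    default:
        assert(0 && "no fp op recorded");
        return 0.0;
    }
}
```

Since the record lives in env, nothing needs to be visible at the tcg
opcode level.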
> Next you will want to find places that care about the per-op bits of
> cpu_fpscr and call a helper with the new globals to re-run the
> computation and feed the values in.
Before we even get to this deferred fp operation thing, there are several giant
improvements to ppc emulation that can be made:
Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
I've mentioned this before, that it's possible to leave the steady-state of
env->fp_status.exception_flags == 0, so there's no need for a separate function
call. I suspect this is worth a decent speedup by itself.
Step 2 is to notice when all fp exceptions are masked, so that no exception can
be raised, and set a tb_flags bit. This is the default fp environment that
libc enables and therefore extremely common.
Currently, ppc has 3 helpers called per fp operation. If step 1 is handled
correctly, then we're down to 2 fp helpers per fp operation. If no exceptions
need raising, then we can perform the entire operation with a single function call.
We would require a parallel set of fp helpers that (1) performs the operation
and (2) does any post-processing of the exception bits straight away, but (3)
without raising any exceptions. Sort of like helper_fadd +
do_float_check_status, but less. IIRC the only real extra work is categorizing
invalid exceptions. We could even plausibly extend softfloat to do that while
it is recording the invalid exception.
Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute inexact
from the inverse hardfloat operation. This would let us relax the restriction
of only using hardfloat when we already have an accrued inexact exception.
Only after all of these are done is it worth experimenting with caching the
last fp operation.
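To illustrate Step 3 for addition only, here is one possible reading of
"inverse operation": compute the sum on the host FPU, then subtract each
input back out and compare. This is a sketch, not the proposed softfloat.c
change; add_is_inexact() is a made-up name, and it assumes IEEE 754
doubles, round-to-nearest and no excess precision. Under those assumptions
the check should be equivalent to the residual of Knuth's TwoSum being
zero.

```c
#include <assert.h>
#include <stdbool.h>

/* x - x is 0 for finite x and NaN for infinities and NaNs. */
static bool is_finite_d(double x) { return x - x == 0.0; }

/*
 * Returns true if the host sum a + b had to be rounded, i.e. if the
 * inexact flag would be raised for it.
 */
static bool add_is_inexact(double a, double b)
{
    /* volatile keeps the compiler from simplifying the round trip away */
    volatile double sum = a + b;

    if (!is_finite_d(a) || !is_finite_d(b)) {
        return false;               /* Inf/NaN inputs never signal inexact */
    }
    if (!is_finite_d(sum)) {
        return true;                /* overflow always signals inexact */
    }

    /* Inverse operations: an exact sum gives both inputs back exactly. */
    volatile double back_a = sum - b;
    volatile double back_b = sum - a;
    return back_a != a || back_b != b;
}
```

For example, add_is_inexact(1.0, 0x1p-60) is true because the small
addend is rounded away, while add_is_inexact(1.5, 2.25) is false.
Multiply and divide would need a different inverse check.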
r~
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: About hardfloat in ppc
2020-05-01 14:18 ` Richard Henderson
@ 2020-05-01 16:25 ` 罗勇刚(Yonggang Luo)
2020-05-01 19:33 ` Alex Bennée
2020-05-01 16:29 ` 罗勇刚(Yonggang Luo)
1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 16:25 UTC (permalink / raw)
To: Richard Henderson
Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Alex Bennée
[-- Attachment #1: Type: text/plain, Size: 6472 bytes --]
On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
richard.henderson@linaro.org> wrote:
> On 5/1/20 6:10 AM, Alex Bennée wrote:
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu>
> wrote:
> >>
> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> >>>> That's what I suggested,
> >>>> We preserve a float computing cache
> >>>> typedef struct FpRecord {
> >>>> uint8_t op;
> >>>> float32 A;
> >>>> float32 B;
> >>>> } FpRecord;
> >>>> FpRecord fp_cache[1024];
> >>>> int fp_cache_length;
> >>>> uint32_t fp_exceptions;
> >>>>
> >>>> 1. For each new fp operation we push it to the fp_cache,
> >>>> 2. Once we read the fp_exceptions , then we re-compute
> >>>> the fp_exceptions by re-running the fp FpRecord sequence.
> >>>> and clear fp_cache_length.
> >>>
> >>> Why do you need to store more than the last fp op? The cumulative bits
> can
> >>> be tracked like it's done for other targets by not clearing fp_status
> then
> >>> you can read it from there. Only the non-sticky FI bit needs to be
> >>> computed but that's only determined by the last op so it's enough to
> >>> remember that and run that with softfloat (or even hardfloat after
> >>> clearing status but softfloat may be faster for this) to get the bits
> for
> >>> last op when status is read.
> >>>
> >> Yeap, store only the last fp op is also an option. Do you means that
> store
> >> the last fp op,
> >> and calculate it when necessary? I am thinking about a general fp
> >> optmize method that suite
> >> for all target.
> >
> > I think that's getting a little ahead of yourself. Let's prove the
> > technique is valuable for PPC (given it has the most to gain). We can
> > always generalise later if it's worthwhile.
>
> Indeed.
>
> > Rather than creating a new structure I would suggest creating 3 new tcg
> > globals (op, inA, inB) and re-factor the front-end code so each FP op
> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
> > and automatically eliminate the dead ones. We might need some new
> > machinery for the TCG to avoid spilling the values over potentially
> > faulting loads/stores but that is likely a phase 2 problem.
>
> There's no point in new tcg globals.
>
> Every fp operation can raise an exception, and therefore every fp operation
> will flush tcg globals to memory. Therefore there is no optimization to be
> done at the tcg opcode level.
>
> However, every fp operation calls a helper function, and the quickest
> thing to
> do is store the inputs to env->(op, inA, inB, inC) in the helper before
> performing the operation.
>
I think it is possible to add TCG ops to optimize floating point. For
example, WebAssembly does not support floating-point exceptions or fp
rounding modes at all. I suppose most fp code does not need to care about
the rounding mode or fp exceptions, and for that path we can use a tcg-op
to abstract it; in all other cases we can fall back to soft-float. As a
final step in optimizing fp acceleration in QEMU, we can split the tcg-op
into two paths: one is hard-float with a result cache for lazy fp flag
calculation, and the other is a pure soft-float path.
For lazy fp flag calculation, note that the flags are sticky:
```
float_flag_invalid = 1,
float_flag_divbyzero = 4,
float_flag_overflow = 8,
float_flag_underflow = 16,
float_flag_inexact = 32,
```
We can skip the calculation of a flag when it is already set to 1.
For these five flags we can split the calculation into five functions, each
of which checks only one flag. Once a flag is set to 1, we won't call its
function any more until the flag is cleared. This would remove a lot of
branches (and branch mispredictions), and the functions would only be
called when the fp flags are requested.
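A small sketch of this scheme, with many simplifications: only division is
recorded, only two of the five checkers are written out (the others would
follow the same pattern), signalling NaNs are ignored, and all names here
(LazyFlags, FlagChecker, lazy_record_div, lazy_read_flags) are invented for
the example. The flag values are the ones listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { FLAG_INVALID = 1, FLAG_DIVBYZERO = 4 };

typedef struct { double a, b; } DivRecord;      /* a pending a / b */

#define MAX_PENDING 16

typedef struct {
    DivRecord pending[MAX_PENDING];  /* ops whose flags are not computed yet */
    int n_pending;
    uint32_t sticky;                 /* flags already known to be set */
} LazyFlags;

static bool is_finite_d(double x) { return x - x == 0.0; }
static bool is_inf_d(double x) { return x == x && !is_finite_d(x); }

/* One function per flag, each looking at a single recorded op. */
static bool check_invalid(const DivRecord *r)    /* 0/0 or inf/inf */
{
    return (r->a == 0.0 && r->b == 0.0) || (is_inf_d(r->a) && is_inf_d(r->b));
}

static bool check_divbyzero(const DivRecord *r)  /* finite non-zero / 0 */
{
    return r->b == 0.0 && r->a != 0.0 && is_finite_d(r->a);
}

typedef struct {
    uint32_t flag;
    bool (*check)(const DivRecord *);
    int calls;                       /* instrumentation for the example only */
} FlagChecker;

static FlagChecker checkers[] = {
    { FLAG_INVALID,   check_invalid,   0 },
    { FLAG_DIVBYZERO, check_divbyzero, 0 },
};

/* Fast path: just remember the inputs; the caller divides on the host FPU. */
static void lazy_record_div(LazyFlags *s, double a, double b)
{
    assert(s->n_pending < MAX_PENDING);  /* a real version would flush here */
    s->pending[s->n_pending].a = a;
    s->pending[s->n_pending].b = b;
    s->n_pending++;
}

/* Slow path: called only when the guest asks for the flags. */
static uint32_t lazy_read_flags(LazyFlags *s)
{
    for (unsigned i = 0; i < sizeof(checkers) / sizeof(checkers[0]); i++) {
        FlagChecker *c = &checkers[i];
        if (s->sticky & c->flag) {
            continue;                /* already set: never call its checker */
        }
        for (int j = 0; j < s->n_pending; j++) {
            c->calls++;
            if (c->check(&s->pending[j])) {
                s->sticky |= c->flag;
                break;               /* sticky: one hit is enough */
            }
        }
    }
    s->n_pending = 0;
    return s->sticky;
}
```

Once FLAG_DIVBYZERO is set, later reads skip check_divbyzero() completely
until the guest clears the flag, which is the saving described above.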
This is my final goal for optimizing fp in QEMU; before that, we can do
simpler things to optimize fp in QEMU.
Besides these types of optimization, we could also offload the fp
exception calculation to another CPU core to improve single-threaded
performance: single-core performance is hard to improve, but multi-core
systems are more and more common these days; with Ryzen 2 / Threadripper we
even have 64 cores / 128 threads.
>
> > Next you will want to find places that care about the per-op bits of
> > cpu_fpscr and call a helper with the new globals to re-run the
> > computation and feed the values in.
>
> Before we even get to this deferred fp operation thing, there are several
> giant
> improvements to ppc emulation that can be made:
>
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate
> function
> call. I suspect this is worth a decent speedup by itself.
>
I would like to start the fp optimization from here.
>
> Step 2 is to notice when all fp exceptions are masked, so that no
> exception can
> be raised, and set a tb_flags bit. This is the default fp environment that
> libc enables and therefore extremely common.
>
> Currently, ppc has 3 helpers called per fp operation. If step 1 is handled
> correctly, then we're down to 2 fp helpers per fp operation. If no
> exceptions
> need raising, then we can perform the entire operation with a single
> function call.
>
> We would require a parallel set of fp helpers that (1) performs the
> operation
> and (2) does any post-processing of the exception bits straight away, but
> (3)
> without raising any exceptions. Sort of like helper_fadd +
> do_float_check_status, but less. IIRC the only real extra work is
> categorizing
> invalid exceptions. We could even plausibly extend softfloat to do that
> while
> it is recording the invalid exception.
>
> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute
> inexact
> from the inverse hardfloat operation. This would let us relax the
> restriction
> of only using hardfloat when we already have an accrued inexact
> exception.
>
> Only after all of these are done is it worth experimenting with caching the
> last fp operation.
>
>
> r~
>
--
此致
礼
罗勇刚
Yours
sincerely,
Yonggang Luo
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: About hardfloat in ppc
2020-05-01 16:25 ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 19:33 ` Alex Bennée
0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-05-01 19:33 UTC (permalink / raw)
To: luoyonggang
Cc: Richard Henderson, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Dino Papararo
罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
> richard.henderson@linaro.org> wrote:
>
>> On 5/1/20 6:10 AM, Alex Bennée wrote:
>> >
>> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>> >
>> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu>
>> wrote:
>> >>
>> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>> >>>> That's what I suggested,
>> >>>> We preserve a float computing cache
>> >>>> typedef struct FpRecord {
>> >>>> uint8_t op;
>> >>>> float32 A;
>> >>>> float32 B;
>> >>>> } FpRecord;
>> >>>> FpRecord fp_cache[1024];
>> >>>> int fp_cache_length;
>> >>>> uint32_t fp_exceptions;
>> >>>>
>> >>>> 1. For each new fp operation we push it to the fp_cache,
>> >>>> 2. Once we read the fp_exceptions , then we re-compute
>> >>>> the fp_exceptions by re-running the fp FpRecord sequence.
>> >>>> and clear fp_cache_length.
>> >>>
>> >>> Why do you need to store more than the last fp op? The cumulative bits
>> can
>> >>> be tracked like it's done for other targets by not clearing fp_status
>> then
>> >>> you can read it from there. Only the non-sticky FI bit needs to be
>> >>> computed but that's only determined by the last op so it's enough to
>> >>> remember that and run that with softfloat (or even hardfloat after
>> >>> clearing status but softfloat may be faster for this) to get the bits
>> for
>> >>> last op when status is read.
>> >>>
>> >> Yeap, store only the last fp op is also an option. Do you means that
>> store
>> >> the last fp op,
>> >> and calculate it when necessary? I am thinking about a general fp
>> >> optmize method that suite
>> >> for all target.
>> >
>> > I think that's getting a little ahead of yourself. Let's prove the
>> > technique is valuable for PPC (given it has the most to gain). We can
>> > always generalise later if it's worthwhile.
>>
>> Indeed.
>>
>> > Rather than creating a new structure I would suggest creating 3 new tcg
>> > globals (op, inA, inB) and re-factor the front-end code so each FP op
>> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
>> > and automatically eliminate the dead ones. We might need some new
>> > machinery for the TCG to avoid spilling the values over potentially
>> > faulting loads/stores but that is likely a phase 2 problem.
>>
>> There's no point in new tcg globals.
>>
>> Every fp operation can raise an exception, and therefore every fp operation
>> will flush tcg globals to memory. Therefore there is no optimization to be
>> done at the tcg opcode level.
>>
>> However, every fp operation calls a helper function, and the quickest
>> thing to
>> do is store the inputs to env->(op, inA, inB, inC) in the helper before
>> performing the operation.
>>
> I think it is possible to add TCG ops to optimize floating point. For
> example, WebAssembly does not support floating-point exceptions or fp
> rounding modes at all. I suppose most fp code does not need to care about
> the rounding mode or fp exceptions, and for that path we can use a tcg-op
> to abstract it; in all other cases we can fall back to soft-float. As a
> final step in optimizing fp acceleration in QEMU, we can split the tcg-op
> into two paths: one is hard-float with a result cache for lazy fp flag
> calculation, and the other is a pure soft-float path.
We have talked about adding support for floating point TCG ops in the
past but I think we would need to be a fair bit farther down the road
before we can attempt that. The overhead of the helper call is
relatively minimal compared to that of executing the operation
itself. As you can see from all the various front end wrappings around
the softfloat code there are a fair number of implementation details
you'd need to abstract away into the TCG generation code to make it
useful for all our guests.
> For lazy fp flag calculation, note that the flags are sticky:
> ```
> float_flag_invalid = 1,
> float_flag_divbyzero = 4,
> float_flag_overflow = 8,
> float_flag_underflow = 16,
> float_flag_inexact = 32,
> ```
> We can skip the calculation of a flag when it is already set to 1.
> For these five flags we can split the calculation into five functions, each
> of which checks only one flag. Once a flag is set to 1, we won't call its
> function any more until the flag is cleared. This would remove a lot of
> branches (and branch mispredictions), and the functions would only be
> called when the fp flags are requested.
> This is my final goal for optimizing fp in QEMU; before that, we can do
> simpler things to optimize fp in QEMU.
>
> Besides these types of optimization, we could also offload the fp
> exception calculation to another CPU core to improve single-threaded
> performance: single-core performance is hard to improve, but multi-core
> systems are more and more common these days; with Ryzen 2 / Threadripper we
> even have 64 cores / 128 threads.
I would take some convincing that offloading exception calculation to
another thread would make a difference - surely there would be
inter-thread syncing required? Our main approach to threading has been
trying to improve scalability for softmmu so we can emulate more vCPUs
in the system.
>
>
>
>>
>> > Next you will want to find places that care about the per-op bits of
>> > cpu_fpscr and call a helper with the new globals to re-run the
>> > computation and feed the values in.
>>
>> Before we even get to this deferred fp operation thing, there are several
>> giant
>> improvements to ppc emulation that can be made:
>>
>> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
>> I've mentioned this before, that it's possible to leave the steady-state of
>> env->fp_status.exception_flags == 0, so there's no need for a separate
>> function
>> call. I suspect this is worth a decent speedup by itself.
>>
> I would like to start the fp optimization from here.
>
>
>>
>> Step 2 is to notice when all fp exceptions are masked, so that no
>> exception can
>> be raised, and set a tb_flags bit. This is the default fp environment that
>> libc enables and therefore extremely common.
>>
>> Currently, ppc has 3 helpers called per fp operation. If step 1 is handled
>> correctly, then we're down to 2 fp helpers per fp operation. If no
>> exceptions
>> need raising, then we can perform the entire operation with a single
>> function call.
>>
>> We would require a parallel set of fp helpers that (1) performs the
>> operation
>> and (2) does any post-processing of the exception bits straight away, but
>> (3)
>> without raising any exceptions. Sort of like helper_fadd +
>> do_float_check_status, but less. IIRC the only real extra work is
>> categorizing
>> invalid exceptions. We could even plausibly extend softfloat to do that
>> while
>> it is recording the invalid exception.
>>
>> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute
>> inexact
>> from the inverse hardfloat operation. This would let us relax the
>> restriction
>> of only using hardfloat when we already have an accrued inexact
>> exception.
>>
>> Only after all of these are done is it worth experimenting with caching the
>> last fp operation.
>>
>>
>> r~
>>
--
Alex Bennée
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: About hardfloat in ppc
2020-05-01 14:18 ` Richard Henderson
2020-05-01 16:25 ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 16:29 ` 罗勇刚(Yonggang Luo)
2020-05-01 16:51 ` Richard Henderson
1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 16:29 UTC (permalink / raw)
To: Richard Henderson
Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Alex Bennée
[-- Attachment #1: Type: text/plain, Size: 4876 bytes --]
On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
richard.henderson@linaro.org> wrote:
> On 5/1/20 6:10 AM, Alex Bennée wrote:
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu>
> wrote:
> >>
> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> >>>> That's what I suggested,
> >>>> We preserve a float computing cache
> >>>> typedef struct FpRecord {
> >>>> uint8_t op;
> >>>> float32 A;
> >>>> float32 B;
> >>>> } FpRecord;
> >>>> FpRecord fp_cache[1024];
> >>>> int fp_cache_length;
> >>>> uint32_t fp_exceptions;
> >>>>
> >>>> 1. For each new fp operation we push it to the fp_cache,
> >>>> 2. Once we read the fp_exceptions , then we re-compute
> >>>> the fp_exceptions by re-running the fp FpRecord sequence.
> >>>> and clear fp_cache_length.
> >>>
> >>> Why do you need to store more than the last fp op? The cumulative bits
> can
> >>> be tracked like it's done for other targets by not clearing fp_status
> then
> >>> you can read it from there. Only the non-sticky FI bit needs to be
> >>> computed but that's only determined by the last op so it's enough to
> >>> remember that and run that with softfloat (or even hardfloat after
> >>> clearing status but softfloat may be faster for this) to get the bits
> for
> >>> last op when status is read.
> >>>
> >> Yes, storing only the last fp op is also an option. Do you mean that we
> >> store the last fp op and recompute it when necessary? I am thinking
> >> about a general fp optimization method that suits all targets.
> >
> > I think that's getting a little ahead of yourself. Let's prove the
> > technique is valuable for PPC (given it has the most to gain). We can
> > always generalise later if it's worthwhile.
>
> Indeed.
>
> > Rather than creating a new structure I would suggest creating 3 new tcg
> > globals (op, inA, inB) and re-factor the front-end code so each FP op
> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
> > and automatically eliminate the dead ones. We might need some new
> > machinery for the TCG to avoid spilling the values over potentially
> > faulting loads/stores but that is likely a phase 2 problem.
>
> There's no point in new tcg globals.
>
> Every fp operation can raise an exception, and therefore every fp operation
> will flush tcg globals to memory. Therefore there is no optimization to be
> done at the tcg opcode level.
>
> However, every fp operation calls a helper function, and the quickest
> thing to
> do is store the inputs to env->(op, inA, inB, inC) in the helper before
> performing the operation.
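
A minimal sketch of that idea in C, under simplifying assumptions: the operation is a toy integer division standing in for an fp op, "inexact" means a nonzero remainder, and `ToyEnv`, `toy_fast_div` and `toy_read_inexact` are hypothetical names, not QEMU code. The fast path only records its inputs; the per-op bit is recomputed from the recorded inputs when the status is actually read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { TOY_OP_NONE, TOY_OP_DIV } ToyOp;

/* Hypothetical CPU state holding only the most recent operation. */
typedef struct {
    ToyOp    last_op;
    uint32_t in_a, in_b;
    unsigned slow_reruns;   /* how often the deferred path actually ran */
} ToyEnv;

/* Fast path: store the inputs, then compute the result without flags. */
static uint32_t toy_fast_div(ToyEnv *env, uint32_t a, uint32_t b)
{
    assert(b != 0);         /* the toy model does not cover division by zero */
    env->last_op = TOY_OP_DIV;
    env->in_a = a;
    env->in_b = b;
    return a / b;
}

/* Deferred path: re-run the last operation only when the bit is read. */
static bool toy_read_inexact(ToyEnv *env)
{
    if (env->last_op == TOY_OP_DIV) {
        env->slow_reruns++;
        return (env->in_a % env->in_b) != 0;
    }
    return false;
}
```

Only the last operation is kept, which matches the point above that a non-sticky per-op bit depends on the last op alone; sticky bits would still have to be accumulated some other way.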
>
>
> > Next you will want to find places that care about the per-op bits of
> > cpu_fpscr and call a helper with the new globals to re-run the
> > computation and feed the values in.
>
> Before we even get to this deferred fp operation thing, there are several
> giant
> improvements to ppc emulation that can be made:
>
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate
> function
> call. I suspect this is worth a decent speedup by itself.
>
Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
Can you give me a more detailed example? I still don't get the idea.
>
> Step 2 is to notice when all fp exceptions are masked, so that no
> exception can
> be raised, and set a tb_flags bit. This is the default fp environment that
> libc enables and therefore extremely common.
>
> Currently, ppc has 3 helpers called per fp operation. If step 1 is handled
> correctly, then we're down to 2 fp helpers per fp operation. If no
> exceptions
> need raising, then we can perform the entire operation with a single
> function call.
>
> We would require a parallel set of fp helpers that (1) performs the
> operation
> and (2) does any post-processing of the exception bits straight away, but
> (3)
> without raising any exceptions. Sort of like helper_fadd +
> do_float_check_status, but less. IIRC the only real extra work is
> categorizing
> invalid exceptions. We could even plausibly extend softfloat to do that
> while
> it is recording the invalid exception.
>
> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute
> inexact
> from the inverse hardfloat operation. This would let us relax the
> restriction
> of only using hardfloat when we already have an accrued inexact
> exception.
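
One possible reading of step 3 for addition, sketched in C. This uses Knuth's TwoSum error-free transformation, which is my choice for illustration and not necessarily the exact scheme proposed in the thread: it subtracts the operands back out of the sum and reports inexact when a rounding error remains. It assumes round-to-nearest and that `double` arithmetic is really evaluated in double precision; `toy_add_is_inexact` is a hypothetical name.

```c
#include <math.h>
#include <stdbool.h>

/*
 * Returns true when a + b had to be rounded.
 * volatile keeps the compiler from simplifying the subtractions away.
 */
static bool toy_add_is_inexact(double a, double b)
{
    volatile double s = a + b;
    if (!isfinite(s)) {
        /* Overflow of finite inputs is inexact; inf/NaN inputs are not. */
        return isfinite(a) && isfinite(b);
    }
    volatile double bv = s - a;    /* the part of s attributed to b */
    volatile double av = s - bv;   /* the part of s attributed to a */
    volatile double err = (a - av) + (b - bv);
    return err != 0.0;
}
```

For example, `toy_add_is_inexact(1.0, 2.0)` is false while `toy_add_is_inexact(0.1, 0.2)` is true. The extra subtractions run on the host FPU, so no status register needs to be read.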
>
> Only after all of these are done is it worth experimenting with caching the
> last fp operation.
>
>
> r~
>
--
Yours sincerely,
Yonggang Luo (罗勇刚)
* Re: About hardfloat in ppc
2020-05-01 16:29 ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 16:51 ` Richard Henderson
2020-05-01 17:49 ` 罗勇刚(Yonggang Luo)
0 siblings, 1 reply; 40+ messages in thread
From: Richard Henderson @ 2020-05-01 16:51 UTC (permalink / raw)
To: luoyonggang
Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Alex Bennée
On 5/1/20 9:29 AM, 罗勇刚(Yonggang Luo) wrote:
> On Fri, May 1, 2020 at 10:18 PM Richard Henderson <richard.henderson@linaro.org
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate function
> call. I suspect this is worth a decent speedup by itself.
>
> Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
> Can you give me a more detailed example? I still don't get the idea.
See target/openrisc, helper_update_fpcsr.
This is like target/ppc helper_float_check_status, in that it is called after
the primary fpu helper, after the fpu result is written back to the
architectural register, to process fpu exceptions.
Note that if get_float_exception_flags returns non-zero, we immediately reset
them to zero. Thus the exception flags are only ever non-zero in between the
primary fpu operation and the update of the fpscr.
Thus, no need for a separate helper_reset_fpstatus.
r~
* Re: About hardfloat in ppc
2020-05-01 16:51 ` Richard Henderson
@ 2020-05-01 17:49 ` 罗勇刚(Yonggang Luo)
2020-05-01 20:35 ` Richard Henderson
0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 17:49 UTC (permalink / raw)
To: Richard Henderson
Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Alex Bennée
On Sat, May 2, 2020 at 12:51 AM Richard Henderson <
richard.henderson@linaro.org> wrote:
> On 5/1/20 9:29 AM, 罗勇刚(Yonggang Luo) wrote:
> > On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
> richard.henderson@linaro.org
> > Step 1 is to rearrange the fp helpers to eliminate
> helper_reset_fpstatus().
> > I've mentioned this before, that it's possible to leave the
> steady-state of
> > env->fp_status.exception_flags == 0, so there's no need for a
> separate function
> > call. I suspect this is worth a decent speedup by itself.
> >
> > Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
> > Can you give me a more detailed example? I still don't get the idea.
>
> See target/openrisc, helper_update_fpcsr.
>
> This is like target/ppc helper_float_check_status, in that it is called
> after
> the primary fpu helper, after the fpu result is written back to the
> architectural register, to process fpu exceptions.
>
> Note that if get_float_exception_flags returns non-zero, we immediately
> reset
> them to zero. Thus the exception flags are only ever non-zero in between
> the
> primary fpu operation and the update of the fpscr.
>
According to
```
void HELPER(update_fpcsr)(CPUOpenRISCState *env)
{
int tmp = get_float_exception_flags(&env->fp_status);
if (tmp) {
set_float_exception_flags(0, &env->fp_status);
tmp = ieee_ex_to_openrisc(tmp);
if (tmp) {
env->fpcsr |= tmp;
if (env->fpcsr & FPCSR_FPEE) {
helper_exception(env, EXCP_FPE);
}
}
}
}
```
So is OpenRISC also clearing the flags before each fp operation?
>
> Thus, no need for a separate helper_reset_fpstatus.
>
>
> r~
>
--
Yours sincerely,
Yonggang Luo (罗勇刚)
* Re: About hardfloat in ppc
2020-05-01 17:49 ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 20:35 ` Richard Henderson
0 siblings, 0 replies; 40+ messages in thread
From: Richard Henderson @ 2020-05-01 20:35 UTC (permalink / raw)
To: luoyonggang
Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Alex Bennée
On 5/1/20 10:49 AM, 罗勇刚(Yonggang Luo) wrote:
>
>
> On Sat, May 2, 2020 at 12:51 AM Richard Henderson <richard.henderson@linaro.org
> <mailto:richard.henderson@linaro.org>> wrote:
>
> On 5/1/20 9:29 AM, 罗勇刚(Yonggang Luo) wrote:
> > On Fri, May 1, 2020 at 10:18 PM Richard Henderson
> <richard.henderson@linaro.org <mailto:richard.henderson@linaro.org>
> > Step 1 is to rearrange the fp helpers to eliminate
> helper_reset_fpstatus().
> > I've mentioned this before, that it's possible to leave the
> steady-state of
> > env->fp_status.exception_flags == 0, so there's no need for a
> separate function
> > call. I suspect this is worth a decent speedup by itself.
> >
> > Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
> > Can you give me a more detailed example? I still don't get the idea.
>
> See target/openrisc, helper_update_fpcsr.
>
> This is like target/ppc helper_float_check_status, in that it is called after
> the primary fpu helper, after the fpu result is written back to the
> architectural register, to process fpu exceptions.
>
> Note that if get_float_exception_flags returns non-zero, we immediately reset
> them to zero. Thus the exception flags are only ever non-zero in between the
> primary fpu operation and the update of the fpscr.
>
> According to
> ```
> void HELPER(update_fpcsr)(CPUOpenRISCState *env)
> {
> int tmp = get_float_exception_flags(&env->fp_status);
>
> if (tmp) {
> set_float_exception_flags(0, &env->fp_status);
> tmp = ieee_ex_to_openrisc(tmp);
> if (tmp) {
> env->fpcsr |= tmp;
> if (env->fpcsr & FPCSR_FPEE) {
> helper_exception(env, EXCP_FPE);
> }
> }
> }
> }
> ```
> So is OpenRISC also clearing the flags before each fp operation?
No. Please re-read my description above.
OpenRISC is clearing the flags *after* each fp operation, at the same time that
it processes the flags from the current fp operation.
There are two calls at runtime for openrisc, e.g. do_fp2:
fn(cpu_R(dc, a->d), cpu_env, cpu_R(dc, a->a));
gen_helper_update_fpcsr(cpu_env);
Whereas for ppc there are between 2 and 5 calls at runtime, e.g. in _GEN_FLOAT_ACB:
> gen_reset_fpstatus(); [1]
> get_fpr(t0, rA(ctx->opcode));
> get_fpr(t1, rC(ctx->opcode));
> get_fpr(t2, rB(ctx->opcode));
> gen_helper_f##op(t3, cpu_env, t0, t1, t2); [2]
> if (isfloat) {
> gen_helper_frsp(t3, cpu_env, t3); [3]
> }
> set_fpr(rD(ctx->opcode), t3);
> if (set_fprf) {
> gen_compute_fprf_float64(t3); [4]
> }
> if (unlikely(Rc(ctx->opcode) != 0)) {
> gen_set_cr1_from_fpscr(ctx); [5]
> }
For step 1, we're talking about removing the call to gen_reset_fpstatus.
It might be worth adding a debugging check to the beginning of each helper of
the form [2] to assert that the exception flags are in fact zero. This check
might be removed later, in relation to future improvements, but it can help
ensure that the value of set_fprf is correct, and validate that step 1 isn't
breaking anything.
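
A minimal sketch of what that could look like, as a toy model in C and not the real target/ppc code. `ToyEnv`, `toy_helper_half` and `toy_check_status` are hypothetical names, and the "fp operation" is an integer halving that is inexact when a bit is shifted out. The primary helper asserts the steady state on entry, and the post-op helper is the only place that clears the flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit and CPU state; not QEMU's real definitions. */
enum { TOY_FLAG_INEXACT = 1 };

typedef struct {
    uint8_t  exception_flags; /* per-op flags, zero in the steady state */
    uint32_t fpscr;           /* guest-visible accumulated status */
} ToyEnv;

/* Primary helper, like call [2]: no reset call precedes it. */
static uint32_t toy_helper_half(ToyEnv *env, uint32_t a)
{
    /* Debug check: the previous instruction must have left flags at zero. */
    assert(env->exception_flags == 0);
    if (a & 1u) {
        env->exception_flags |= TOY_FLAG_INEXACT; /* a bit was shifted out */
    }
    return a >> 1;
}

/* Post-op helper: fold the flags into fpscr and restore the steady state. */
static void toy_check_status(ToyEnv *env)
{
    uint8_t flags = env->exception_flags;
    if (flags) {
        env->exception_flags = 0;
        env->fpscr |= flags;
    }
}
```

If some code path forgets the post-op helper, the next primary helper trips the assert instead of silently merging stale flags into its own result.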
r~
* Re: R: About hardfloat in ppc
2020-04-29 11:57 ` Alex Bennée
2020-04-29 12:33 ` 罗勇刚(Yonggang Luo)
2020-04-29 14:31 ` R: " Dino Papararo
@ 2020-04-29 23:12 ` 罗勇刚(Yonggang Luo)
2 siblings, 0 replies; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-29 23:12 UTC (permalink / raw)
To: Alex Bennée
Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
Howard Spoelstra, Dino Papararo
On Wed, Apr 29, 2020 at 7:57 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Dino Papararo <skizzato73@msn.com> writes:
>
> > Hello,
> > about handling of PPC fpu exceptions and Hard Floats support we could
> consider a different approach for different instructions.
> > i.e. not all fpu instructions care about the inexact or exception
> bits: if I take a simple fadd f0,f1,f2 I'll copy the value derived from adding
> f1+f2 into the f0 register and no one will check the inexact or exception
> bits raised in the FPSCR register.
> > Instead if I take fadd. f0,f1,f2 the dot following the add
> instruction means I want to take the inexact or exception bits into account.
> > So I could use hard floats for the first case and softfloats for the second case.
> > Could this be a fast solution to start implementing hard floats for PPC?
>
> While it may be true that normal software practice is not to read the
> exception registers for every operation we can't base our emulation on
> that. We must always be able to re-create the state of the exception
> registers whenever they may be read by the program. There are 3 cases
> this may happen:
>
> - a direct read of the inexact register
> - checking the sigcontext of a synchronous exception (e.g. fault)
> - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)
>
> Given the way the translator works we can simplify the asynchronous case
> because we know they are only ever delivered at the start of translated
> blocks. We must have a fully rectified system state at the end of every
> block. So let's consider some cases:
>
> fpOpA
> clear flags
> fpOpB
> clear flags
> fpOpC
> read flags
>
I am thinking about a new way to optimize this, if something like LLVM's
InstCombine is possible in TCG.
Suppose we have
clearFlagsFpOpA
clearFlagsFpOpB
clearFlagsFpOpC
clearFlagsFpOpD
Then we could combine these into
FpOpA
FpOpB
FpOpC
clearFlagsFpOpD
Would this be a possible idea?
I think TCG has basic blocks, so we could optimize
TCG at the basic block level.
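
A rough sketch of such a pass in C, over a made-up op list and not TCG's real IR. It assumes the simplified model in which every fp op clears the flags before setting its own (no sticky bits), and that flags must be correct at reads, at potentially faulting ops, and at the end of the block. `ToyOpKind` and `toy_drop_dead_flags` are hypothetical names.

```c
#include <stddef.h>

typedef enum {
    TOY_FP_CLEARING,  /* fp op that clears, then sets, the flags */
    TOY_FP_FLAGLESS,  /* fp op whose flags are never observed */
    TOY_READ_FLAGS,   /* guest reads the flags */
    TOY_MAY_FAULT     /* load/store that may expose the flags via a fault */
} ToyOpKind;

/*
 * Walk the block backwards, tracking whether the flags are live.
 * An fp op whose flags are overwritten before anyone can see them
 * is rewritten to the flagless form. Returns the number rewritten.
 */
static size_t toy_drop_dead_flags(ToyOpKind *ops, size_t n)
{
    size_t converted = 0;
    int live = 1;                 /* flags must be valid at block end */

    for (size_t i = n; i-- > 0; ) {
        switch (ops[i]) {
        case TOY_READ_FLAGS:
        case TOY_MAY_FAULT:
            live = 1;
            break;
        case TOY_FP_CLEARING:
            if (!live) {
                ops[i] = TOY_FP_FLAGLESS;
                converted++;
            }
            live = 0;             /* this op overwrites earlier flags */
            break;
        case TOY_FP_FLAGLESS:
            break;
        }
    }
    return converted;
}
```

With four consecutive clearing fp ops this rewrites the first three and keeps the last, as in the example above. A load or store between two fp ops keeps both in the flag-computing form.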
>
> Assuming we know the fpOps can't generate exceptions we can know that
> only fpOpC will ever generate a user visible floating point flags so we
> can indeed use hardfloat for fpOpA and fpOpB. However if we see the
> pattern:
>
> fpOpA
> ld/st
> clear flags
> fpOpB
> read flags
>
> we must have the fully rectified version of the flags because the ld/st
> may fault. However it's not guaranteed it will fault so we could defer
> the flag calculation for fpOpA until such time as we need it. The
> easiest way would be to save the values going into the operation and
> then re-run it in softfloat when required (hopefully never ;-).
>
> A lot will depend on the behaviour of the architecture. For example:
>
> fpOpA
> fpOpB
> read flags
>
> whether or not we need to be able to calculate the flags for fpOpA will
> depend on if fpOpB completely resets the flags visible or if the result
> is additive.
>
> So in short I think there may be scope for using hardfloat but it will
> require knowledge of front-end knowing if it is safe to skip flag
> calculation in particular cases. We might even need support within TCG
> for saving (and marking) temporaries over potentially faulting
> boundaries so these lazy evaluations can be done. We can certainly add a
> fp-status less set of primitives to softfloat which can use the
> hardfloat path when we know we are using normal numbers.
>
> >
> > A little of documentation here:
> http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html
> >
> > Regards,
> > Dino Papararo
> >
> > -----Messaggio originale-----
> > Da: Qemu-devel <qemu-devel-bounces+skizzato73=msn.com@nongnu.org> Per
> conto di Alex Bennée
> > Inviato: martedì 28 aprile 2020 10:37
> > A: luoyonggang@gmail.com
> > Cc: qemu-ppc@nongnu.org; qemu-devel@nongnu.org
> > Oggetto: Re: About hardfloat in ppc
> >
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> I am confused about why we can use hard-float only when inexact is set.
> >
> > The inexact behaviour of the host hardware may be different from the
> guest architecture we are trying to emulate and the host hardware may not
> be configurable to emulate the guest mode.
> >
> > Have a look in softfloat.c and see all the places where
> float_flag_inexact is set. Can you convince yourself that the host hardware
> will do the same?
> >
> >> And PPC always clears the inexact flag before calling the soft-float
> >> functions, so we cannot optimize it with hard-float.
> >> I need some resources about the inexact flag and why it is always cleared
> >> in PPC FP simulation.
> >
> > Because that is the behaviour of the PPC floating point unit. The
> inexact flag will represent the last operation done.
> >
> >> I am looking at two possible solutions:
> >> 1. do not clear the inexact flag in PPC simulation
> >> 2. even when inexact is cleared, still use an alternative hard-float path.
> >>
> >> But I am a beginner and have no clue about all of this.
> >
> > Well you'll need to learn about floating point because these are rather
> fundamental aspects of its behaviour. In the old days QEMU used to use the
> host floating point processor with its template based translation.
> > However this led to lots of weird bugs because the floating point
> answers under qemu were different from the target it was trying to
> emulate. It was for this reason softfloat was introduced. The hardfloat
> optimisation can only be done when we are confident that we will get the
> exact same answer of the target we are trying to emulate - a "faster but
> incorrect" mode is just going to cause confusion as discussed in the
> previous thread. Have you read that yet?
> >
> >>
> >> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée <alex.bennee@linaro.org>
> wrote:
> >>
> >>>
> >>> BALATON Zoltan <balaton@eik.bme.hu> writes:
> >>>
> >>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
> >>> >> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >>> >>> Because ppc fpu-helper are always clearing float_flag_inexact, So
> >>> >>> is that possible to optimize the performance when
> >>> float_flag_inexact
> >>> >>> are cleared?
> >>> >>
> >>> >> There was some discussion about this in the last thread about
> >>> >> enabling hardfloat for PPC. See the thread:
> >>> >>
> >>> >> Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> >> Date: Tue, 18 Feb 2020 18:10:16 +0100
> >>> >> Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>
> >>> >
> >>> > I've answered this already with link to that thread here:
> >>> >
> >>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
> >>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
> >>> > : From: BALATON Zoltan <balaton@eik.bme.hu>
> >>> > : To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
> >>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
> >>> qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
> >>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> > :
> >>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> >>> > :> Are this stable now? I'd like to see hard float to be landed:)
> >>> > :
> >>> > : If you want to see hardfloat for PPC then you should read the
> >>> > : replies to this patch which can be found here:
> >>> > :
> >>> > : http://patchwork.ozlabs.org/patch/1240235/
> >>> > :
> >>> > : to understand what's needed then try to implement the solution
> >>> > : with FP exceptions cached in a global that maybe could work. I
> >>> > : won't be able to do that as said here:
> >>> > :
> >>> > : https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
> >>> > :
> >>> > : because I don't have time to learn all the details needed. I think
> >>> > : others are in the same situation so unless somebody puts in the
> >>> > : necessary effort this won't change.
> >>> >
> >>> > Which also had a proposed solution to the problem that you could
> >>> > try to implement, in particular see this message:
> >>> >
> >>> >
> >>> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124
> >>> >
> >>> > and Richard's reply immediately below that. In short to optimise
> >>> > FPU emulation we would either find a way to compute inexact flag
> >>> > quickly without reading the FPU status (this may not be possible)
> >>> > or somehow get status from the FPU but the obvious way of clearing
> >>> > the flag and reading them after each operation is too slow. So
> >>> > maybe using exceptions and only clearing when actually there's a
> >>> > change could be faster.
> >>> >
> >>> > As to how to use exceptions see this message in above thread:
> >>> >
> >>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html
> >>> >
> >>> > But that's only to show how to hook in an exception handler what it
> >>> > does needs to be implemented. Then tested and benchmarked.
> >>> >
> >>> > I still don't know where are the extensive PPC floating point tests
> >>> > to use for checking results though as that was never answered.
> >>>
> >>> Specifically for PPC we don't have them. We use the softfloat test
> >>> cases to exercise our softfloat/hardfloat code as part of "make
> >>> check-softfloat". You can also re-build fp-bench for each guest
> >>> target to measure raw throughput.
> >>>
> >>> >> However in short the problem is if the float_flag_inexact is clear
> >>> >> you must use softfloat so you can properly calculate the inexact
> >>> >> status. We can't take advantage of the inexact stickiness without
> >>> >> losing the fidelity of the calculation.
> >>> >
> >>> > I still don't get why can't we use hardware via exception handler
> >>> > to detect flags for us and why do we only use hardfloat in some
> >>> > corner cases. If reading the status is too costly then we could
> >>> > mirror it in a global which is set by an FP exception handler.
> >>> > Shouldn't that be faster? Is there a reason that can't work?
> >>>
> >>> It would work but it would be slow. Almost every FP operation sets
> >>> the inexact flag so it would generate an exception and exceptions
> >>> take time to process.
> >>>
> >>> For the guests where we use hardfloat operations with inexact already
> >>> latched is not a corner case - it is the common case which is why it
> >>> helps.
> >>>
> >>> >
> >>> > Regards,
> >>> > BALATON Zoltan
> >>>
> >>>
> >>> --
> >>> Alex Bennée
> >>>
>
>
> --
> Alex Bennée
>
--
Yours sincerely,
Yonggang Luo (罗勇刚)