* About hardfloat in ppc
@ 2020-04-27  6:39 罗勇刚(Yonggang Luo)
  2020-04-27  9:42 ` Alex Bennée
  0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-27  6:39 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc

Because the ppc fpu-helper functions always clear float_flag_inexact,
is it possible to optimize performance when float_flag_inexact
is cleared?

-- 
Yours sincerely,
Yonggang Luo (罗勇刚)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-04-27  6:39 About hardfloat in ppc 罗勇刚(Yonggang Luo)
@ 2020-04-27  9:42 ` Alex Bennée
  2020-04-27 10:34   ` BALATON Zoltan
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Bennée @ 2020-04-27  9:42 UTC (permalink / raw)
  To: luoyonggang; +Cc: qemu-ppc, qemu-devel


罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:

> Because ppc fpu-helper are always clearing float_flag_inexact,
> So is that possible to optimize the performance when  float_flag_inexact
> are cleared?

There was some discussion about this in the last thread about enabling
hardfloat for PPC. See the thread:

  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
  Date: Tue, 18 Feb 2020 18:10:16 +0100
  Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>

However, in short, the problem is that if float_flag_inexact is clear you
must use softfloat so you can properly calculate the inexact status. We
can't take advantage of the inexact stickiness without losing the
fidelity of the calculation.
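
To make the condition concrete, here is a minimal sketch of the kind of
gate described above. All names (fp_status, FLAG_INEXACT,
ROUND_NEAREST_EVEN, can_use_host_fpu) are hypothetical and chosen for
illustration; this is not the QEMU code. It assumes the host path is only
trusted in round-to-nearest-even, and only when the guest's inexact flag
is already set, so that the host result cannot change that flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag and rounding-mode values, for illustration only. */
enum { FLAG_INEXACT = 1u << 0, FLAG_OVERFLOW = 1u << 1 };
enum { ROUND_NEAREST_EVEN = 0, ROUND_TO_ZERO = 1 };

struct fp_status {
    uint8_t exception_flags;  /* accumulated guest exception flags */
    uint8_t rounding_mode;    /* current guest rounding mode */
};

/* The host FPU may be used only if inexact is already latched and the
 * guest rounds to nearest-even; otherwise fall back to softfloat. */
static bool can_use_host_fpu(const struct fp_status *s)
{
    return (s->exception_flags & FLAG_INEXACT) != 0 &&
           s->rounding_mode == ROUND_NEAREST_EVEN;
}
```

With a guest that clears the inexact flag before every operation, this
predicate is false on every call, which is the situation described for PPC.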

-- 
Alex Bennée


* Re: About hardfloat in ppc
  2020-04-27  9:42 ` Alex Bennée
@ 2020-04-27 10:34   ` BALATON Zoltan
  2020-04-27 11:10     ` Alex Bennée
  0 siblings, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-04-27 10:34 UTC (permalink / raw)
  To: Alex Bennée; +Cc: luoyonggang, qemu-ppc, qemu-devel

On Mon, 27 Apr 2020, Alex Bennée wrote:
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>> Because ppc fpu-helper are always clearing float_flag_inexact,
>> So is that possible to optimize the performance when  float_flag_inexact
>> are cleared?
>
> There was some discussion about this in the last thread about enabling
> hardfloat for PPC. See the thread:
>
>  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>  Date: Tue, 18 Feb 2020 18:10:16 +0100
>  Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>

I've already answered this, with a link to that thread, here:

On Fri, 10 Apr 2020, BALATON Zoltan wrote:
: Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
: From: BALATON Zoltan <balaton@eik.bme.hu>
: To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
: Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle, qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
: Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
:
: On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
:> Is this stable now? I'd like to see hardfloat landed :)
:
: If you want to see hardfloat for PPC then you should read the replies to 
: this patch which can be found here:
:
: http://patchwork.ozlabs.org/patch/1240235/
:
: to understand what's needed then try to implement the solution with FP 
: exceptions cached in a global that maybe could work. I won't be able to 
: do that as said here:
:
: https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
:
: because I don't have time to learn all the details needed. I think 
: others are in the same situation so unless somebody puts in the 
: necessary effort this won't change.

Which also had a proposed solution to the problem that you could try to 
implement, in particular see this message:

http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124

and Richard's reply immediately below that. In short, to optimise FPU
emulation we would either need to find a way to compute the inexact flag
quickly without reading the FPU status (this may not be possible), or
somehow get the status from the FPU; but the obvious way of clearing the
flags and reading them after each operation is too slow. So maybe using
exceptions and only clearing when there's actually a change could be faster.
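
For addition at least, the first option can be sketched without touching
the FPU status: the classic TwoSum error-free transformation recovers the
rounding error of a + b, and the sum is inexact exactly when that error is
nonzero. This is only an illustration under assumptions: finite IEEE
binary64 inputs, round-to-nearest, no extended-precision evaluation, no
overflow, and no -ffast-math. The function name add_is_inexact is made up
here, and the extra operations it costs would still need benchmarking.

```c
#include <stdbool.h>

/* Returns true if a + b had to be rounded (the inexact flag would be
 * raised), without reading the FPU status register.
 * Assumes finite binary64 inputs, round-to-nearest, and no overflow. */
static bool add_is_inexact(double a, double b, double *sum)
{
    /* volatile discourages the compiler from simplifying the algebra away */
    volatile double s   = a + b;
    volatile double bb  = s - a;                      /* part of b that reached s */
    volatile double err = (a - (s - bb)) + (b - bb);  /* exact rounding error */

    *sum = s;
    return err != 0.0;
}
```

Other operations (multiply, divide, fused multiply-add) would need their
own error terms, so this does not by itself answer the general question.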

As to how to use exceptions see this message in above thread:

https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html

But that's only to show how to hook in an exception handler; what it does
still needs to be implemented, then tested and benchmarked.

I still don't know where the extensive PPC floating point tests are that
could be used for checking results, as that was never answered.

> However, in short, the problem is that if float_flag_inexact is clear you
> must use softfloat so you can properly calculate the inexact status. We
> can't take advantage of the inexact stickiness without losing the
> fidelity of the calculation.

I still don't get why we can't use the hardware, via an exception handler,
to detect flags for us, and why we only use hardfloat in some corner cases.
If reading the status is too costly then we could mirror it in a global
which is set by an FP exception handler. Shouldn't that be faster? Is
there a reason that can't work?

Regards,
BALATON Zoltan

* Re: About hardfloat in ppc
  2020-04-27 10:34   ` BALATON Zoltan
@ 2020-04-27 11:10     ` Alex Bennée
  2020-04-27 21:18       ` 罗勇刚(Yonggang Luo)
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Bennée @ 2020-04-27 11:10 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: luoyonggang, qemu-ppc, qemu-devel


BALATON Zoltan <balaton@eik.bme.hu> writes:

> I still don't know where the extensive PPC floating point tests are that
> could be used for checking results, as that was never answered.

Specifically for PPC we don't have them. We use the softfloat test cases
to exercise our softfloat/hardfloat code as part of "make
check-softfloat". You can also re-build fp-bench for each guest target
to measure raw throughput.

>> However, in short, the problem is that if float_flag_inexact is clear you
>> must use softfloat so you can properly calculate the inexact status. We
>> can't take advantage of the inexact stickiness without losing the
>> fidelity of the calculation.
>
> I still don't get why we can't use the hardware, via an exception
> handler, to detect flags for us, and why we only use hardfloat in some
> corner cases. If reading the status is too costly then we could mirror
> it in a global which is set by an FP exception handler. Shouldn't that
> be faster? Is there a reason that can't work?

It would work but it would be slow. Almost every FP operation sets
the inexact flag so it would generate an exception and exceptions take
time to process.

For the guests where we use hardfloat, operating with inexact already
latched is not a corner case - it is the common case, which is why it
helps.
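
A toy model may help show why the latched case matters. The sketch below
is an assumption-laden illustration, not QEMU code: count_slow_path and
its parameters are invented names, and the per-operation inexact outcomes
are supplied as an array rather than computed. It counts how many
operations must take the precise (softfloat) path when the inexact flag
is sticky versus when it is cleared before every operation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count the operations that must take the slow path because the inexact
 * flag is not yet latched when the operation starts. */
static size_t count_slow_path(const bool *op_is_inexact, size_t n,
                              bool clear_before_each_op)
{
    bool inexact_latched = false;
    size_t slow = 0;

    for (size_t i = 0; i < n; i++) {
        if (clear_before_each_op) {
            inexact_latched = false;  /* PPC-helper-like behaviour */
        }
        if (!inexact_latched) {
            slow++;                   /* flag must be computed precisely */
        }
        if (op_is_inexact[i]) {
            inexact_latched = true;   /* sticky from now on */
        }
    }
    return slow;
}
```

With a sticky flag only the operations up to the first inexact result are
slow; with a clear before each operation, every operation is.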


-- 
Alex Bennée


* Re: About hardfloat in ppc
  2020-04-27 11:10     ` Alex Bennée
@ 2020-04-27 21:18       ` 罗勇刚(Yonggang Luo)
  2020-04-28  8:36         ` Alex Bennée
  0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-27 21:18 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-ppc, qemu-devel

I am confused about why we can use hard-float only when inexact is set.
PPC always clears the inexact flag before calling the soft-float
functions, so we cannot optimize it with hard-float.
I need some resources about the inexact flag and why it is always cleared
in the PPC FP simulation.
I am looking at two possible solutions:
1. Do not clear the inexact flag in the PPC simulation.
2. Even when inexact is cleared, still use an alternative hard-float path.

But I am a beginner for now and have no clue about all of this.


-- 
Yours sincerely,
Yonggang Luo (罗勇刚)


* Re: About hardfloat in ppc
  2020-04-27 21:18       ` 罗勇刚(Yonggang Luo)
@ 2020-04-28  8:36         ` Alex Bennée
  2020-04-28 14:29           ` 罗勇刚(Yonggang Luo)
                             ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Alex Bennée @ 2020-04-28  8:36 UTC (permalink / raw)
  To: luoyonggang; +Cc: qemu-ppc, qemu-devel


罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:

> I am confused about why we can use hard-float only when inexact is set.

The inexact behaviour of the host hardware may be different from the
guest architecture we are trying to emulate and the host hardware may
not be configurable to emulate the guest mode.

Have a look in softfloat.c and see all the places where
float_flag_inexact is set. Can you convince yourself that the host
hardware will do the same?

> PPC always clears the inexact flag before calling the soft-float
> functions, so we cannot optimize it with hard-float.
> I need some resources about the inexact flag and why it is always cleared
> in the PPC FP simulation.

Because that is the behaviour of the PPC floating point unit. The
inexact flag will represent the last operation done.
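
As far as I understand the PowerPC FPSCR, it carries two inexact-related
bits: FI, which reflects only the most recent operation, and XX, which is
sticky. The toy model below (struct and function names are invented for
this sketch, not taken from QEMU) shows why the per-operation result has
to be known: FI cannot be derived from an accumulated sticky flag.

```c
#include <stdbool.h>

struct toy_fpscr {
    bool fi;  /* fraction inexact: result of the last operation only */
    bool xx;  /* inexact exception: sticky until software clears it */
};

/* Apply the inexact outcome of one floating point operation. */
static void toy_fpscr_update(struct toy_fpscr *f, bool op_was_inexact)
{
    f->fi = op_was_inexact;           /* overwritten on every operation */
    f->xx = f->xx || op_was_inexact;  /* only ever set here, never cleared */
}
```

After an inexact operation followed by an exact one, fi is false again
while xx stays true.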

> I am looking at two possible solutions:
> 1. Do not clear the inexact flag in the PPC simulation.
> 2. Even when inexact is cleared, still use an alternative hard-float path.
>
> But I am a beginner for now and have no clue about all of this.

Well you'll need to learn about floating point because these are rather
fundamental aspects of its behaviour. In the old days QEMU used to use
the host floating point processor with its template based translation.
However this led to lots of weird bugs because the floating point
answers under qemu were different from the target it was trying to
emulate. It was for this reason softfloat was introduced. The hardfloat
optimisation can only be done when we are confident that we will get the
exact same answer as the target we are trying to emulate - a "faster but
incorrect" mode is just going to cause confusion as discussed in the
previous thread. Have you read that yet?


-- 
Alex Bennée


* Re: About hardfloat in ppc
  2020-04-28  8:36         ` Alex Bennée
@ 2020-04-28 14:29           ` 罗勇刚(Yonggang Luo)
  2020-04-29 10:17           ` R: " Dino Papararo
  2020-04-30 15:16           ` BALATON Zoltan
  2 siblings, 0 replies; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-28 14:29 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-ppc, qemu-devel

On Tue, Apr 28, 2020 at 4:36 PM Alex Bennée <alex.bennee@linaro.org> wrote:

> Well you'll need to learn about floating point because these are rather
> fundamental aspects of its behaviour. In the old days QEMU used to use
> the host floating point processor with its template based translation.
> However this led to lots of weird bugs because the floating point
> answers under qemu were different from the target it was trying to
> emulate. It was for this reason softfloat was introduced. The hardfloat
> optimisation can only be done when we are confident that we will get the
> exact same answer as the target we are trying to emulate - a "faster but
> incorrect" mode is just going to cause confusion as discussed in the
> previous thread. Have you read that yet?
>
Yep, I've already read that carefully, and I know that for PPC there is
currently no fast and correct way to do hardfloat emulation. My intention
is to find a possible fast and correct way to do hardfloat emulation for
the PPC target, at least on an x86 host.


-- 
Yours sincerely,
Yonggang Luo (罗勇刚)


* R: About hardfloat in ppc
  2020-04-28  8:36         ` Alex Bennée
  2020-04-28 14:29           ` 罗勇刚(Yonggang Luo)
@ 2020-04-29 10:17           ` Dino Papararo
  2020-04-29 10:31             ` Dino Papararo
  2020-04-29 11:57             ` Alex Bennée
  2020-04-30 15:16           ` BALATON Zoltan
  2 siblings, 2 replies; 40+ messages in thread
From: Dino Papararo @ 2020-04-29 10:17 UTC (permalink / raw)
  To: Alex Bennée, luoyonggang, BALATON Zoltan, Mark Cave-Ayland,
	Programmingkid, Howard Spoelstra
  Cc: qemu-ppc, qemu-devel

Hello,
regarding the handling of PPC FPU exceptions and hardfloat support, we could consider a different approach for different instructions.
That is, not all FPU instructions care about the inexact or exception bits: if I take a simple fadd f0,f1,f2, I copy the value of f1+f2 into the f0 register and no one checks the inexact or exception bits raised in the FPSCR register.
If instead I take fadd. f0,f1,f2, the dot following the instruction means I want to take the inexact or exception bits into account.
So I could use hardfloat for the first case and softfloat for the second case.
Could this be a fast way to start implementing hardfloat for PPC?
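
The dispatch this proposal implies could look roughly like the sketch
below. It relies on the record bit (Rc) being the least significant bit
of the 32-bit instruction word; the function names and the two-way path
enum are hypothetical. It shows only the instruction check and does not
establish that the non-record form can skip exception bookkeeping, since
the FPSCR can still be read later (for example with mffs).

```c
#include <stdbool.h>
#include <stdint.h>

enum fp_path { FP_PATH_HARD, FP_PATH_SOFT };

/* Rc is the lowest bit of the instruction word: set for "fadd.",
 * clear for "fadd". */
static bool ppc_insn_is_record_form(uint32_t insn)
{
    return (insn & 1u) != 0;
}

/* Proposed policy: record forms go through softfloat, others through
 * the host FPU. */
static enum fp_path choose_fp_path(uint32_t insn)
{
    return ppc_insn_is_record_form(insn) ? FP_PATH_SOFT : FP_PATH_HARD;
}
```

Assuming the standard A-form encoding, fadd f0,f1,f2 would be 0xFC01102A
and fadd. f0,f1,f2 would be 0xFC01102B, so the two differ only in that bit.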

A little documentation here: http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html

Regards,
Dino Papararo

-----Original Message-----
From: Qemu-devel <qemu-devel-bounces+skizzato73=msn.com@nongnu.org> On behalf of Alex Bennée
Sent: Tuesday, 28 April 2020 10:37
To: luoyonggang@gmail.com
Cc: qemu-ppc@nongnu.org; qemu-devel@nongnu.org
Subject: Re: About hardfloat in ppc


罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:

> I am confused about why we can use hard-float only when inexact is set.

The inexact behaviour of the host hardware may be different from the guest architecture we are trying to emulate and the host hardware may not be configurable to emulate the guest mode.

Have a look in softfloat.c and see all the places where float_flag_inexact is set. Can you convince yourself that the host hardware will do the same?

> PPC always clears the inexact flag before calling the soft-float
> functions, so we cannot optimize it with hard-float.
> I need some resources about the inexact flag and why it is always cleared
> in the PPC FP simulation.

Because that is the behaviour of the PPC floating point unit. The inexact flag will represent the last operation done.

> I am looking at two possible solutions:
> 1. Do not clear the inexact flag in the PPC simulation.
> 2. Even when inexact is cleared, still use an alternative hard-float path.
>
> But I am a beginner for now and have no clue about all of this.

Well you'll need to learn about floating point because these are rather fundamental aspects of its behaviour. In the old days QEMU used to use the host floating point processor with its template based translation.
However this led to lots of weird bugs because the floating point answers under qemu were different from the target it was trying to emulate. It was for this reason softfloat was introduced. The hardfloat optimisation can only be done when we are confident that we will get the exact same answer as the target we are trying to emulate - a "faster but incorrect" mode is just going to cause confusion as discussed in the previous thread. Have you read that yet?

>
> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>>
>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>
>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
>> >> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>> >>> Because ppc fpu-helper are always clearing float_flag_inexact, So
>> >>> is that possible to optimize the performance when float_flag_inexact
>> >>> are cleared?
>> >>
>> >> There was some discussion about this in the last thread about 
>> >> enabling hardfloat for PPC. See the thread:
>> >>
>> >>  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>> >>  Date: Tue, 18 Feb 2020 18:10:16 +0100
>> >>  Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>
>> >
>> > I've answered this already with link to that thread here:
>> >
>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
>> > : From: BALATON Zoltan <balaton@eik.bme.hu>
>> > : To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
>> > :     qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>> > :
>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
>> > :> Is this stable now? I'd like to see hard float landed :)
>> > :
>> > : If you want to see hardfloat for PPC then you should read the
>> > : replies to this patch, which can be found here:
>> > :
>> > : http://patchwork.ozlabs.org/patch/1240235/
>> > :
>> > : to understand what's needed, then try to implement the solution
>> > : with FP exceptions cached in a global that maybe could work. I
>> > : won't be able to do that, as said here:
>> > :
>> > : https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
>> > :
>> > : because I don't have time to learn all the details needed. I think
>> > : others are in the same situation so unless somebody puts in the
>> > : necessary effort this won't change.
>> >
>> > Which also had a proposed solution to the problem that you could 
>> > try to implement, in particular see this message:
>> >
>> >
>> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124
>> >
>> > and Richard's reply immediately below that. In short, to optimise
>> > FPU emulation we would either find a way to compute the inexact flag
>> > quickly without reading the FPU status (this may not be possible)
>> > or somehow get the status from the FPU, but the obvious way of
>> > clearing the flags and reading them after each operation is too slow.
>> > So maybe using exceptions and only clearing when there's actually a
>> > change could be faster.
>> >
>> > As to how to use exceptions see this message in above thread:
>> >
>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html
>> >
>> > But that's only to show how to hook in an exception handler; what it
>> > does needs to be implemented, then tested and benchmarked.
>> >
>> > I still don't know where the extensive PPC floating point tests
>> > to use for checking results are, though, as that was never answered.
>>
>> Specifically for PPC we don't have them. We use the softfloat test 
>> cases to exercise our softfloat/hardfloat code as part of "make 
>> check-softfloat". You can also re-build fp-bench for each guest 
>> target to measure raw throughput.
>>
>> >> However in short the problem is if the float_flag_inexact is clear 
>> >> you must use softfloat so you can properly calculate the inexact 
>> >> status. We can't take advantage of the inexact stickiness without 
>> >> losing the fidelity of the calculation.
>> >
>> > I still don't get why can't we use hardware via exception handler 
>> > to detect flags for us and why do we only use hardfloat in some 
>> > corner cases. If reading the status is too costly then we could 
>> > mirror it in a global which is set by an FP exception handler. 
>> > Shouldn't that be faster? Is there a reason that can't work?
>>
>> It would work but it would be slow. Almost every FP operation sets 
>> the inexact flag so it would generate an exception and exceptions 
>> take time to process.
>>
>> For the guests where we use hardfloat, operations with inexact already
>> latched are not a corner case - they are the common case, which is why
>> it helps.
>>
>> >
>> > Regards,
>> > BALATON Zoltan
>>
>>
>> --
>> Alex Bennée
>>


--
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* R: About hardfloat in ppc
  2020-04-29 10:17           ` R: " Dino Papararo
@ 2020-04-29 10:31             ` Dino Papararo
  2020-04-29 11:57             ` Alex Bennée
  1 sibling, 0 replies; 40+ messages in thread
From: Dino Papararo @ 2020-04-29 10:31 UTC (permalink / raw)
  To: Dino Papararo, Alex Bennée, luoyonggang, BALATON Zoltan,
	Mark Cave-Ayland, Programmingkid, Howard Spoelstra
  Cc: qemu-ppc, qemu-devel

Typo correction 😊 

" if I take a simple fadd f0,f1,f2 I'll copy value derived from adding f1+f2 into f0 register"



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: About hardfloat in ppc
  2020-04-29 10:17           ` R: " Dino Papararo
  2020-04-29 10:31             ` Dino Papararo
@ 2020-04-29 11:57             ` Alex Bennée
  2020-04-29 12:33               ` 罗勇刚(Yonggang Luo)
                                 ` (2 more replies)
  1 sibling, 3 replies; 40+ messages in thread
From: Alex Bennée @ 2020-04-29 11:57 UTC (permalink / raw)
  To: Dino Papararo
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, luoyonggang,
	qemu-ppc, Howard Spoelstra


Dino Papararo <skizzato73@msn.com> writes:

> Hello,
> about handling PPC FPU exceptions and hard float support, we could consider a different approach for different instructions.
> For example, not all FPU instructions care about the inexact or exception bits: if I take a simple fadd f0,f1,f2, I'll copy the value derived from adding f1+f2 into the f1 register, and no one will check the inexact or exception bits raised in the FPSCR register.
> If instead I take fadd. f0,f1,f2, the dot following the add instruction means I want to take the inexact or exception bits into account.
> So I could use hard floats for the first case and softfloats for the second case.
> Could this be a fast way to start implementing hard floats for PPC?

While it may be true that normal software practice is not to read the
exception registers for every operation, we can't base our emulation on
that. We must always be able to re-create the state of the exception
registers whenever they may be read by the program. There are three
cases where this may happen:

  - a direct read of the inexact flag (e.g. from the FPSCR)
  - checking the sigcontext of a synchronous exception (e.g. fault)
  - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)

Given the way the translator works we can simplify the asynchronous case
because we know such exceptions are only ever delivered at the start of
translated blocks. We must have a fully rectified system state at the end
of every block. So let's consider some cases:

  fpOpA
  clear flags
  fpOpB
  clear flags
  fpOpC
  read flags

Assuming we know the fpOps can't generate exceptions, we know that
only fpOpC will ever generate user-visible floating point flags, so we
can indeed use hardfloat for fpOpA and fpOpB. However if we see the
pattern:

  fpOpA
  ld/st
  clear flags
  fpOpB
  read flags

we must have the fully rectified version of the flags because the ld/st
may fault. However it's not guaranteed it will fault so we could defer
the flag calculation for fpOpA until such time as we need it. The
easiest way would be to save the values going into the operation and
then re-run it in softfloat when required (hopefully never ;-).
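
The "save the inputs and re-run later" idea can be sketched as a toy model. This is only an illustration in Python, not QEMU code: the class and method names are hypothetical, an exact rational comparison stands in for softfloat, only finite doubles are handled, and each operation is assumed to replace the flag rather than accumulate into it.

```python
from fractions import Fraction


class LazyInexact:
    """Hypothetical model of deferred inexact-flag calculation for fadd."""

    def __init__(self):
        self.pending = None      # operands of the last add, not yet analysed
        self.inexact = False     # last fully computed flag value
        self.slow_path_runs = 0  # how often the exact re-evaluation ran

    def fadd(self, a, b):
        # Fast path: host arithmetic only; remember the inputs for later.
        self.pending = (a, b)
        return a + b

    def clear_flags(self):
        # Clearing discards any flag work that was still outstanding.
        self.pending = None
        self.inexact = False

    def read_flags(self):
        # Slow path, taken only when something actually looks at the flags
        # (a direct read, or a fault that exposes the register state).
        if self.pending is not None:
            a, b = self.pending
            self.slow_path_runs += 1
            # Exact rational sum versus the rounded double result; this
            # comparison stands in for re-running the add in softfloat.
            self.inexact = Fraction(a) + Fraction(b) != Fraction(a + b)
            self.pending = None
        return self.inexact
```

In this model a sequence of fadd, clear_flags, fadd, read_flags runs the slow path once, for the last add only; the first add never pays for flag calculation unless something reads the flags before the clear.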

A lot will depend on the behaviour of the architecture. For example:

  fpOpA
  fpOpB
  read flags

whether or not we need to be able to calculate the flags for fpOpA will
depend on if fpOpB completely resets the flags visible or if the result
is additive.
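
To make the difference concrete, here is a small Python model of the two behaviours. The function names are made up for illustration, the inexact test uses exact rationals rather than softfloat, and only finite doubles are assumed.

```python
from fractions import Fraction


def add_is_inexact(a, b):
    """True if the double-precision sum a + b had to be rounded."""
    return Fraction(a) + Fraction(b) != Fraction(a + b)


def flags_after(ops, sticky):
    """Inexact flag visible after running `ops`, a list of (a, b) adds.

    sticky=True  models an additive flag (each op ORs into it).
    sticky=False models a flag that each op completely replaces.
    """
    flag = False
    for a, b in ops:
        this_op = add_is_inexact(a, b)
        flag = (flag or this_op) if sticky else this_op
    return flag


# fpOpA (0.1 + 0.2) is inexact, fpOpB (1.0 + 2.0) is exact.
ops = [(0.1, 0.2), (1.0, 2.0)]
sticky_view = flags_after(ops, sticky=True)
replace_view = flags_after(ops, sticky=False)
```

With the additive flag the read still sees the inexact result of fpOpA, so its flags must be computable; with the replacing flag only fpOpB matters and fpOpA could run on the fast path.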

So in short I think there may be scope for using hardfloat but it will
require the front-end to know when it is safe to skip flag
calculation in particular cases. We might even need support within TCG
for saving (and marking) temporaries over potentially faulting
boundaries so these lazy evaluations can be done. We can certainly add a
fp-status-less set of primitives to softfloat which can use the
hardfloat path when we know we are using normal numbers.
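
As a rough sketch of the front-end knowledge involved, the patterns above can be analysed with a backwards pass over a block. This is a Python illustration only: the op names ('fp', 'clear', 'read', 'fault') and the function are hypothetical, flags are assumed to be additive, and the state is treated as live at the end of every block.

```python
def ops_needing_flags(block):
    """Return indices of 'fp' ops whose flags may be observed.

    `block` is a list of strings: 'fp', 'clear', 'read' or 'fault'.
    'fault' marks an op such as a load/store that may expose the state.
    """
    needed = []
    live = True  # state must be fully rectified at the end of the block
    for i in range(len(block) - 1, -1, -1):
        op = block[i]
        if op in ('read', 'fault'):
            live = True    # everything before this point may be observed
        elif op == 'clear':
            live = False   # earlier flags are wiped before anyone looks
        elif op == 'fp' and live:
            needed.append(i)
    return sorted(needed)


first_pattern = ['fp', 'clear', 'fp', 'clear', 'fp', 'read']
second_pattern = ['fp', 'fault', 'clear', 'fp', 'read']
```

For the first pattern only the last fp op (index 4) needs its flags; for the second, the potentially faulting op keeps the first fp op (index 0) live as well as index 3, which is where the deferred re-run would apply.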

>
> Some documentation here: http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html
>
> Regards,
> Dino Papararo
>
> -----Original Message-----
> From: Qemu-devel <qemu-devel-bounces+skizzato73=msn.com@nongnu.org> On behalf of Alex Bennée
> Sent: Tuesday, 28 April 2020 10:37
> To: luoyonggang@gmail.com
> Cc: qemu-ppc@nongnu.org; qemu-devel@nongnu.org
> Subject: Re: About hardfloat in ppc
>
>
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>
>> I am confused about why we can use hard-float only when inexact is set.
>
> The inexact behaviour of the host hardware may be different from the guest architecture we are trying to emulate and the host hardware may not be configurable to emulate the guest mode.
>
> Have a look in softfloat.c and see all the places where float_flag_inexact is set. Can you convince yourself that the host hardware will do the same?
>
>> And PPC always clears the inexact flag before calling the soft-float
>> functions, so we cannot optimize it with hard-float.
>> I need some resources about the inexact flag and why it is always
>> cleared in PPC FP simulation.
>
> Because that is the behaviour of the PPC floating point unit. The inexact flag will represent the last operation done.
>
>> I am looking for two possible solutions:
>> 1. do not clear the inexact flag in PPC simulation
>> 2. even when inexact is cleared, still use an alternative hard-float path.
>>
>> But for now I am a beginner and have no clue about all of this.
>
> Well you'll need to learn about floating point because these are rather fundamental aspects of its behaviour. In the old days QEMU used to use the host floating point processor with its template based translation.
> However this led to lots of weird bugs because the floating point answers under qemu were different from the target it was trying to emulate. It was for this reason softfloat was introduced. The hardfloat optimisation can only be done when we are confident that we will get the exact same answer as the target we are trying to emulate - a "faster but incorrect" mode is just going to cause confusion as discussed in the previous thread. Have you read that yet?
>
>>
>> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>>
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>
>>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
>>> >> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>>> >>> Because ppc fpu-helper are always clearing float_flag_inexact, So
>>> >>> is that possible to optimize the performance when float_flag_inexact
>>> >>> are cleared?
>>> >>
>>> >> There was some discussion about this in the last thread about 
>>> >> enabling hardfloat for PPC. See the thread:
>>> >>
>>> >>  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>>> >>  Date: Tue, 18 Feb 2020 18:10:16 +0100
>>> >>  Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>
>>> >
>>> > I've answered this already with link to that thread here:
>>> >
>>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
>>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
>>> > : From: BALATON Zoltan <balaton@eik.bme.hu>
>>> > : To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
>>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
>>> > :     qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
>>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>>> > :
>>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
>>> > :> Is this stable now? I'd like to see hard float landed :)
>>> > :
>>> > : If you want to see hardfloat for PPC then you should read the
>>> > : replies to this patch, which can be found here:
>>> > :
>>> > : http://patchwork.ozlabs.org/patch/1240235/
>>> > :
>>> > : to understand what's needed, then try to implement the solution
>>> > : with FP exceptions cached in a global that maybe could work. I
>>> > : won't be able to do that, as said here:
>>> > :
>>> > : https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
>>> > :
>>> > : because I don't have time to learn all the details needed. I think
>>> > : others are in the same situation so unless somebody puts in the
>>> > : necessary effort this won't change.
>>> >
>>> > Which also had a proposed solution to the problem that you could 
>>> > try to implement, in particular see this message:
>>> >
>>> >
>>> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124
>>> >
>>> > and Richard's reply immediately below that. In short, to optimise
>>> > FPU emulation we would either find a way to compute the inexact flag
>>> > quickly without reading the FPU status (this may not be possible)
>>> > or somehow get the status from the FPU, but the obvious way of
>>> > clearing the flags and reading them after each operation is too slow.
>>> > So maybe using exceptions and only clearing when there's actually a
>>> > change could be faster.
>>> >
>>> > As to how to use exceptions see this message in above thread:
>>> >
>>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html
>>> >
>>> > But that's only to show how to hook in an exception handler; what it
>>> > does needs to be implemented, then tested and benchmarked.
>>> >
>>> > I still don't know where the extensive PPC floating point tests
>>> > to use for checking results are, though, as that was never answered.
>>>
>>> Specifically for PPC we don't have them. We use the softfloat test 
>>> cases to exercise our softfloat/hardfloat code as part of "make 
>>> check-softfloat". You can also re-build fp-bench for each guest 
>>> target to measure raw throughput.
>>>
>>> >> However in short the problem is if the float_flag_inexact is clear 
>>> >> you must use softfloat so you can properly calculate the inexact 
>>> >> status. We can't take advantage of the inexact stickiness without 
>>> >> losing the fidelity of the calculation.
>>> >
>>> > I still don't get why can't we use hardware via exception handler 
>>> > to detect flags for us and why do we only use hardfloat in some 
>>> > corner cases. If reading the status is too costly then we could 
>>> > mirror it in a global which is set by an FP exception handler. 
>>> > Shouldn't that be faster? Is there a reason that can't work?
>>>
>>> It would work but it would be slow. Almost every FP operation sets 
>>> the inexact flag so it would generate an exception and exceptions 
>>> take time to process.
>>>
>>> For the guests where we use hardfloat, operations with inexact already
>>> latched are not a corner case - they are the common case, which is why
>>> it helps.
>>>
>>> >
>>> > Regards,
>>> > BALATON Zoltan
>>>
>>>
>>> --
>>> Alex Bennée
>>>


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: About hardfloat in ppc
  2020-04-29 11:57             ` Alex Bennée
@ 2020-04-29 12:33               ` 罗勇刚(Yonggang Luo)
  2020-04-29 13:38                 ` Alex Bennée
  2020-04-29 14:31               ` R: " Dino Papararo
  2020-04-29 23:12               ` R: " 罗勇刚(Yonggang Luo)
  2 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-29 12:33 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Dino Papararo

[-- Attachment #1: Type: text/plain, Size: 10859 bytes --]

On Wed, Apr 29, 2020 at 7:57 PM Alex Bennée <alex.bennee@linaro.org> wrote:

>
> Dino Papararo <skizzato73@msn.com> writes:
>
> > Hello,
> > about handling PPC FPU exceptions and hard float support, we could
> > consider a different approach for different instructions.
> > For example, not all FPU instructions care about the inexact or
> > exception bits: if I take a simple fadd f0,f1,f2, I'll copy the value
> > derived from adding f1+f2 into the f1 register, and no one will check
> > the inexact or exception bits raised in the FPSCR register.
> > If instead I take fadd. f0,f1,f2, the dot following the add
> > instruction means I want to take the inexact or exception bits into
> > account.
> > So I could use hard floats for the first case and softfloats for the second case.
> > Could this be a fast way to start implementing hard floats for PPC?
>
> While it may be true that normal software practice is not to read the
> exception registers for every operation, we can't base our emulation on
> that. We must always be able to re-create the state of the exception
> registers whenever they may be read by the program. There are three
> cases where this may happen:
>
>   - a direct read of the inexact flag (e.g. from the FPSCR)
>   - checking the sigcontext of a synchronous exception (e.g. fault)
>   - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)
>
> Given the way the translator works we can simplify the asynchronous case
> because we know such exceptions are only ever delivered at the start of
> translated blocks. We must have a fully rectified system state at the end
> of every block. So let's consider some cases:
>
>   fpOpA
>   clear flags
>   fpOpB
>   clear flags
>   fpOpC
>   read flags
>
So we only need to honour the clear flags that comes before the fp op
running just ahead of the read flags? So the key point is to find every
read flags op, and then the latest clear flags op before the last fp
instruction that precedes that read. Can this be expressed in TCG ops?



>
> Assuming we know the fpOps can't generate exceptions, we know that
> only fpOpC will ever generate user-visible floating point flags, so we
> can indeed use hardfloat for fpOpA and fpOpB. However if we see the
> pattern:
>
>   fpOpA
>   ld/st
>
What does ld/st mean? Loading and storing floating point values?


>   clear flags
>   fpOpB
>   read flags
>
> we must have the fully rectified version of the flags because the ld/st
> may fault. However it's not guaranteed it will fault so we could defer
> the flag calculation for fpOpA until such time as we need it. The
> easiest way would be to save the values going into the operation and
> then re-run it in softfloat when required (hopefully never ;-).
>
> A lot will depend on the behaviour of the architecture. For example:
>
>   fpOpA
>   fpOpB
>   read flags
>
> whether or not we need to be able to calculate the flags for fpOpA will
> depend on if fpOpB completely resets the flags visible or if the result
> is additive.
>
> So in short I think there may be scope for using hardfloat but it will
> require the front-end to know when it is safe to skip flag
> calculation in particular cases. We might even need support within TCG
> for saving (and marking) temporaries over potentially faulting
> boundaries so these lazy evaluations can be done. We can certainly add a
> fp-status-less set of primitives to softfloat which can use the
> hardfloat path when we know we are using normal numbers.
>
> >
> > Some documentation here:
> > http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html
> >
> > Regards,
> > Dino Papararo
> >
> > -----Original Message-----
> > From: Qemu-devel <qemu-devel-bounces+skizzato73=msn.com@nongnu.org> On behalf of Alex Bennée
> > Sent: Tuesday, 28 April 2020 10:37
> > To: luoyonggang@gmail.com
> > Cc: qemu-ppc@nongnu.org; qemu-devel@nongnu.org
> > Subject: Re: About hardfloat in ppc
> >
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> I am confused about why we can use hard-float only when inexact is set.
> >
> > The inexact behaviour of the host hardware may be different from the
> > guest architecture we are trying to emulate and the host hardware may not
> > be configurable to emulate the guest mode.
> >
> > Have a look in softfloat.c and see all the places where
> > float_flag_inexact is set. Can you convince yourself that the host hardware
> > will do the same?
> >
> >> And PPC always clears the inexact flag before calling the soft-float
> >> functions, so we cannot optimize it with hard-float.
> >> I need some resources about the inexact flag and why it is always
> >> cleared in PPC FP simulation.
> >
> > Because that is the behaviour of the PPC floating point unit. The
> > inexact flag will represent the last operation done.
> >
> >> I am looking for two possible solutions:
> >> 1. do not clear the inexact flag in PPC simulation
> >> 2. even when inexact is cleared, still use an alternative hard-float path.
> >>
> >> But for now I am a beginner and have no clue about all of this.
> >
> > Well you'll need to learn about floating point because these are rather
> > fundamental aspects of its behaviour. In the old days QEMU used to use the
> > host floating point processor with its template based translation.
> > However this led to lots of weird bugs because the floating point
> > answers under qemu were different from the target it was trying to
> > emulate. It was for this reason softfloat was introduced. The hardfloat
> > optimisation can only be done when we are confident that we will get the
> > exact same answer as the target we are trying to emulate - a "faster but
> > incorrect" mode is just going to cause confusion as discussed in the
> > previous thread. Have you read that yet?
> >
> >>
> >> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée <alex.bennee@linaro.org> wrote:
> >>
> >>>
> >>> BALATON Zoltan <balaton@eik.bme.hu> writes:
> >>>
> >>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
> >>> >> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >>> >>> Because ppc fpu-helper are always clearing float_flag_inexact, So
> >>> >>> is that possible to optimize the performance when float_flag_inexact
> >>> >>> are cleared?
> >>> >>
> >>> >> There was some discussion about this in the last thread about
> >>> >> enabling hardfloat for PPC. See the thread:
> >>> >>
> >>> >>  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> >>  Date: Tue, 18 Feb 2020 18:10:16 +0100
> >>> >>  Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>
> >>> >
> >>> > I've answered this already with link to that thread here:
> >>> >
> >>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
> >>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
> >>> > : From: BALATON Zoltan <balaton@eik.bme.hu>
> >>> > : To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
> >>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
> >>> > :     qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
> >>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> > :
> >>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> >>> > :> Is this stable now? I'd like to see hard float landed :)
> >>> > :
> >>> > : If you want to see hardfloat for PPC then you should read the
> >>> > : replies to this patch, which can be found here:
> >>> > :
> >>> > : http://patchwork.ozlabs.org/patch/1240235/
> >>> > :
> >>> > : to understand what's needed, then try to implement the solution
> >>> > : with FP exceptions cached in a global that maybe could work. I
> >>> > : won't be able to do that, as said here:
> >>> > :
> >>> > : https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
> >>> > :
> >>> > : because I don't have time to learn all the details needed. I think
> >>> > : others are in the same situation so unless somebody puts in the
> >>> > : necessary effort this won't change.
> >>> >
> >>> > Which also had a proposed solution to the problem that you could
> >>> > try to implement, in particular see this message:
> >>> >
> >>> >
> >>> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124
> >>> >
> >>> > and Richard's reply immediately below that. In short, to optimise
> >>> > FPU emulation we would either find a way to compute the inexact flag
> >>> > quickly without reading the FPU status (this may not be possible)
> >>> > or somehow get the status from the FPU, but the obvious way of
> >>> > clearing the flags and reading them after each operation is too slow.
> >>> > So maybe using exceptions and only clearing when there's actually a
> >>> > change could be faster.
> >>> >
> >>> > As to how to use exceptions see this message in the above thread:
> >>> >
> >>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html
> >>> >
> >>> > But that's only to show how to hook in an exception handler; what it
> >>> > does needs to be implemented. Then tested and benchmarked.
> >>> >
> >>> > I still don't know where the extensive PPC floating point tests
> >>> > to use for checking results are, though, as that was never answered.
> >>>
> >>> Specifically for PPC we don't have them. We use the softfloat test
> >>> cases to exercise our softfloat/hardfloat code as part of "make
> >>> check-softfloat". You can also re-build fp-bench for each guest
> >>> target to measure raw throughput.
> >>>
> >>> >> However in short the problem is if the float_flag_inexact is clear
> >>> >> you must use softfloat so you can properly calculate the inexact
> >>> >> status. We can't take advantage of the inexact stickiness without
> >>> >> losing the fidelity of the calculation.
> >>> >
> >>> > I still don't get why can't we use hardware via exception handler
> >>> > to detect flags for us and why do we only use hardfloat in some
> >>> > corner cases. If reading the status is too costly then we could
> >>> > mirror it in a global which is set by an FP exception handler.
> >>> > Shouldn't that be faster? Is there a reason that can't work?
> >>>
> >>> It would work but it would be slow. Almost every FP operation sets
> >>> the inexact flag so it would generate an exception and exceptions
> >>> take time to process.
> >>>
> >>> For the guests where we use hardfloat, operations with inexact already
> >>> latched are not a corner case - they are the common case, which is why
> >>> it helps.
> >>>
> >>> >
> >>> > Regards,
> >>> > BALATON Zoltan
> >>>
> >>>
> >>> --
> >>> Alex Bennée
> >>>
>
>
> --
> Alex Bennée
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo

[-- Attachment #2: Type: text/html, Size: 15136 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: About hardfloat in ppc
  2020-04-29 12:33               ` 罗勇刚(Yonggang Luo)
@ 2020-04-29 13:38                 ` Alex Bennée
  0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-04-29 13:38 UTC (permalink / raw)
  To: luoyonggang
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Dino Papararo


罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:

> On Wed, Apr 29, 2020 at 7:57 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>>
>> Dino Papararo <skizzato73@msn.com> writes:
>>
>> > Hello,
>> > about handling of PPC FPU exceptions and hardfloat support, we could
>> > consider a different approach for different instructions.
>> > i.e. not all FPU instructions care about the inexact or exception
>> > bits: if I take a simple fadd f0,f1,f2 I'll copy the value derived from
>> > adding f1+f2 into the f0 register and no one will check the inexact or
>> > exception bits raised in the FPSCR register.
>> > If instead I take fadd. f0,f1,f2 the dot following the add
>> > instruction means I want to take the inexact or exception bits into
>> > account.
>> > So I could use hardfloat for the first case and softfloat for the
>> > second case.
>> > Could this be a fast way to start implementing hardfloat for PPC?
>>
>> While it may be true that normal software practice is not to read the
>> exception registers for every operation we can't base our emulation on
>> that. We must always be able to re-create the state of the exception
>> registers whenever they may be read by the program. There are 3 cases
>> this may happen:
>>
>>   - a direct read of the inexact register
>>   - checking the sigcontext of a synchronous exception (e.g. fault)
>>   - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)
>>
>> Given the way the translator works we can simplify the asynchronous case
>> because we know they are only ever delivered at the start of translated
>> blocks. We must have a fully rectified system state at the end of every
>> block. So lets consider some cases:
>>
>>   fpOpA
>>   clear flags
>>   fpOpB
>>   clear flags
>>   fpOpC
>>   read flags
>>
> So we only need to clear flags before the fp ops that run before a read
> flags is triggered? So the key point is finding all the read flags ops,
> and finding the latest clear flags op before the latest fp op instruction
> that precedes the read flags. Can this be expressed in TCG ops?

In the simple case of flags not being able to be read from a chain of
operations this could all be handled in the front end by using a
different set of helpers (or maybe tweaking the helper to handle a NULL
fpst?) when it knows the values won't be needed.

The trouble is scanning forward enough to know this is the case as the
way the decoders currently work is by dealing with an instruction at a
time. There are some cases where we use tcg_last_op() to save the
location of an operation and then tcg_set_insn_param() to update a
parameter after the fact. You could save the location of every fpOp
with tcg_last_op() and then go through each one updating the parameters
to the helper to indicate if you care about calculating the flags or
not.
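
A toy model of that record-then-patch idea, assuming a made-up op list
rather than real TCG ops. ToyOp, ToyKind and mark_fp_flag_needs() are
hypothetical names, and need_flags stands in for the helper parameter
that tcg_set_insn_param() would rewrite. It treats a flag clear as
killing the flags of earlier fpOps, and treats ld/st and flag reads as
points where the flags must be right:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a translated block; these are not QEMU's TCG types. */
typedef enum { OP_FP, OP_CLEAR_FLAGS, OP_READ_FLAGS, OP_LDST } ToyKind;

typedef struct {
    ToyKind kind;
    bool need_flags; /* meaningful for OP_FP only; patched after the fact */
} ToyOp;

#define MAX_FP_OPS 16

/*
 * Walk the block once, remembering where each fp op was emitted (the role
 * tcg_last_op() plays in the description above), then patch need_flags
 * once we know whether anything can observe that op's flags.
 */
static void mark_fp_flag_needs(ToyOp *ops, size_t n)
{
    size_t pending[MAX_FP_OPS];
    size_t npending = 0;

    for (size_t i = 0; i < n; i++) {
        switch (ops[i].kind) {
        case OP_FP:
            ops[i].need_flags = true; /* conservative default */
            if (npending < MAX_FP_OPS) {
                pending[npending++] = i;
            }
            break;
        case OP_CLEAR_FLAGS:
            /* Flags of the pending fp ops are wiped before anyone saw them. */
            for (size_t j = 0; j < npending; j++) {
                ops[pending[j]].need_flags = false;
            }
            npending = 0;
            break;
        case OP_READ_FLAGS:
        case OP_LDST:
            /* Flags are read, or a fault could expose them: keep them. */
            npending = 0;
            break;
        }
    }
    /* Ops still pending at the end of the block keep need_flags = true. */
}
```

Any fp op that does not fit in the pending array simply keeps the
conservative default, so running out of slots costs speed, not
correctness.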

>> Assuming we know the fpOps can't generate exceptions we can know that
>> only fpOpC will ever generate a user visible floating point flags so we
>> can indeed use hardfloat for fpOpA and fpOpB. However if we see the
>> pattern:
>>
>>   fpOpA
>>   ld/st
>>
> What does ld/st mean? Load and store of floating point values?

Generally any load or store to memory has the potential to fault
regardless of what it is actually storing. There may be other
potentially faulting instructions as well - it will depend on your
architecture.

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* R: R: About hardfloat in ppc
  2020-04-29 11:57             ` Alex Bennée
  2020-04-29 12:33               ` 罗勇刚(Yonggang Luo)
@ 2020-04-29 14:31               ` Dino Papararo
  2020-04-29 14:49                 ` Peter Maydell
  2020-04-29 18:25                 ` R: " Alex Bennée
  2020-04-29 23:12               ` R: " 罗勇刚(Yonggang Luo)
  2 siblings, 2 replies; 40+ messages in thread
From: Dino Papararo @ 2020-04-29 14:31 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, luoyonggang,
	qemu-ppc, Howard Spoelstra

Hi Alex,
maybe some pseudo code can better show what I mean 😊

if (ppc_fpu_instruction == USE_FPSCR) /* instruction have dot '.' so FPSCR will be updated and we need have care about it */
	soft_decode (ppc_fpu_instruction)
else  /* instruction have not dot '.' and FPSCR will be never updated and we don't need to have care about it -> maxspeed */
	hard_decode (ppc_fpu_instruction)

In PPC assembly, all instructions that need to take care of the inexact flag and/or exception flags are processed before the test instructions; look at the following exception handling example:

   fadd. f0,f1,f2 # f1 + f2 = f0. CR1 contains except.summary
   bta   4,error  # if bit 0 of CR1 is set, go to error
                  # bit 0 is set if any exception occurs
   .              # if clear, continue operation
   .
   .
error:
   mcrfs 2,1   # copy FPSCR bits 4-7 to CR field 2
               # now CR1 and CR2 (bits 6 through 10)
               # contain all exception bits from FPSCR
   bta   6,invalid   # CR bit 6 signals invalid
   bta   7,overflow  # CR bit 7 signals overflow
   bta   8,underflow # CR bit 8 signals underflow
   bta   9,divbyzero # CR bit 9 signals divide-by-zero
   bta   10,inexact  # CR bit 10 signals inexact

invalid:
   mcrfs 2,2   # copy FPSCR bits 8-11 to CR field 2
   mcrfs 3,3   # copy FPSCR bits 12-15 to CR field 3
   mcrfs 4,5   # copy FPSCR bits 20-23 to CR field 4
               # invalid bits are now CR bits 11-16 and bit 23

   # now do exception handling based on which invalid bit
   # is set

overflow:
   # do exception handling for overflow exception

underflow:
   # do exception handling for underflow exception

divbyzero:
   #do exception handling for the divide-by-zero exception

inexact:
   # do exception handling for the inexact exception

In this way you know as soon as possible whether you can go with hardfloat or not.

I leave it to you TCG experts to judge how it works and how to implement it; I'm only trying to explain a possible fast way to go (if at all possible) 😊
The large majority of software doesn't check for exceptions at all, and if I really want to pursue maximum precision I'll go for a software multiprecision library like GMP or MPFR.
So hardfloat 'should' be the first choice, and only if an instruction requires a precision/error check should it be processed in softfloat.

I hope I have added some new ideas to the discussion, thanks a lot Alex!

Dino

-----Original Message-----
From: Alex Bennée <alex.bennee@linaro.org>
Sent: Wednesday, 29 April 2020 13:57
To: Dino Papararo <skizzato73@msn.com>
Cc: luoyonggang@gmail.com; BALATON Zoltan <balaton@eik.bme.hu>; Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>; Programmingkid <programmingkidx@gmail.com>; Howard Spoelstra <hsp.cat7@gmail.com>; qemu-ppc@nongnu.org; qemu-devel@nongnu.org
Subject: Re: R: About hardfloat in ppc


Dino Papararo <skizzato73@msn.com> writes:

> Hello,
> about handling of PPC FPU exceptions and hardfloat support, we could consider a different approach for different instructions.
> i.e. not all FPU instructions care about the inexact or exception bits: if I take a simple fadd f0,f1,f2 I'll copy the value derived from adding f1+f2 into the f0 register and no one will check the inexact or exception bits raised in the FPSCR register.
> If instead I take fadd. f0,f1,f2 the dot following the add instruction means I want to take the inexact or exception bits into account.
> So I could use hardfloat for the first case and softfloat for the second case.
> Could this be a fast way to start implementing hardfloat for PPC?

While it may be true that normal software practice is not to read the exception registers for every operation we can't base our emulation on that. We must always be able to re-create the state of the exception registers whenever they may be read by the program. There are 3 cases this may happen:

  - a direct read of the inexact register
  - checking the sigcontext of a synchronous exception (e.g. fault)
  - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)

Given the way the translator works we can simplify the asynchronous case because we know they are only ever delivered at the start of translated blocks. We must have a fully rectified system state at the end of every block. So lets consider some cases:

  fpOpA
  clear flags
  fpOpB
  clear flags
  fpOpC
  read flags

Assuming we know the fpOps can't generate exceptions we can know that only fpOpC will ever generate a user visible floating point flags so we can indeed use hardfloat for fpOpA and fpOpB. However if we see the
pattern:

  fpOpA
  ld/st
  clear flags
  fpOpB
  read flags

we must have the fully rectified version of the flags because the ld/st may fault. However it's not guaranteed it will fault so we could defer the flag calculation for fpOpA until such time as we need it. The easiest way would be to save the values going into the operation and then re-run it in softfloat when required (hopefully never ;-).

A lot will depend on the behaviour of the architecture. For example:

  fpOpA
  fpOpB
  read flags

whether or not we need to be able to calculate the flags for fpOpA will depend on if fpOpB completely resets the flags visible or if the result is additive.

So in short I think there may be scope for using hardfloat but it will require knowledge of front-end knowing if it is safe to skip flag calculation in particular cases. We might even need support within TCG for saving (and marking) temporaries over potentially faulting boundaries so these lazy evaluations can be done. We can certainly add a fp-status less set of primitives to softfloat which can use the hardfloat path when we know we are using normal numbers.

>
> A little documentation here:
> http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html
>
> Regards,
> Dino Papararo
>
> -----Original Message-----
> From: Qemu-devel <qemu-devel-bounces+skizzato73=msn.com@nongnu.org> On
> behalf of Alex Bennée
> Sent: Tuesday, 28 April 2020 10:37
> To: luoyonggang@gmail.com
> Cc: qemu-ppc@nongnu.org; qemu-devel@nongnu.org
> Subject: Re: About hardfloat in ppc
>
>
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>
>> I am confused about why we can use hard-float only when inexact is set.
>
> The inexact behaviour of the host hardware may be different from the guest architecture we are trying to emulate and the host hardware may not be configurable to emulate the guest mode.
>
> Have a look in softfloat.c and see all the places where float_flag_inexact is set. Can you convince yourself that the host hardware will do the same?
>
>> And PPC always clears the inexact flag before calling the soft-float
>> functions, so we can not optimize it with hard-float.
>> I need some resources about the inexact flag and why it is always
>> cleared in the PPC FP simulation.
>
> Because that is the behaviour of the PPC floating point unit. The inexact flag will represent the last operation done.
>
>> I am looking at two possible solutions:
>> 1. do not clear the inexact flag in the PPC simulation
>> 2. even when the inexact flag is cleared, still use an alternative
>> hard-float path.
>>
>> But for now I am a beginner and have no clue about all these things.
>
> Well you'll need to learn about floating point because these are rather fundamental aspects of its behaviour. In the old days QEMU used to use the host floating point processor with its template based translation.
> However this led to lots of weird bugs because the floating point answers under qemu were different from the target it was trying to emulate. It was for this reason softfloat was introduced. The hardfloat optimisation can only be done when we are confident that we will get the exact same answer as the target we are trying to emulate - a "faster but incorrect" mode is just going to cause confusion as discussed in the previous thread. Have you read that yet?
>
>>
>> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>>
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>
>>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
>>> >> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>>> >>> Because the ppc fpu-helpers are always clearing float_flag_inexact,
>>> >>> is it possible to optimize the performance when float_flag_inexact
>>> >>> is cleared?
>>> >>
>>> >> There was some discussion about this in the last thread about 
>>> >> enabling hardfloat for PPC. See the thread:
>>> >>
>>> >>  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>>> >>  Date: Tue, 18 Feb 2020 18:10:16 +0100
>>> >>  Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>
>>> >
>>> > I've answered this already with link to that thread here:
>>> >
>>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
>>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
>>> > : From: BALATON Zoltan <balaton@eik.bme.hu>
>>> > : To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
>>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
>>> > :     qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
>>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>>> > :
>>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
>>> > :> Is this stable now? I'd like to see hardfloat landed :)
>>> > :
>>> > : If you want to see hardfloat for PPC then you should read the
>>> > : replies to this patch which can be found here:
>>> > :
>>> > : http://patchwork.ozlabs.org/patch/1240235/
>>> > :
>>> > : to understand what's needed then try to implement the solution
>>> > : with FP exceptions cached in a global that maybe could work. I
>>> > : won't be able to do that as said here:
>>> > :
>>> > : https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
>>> > :
>>> > : because I don't have time to learn all the details needed. I think
>>> > : others are in the same situation so unless somebody puts in the
>>> > : necessary effort this won't change.
>>> >
>>> > Which also had a proposed solution to the problem that you could 
>>> > try to implement, in particular see this message:
>>> >
>>> >
>>> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124
>>> >
>>> > and Richard's reply immediately below that. In short, to optimise
>>> > FPU emulation we would either find a way to compute the inexact flag
>>> > quickly without reading the FPU status (this may not be possible)
>>> > or somehow get the status from the FPU, but the obvious way of
>>> > clearing the flags and reading them after each operation is too slow.
>>> > So maybe using exceptions and only clearing when there's actually a
>>> > change could be faster.
>>> >
>>> > As to how to use exceptions see this message in the above thread:
>>> >
>>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html
>>> >
>>> > But that's only to show how to hook in an exception handler; what
>>> > it does needs to be implemented. Then tested and benchmarked.
>>> >
>>> > I still don't know where the extensive PPC floating point tests
>>> > to use for checking results are, though, as that was never answered.
>>>
>>> Specifically for PPC we don't have them. We use the softfloat test 
>>> cases to exercise our softfloat/hardfloat code as part of "make 
>>> check-softfloat". You can also re-build fp-bench for each guest 
>>> target to measure raw throughput.
>>>
>>> >> However in short the problem is if the float_flag_inexact is 
>>> >> clear you must use softfloat so you can properly calculate the 
>>> >> inexact status. We can't take advantage of the inexact stickiness 
>>> >> without losing the fidelity of the calculation.
>>> >
>>> > I still don't get why can't we use hardware via exception handler 
>>> > to detect flags for us and why do we only use hardfloat in some 
>>> > corner cases. If reading the status is too costly then we could 
>>> > mirror it in a global which is set by an FP exception handler.
>>> > Shouldn't that be faster? Is there a reason that can't work?
>>>
>>> It would work but it would be slow. Almost every FP operation sets 
>>> the inexact flag so it would generate an exception and exceptions 
>>> take time to process.
>>>
>>> For the guests where we use hardfloat, operations with inexact
>>> already latched are not a corner case - they are the common case,
>>> which is why it helps.
>>>
>>> >
>>> > Regards,
>>> > BALATON Zoltan
>>>
>>>
>>> --
>>> Alex Bennée
>>>


--
Alex Bennée

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: About hardfloat in ppc
  2020-04-29 14:31               ` R: " Dino Papararo
@ 2020-04-29 14:49                 ` Peter Maydell
  2020-04-29 18:25                 ` R: " Alex Bennée
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Maydell @ 2020-04-29 14:49 UTC (permalink / raw)
  To: Dino Papararo
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, luoyonggang,
	qemu-ppc, Howard Spoelstra, Alex Bennée

On Wed, 29 Apr 2020 at 15:33, Dino Papararo <skizzato73@msn.com> wrote:
>
> Hi Alex,
> maybe a pseudo code can show better what I mean
>
> if (ppc_fpu_instruction == USE_FPSCR) /* instruction have dot '.' so FPSCR will be updated and we need have care about it */
>         soft_decode (ppc_fpu_instruction)
> else  /* instruction have not dot '.' and FPSCR will be never updated and we don't need to have care about it -> maxspeed */
>         hard_decode (ppc_fpu_instruction)

My understanding was that the '.' indicates whether
the instruction updates CR1 (the condition register),
which is separate from whether it updates FPSCR
flags. So all insns update FPSCR flags; insns with
a '.' additionally update CR state which can be
tested by a following branch insn. (I'm not a PPC
expert but that's what my reading of the ISA spec is.)
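
A small sketch of that distinction, assuming the 32-bit FPSCR view with
IBM bit numbering (bit 0 is the most significant bit) and a made-up
ToyCpu in place of real CPU state. The Rc bit is the lowest bit of the
instruction word, and with Rc=1 the top four FPSCR bits (FX, FEX, VX,
OX) are copied into CR1; the FPSCR update itself happens either way.
ToyCpu and toy_fp_writeback() are hypothetical names, not QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 32-bit FPSCR view, IBM numbering: bit 0 is the most significant bit. */
#define FPSCR_FX 0x80000000u /* bit 0: exception summary */
#define FPSCR_XX 0x02000000u /* bit 6: inexact exception */

/* Hypothetical minimal CPU state for this sketch. */
typedef struct {
    uint32_t fpscr;
    uint8_t cr1; /* 4-bit CR field 1 */
} ToyCpu;

/* Rc is the least significant bit of the instruction word. */
static bool insn_has_rc(uint32_t insn)
{
    return (insn & 1u) != 0;
}

/* Status update after an arithmetic fp insn that may have been inexact. */
static void toy_fp_writeback(ToyCpu *cpu, uint32_t insn, bool inexact)
{
    /* The FPSCR is updated whether or not the insn carries a dot. */
    if (inexact) {
        if (!(cpu->fpscr & FPSCR_XX)) {
            cpu->fpscr |= FPSCR_FX; /* an exception bit went from 0 to 1 */
        }
        cpu->fpscr |= FPSCR_XX;
    }
    /* The dot only adds the copy of FX, FEX, VX, OX into CR1. */
    if (insn_has_rc(insn)) {
        cpu->cr1 = (uint8_t)(cpu->fpscr >> 28);
    }
}
```

For example 0xFC01102A encodes fadd f0,f1,f2 and 0xFC01102B encodes
fadd. f0,f1,f2; both leave the same FPSCR value behind, and only the
second one touches CR1.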

> In PPC assembly, all instructions that need to take care of the inexact flag and/or exception flags are processed before the test instructions; look at the following exception handling example:
>
>    fadd. f0,f1,f2 # f1 + f2 = f0. CR1 contains except.summary
>    bta   4,error  # if bit 0 of CR1 is set, go to error
>                   # bit 0 is set if any exception occurs
>    .              # if clear, continue operation
>    .
>    .
> error:
>    mcrfs 2,1   # copy FPSCR bits 4-7 to CR field 2
>                # now CR1 and CR2 (bits 6 through 10)
>                # contain all exception bits from FPSCR

This may be a common pattern, but the architecture doesn't
require it. You could equally do

    fadd f0,f1,f2   # insn which sets fpscr bits
    mffs 30         # copy whole fpscr to a gp register
    # now do stuff based on that value

So unless you can tell for certain that nothing in
the future guest execution can read the relevant FPSCR bits
before they're overwritten, you have to generate them
correctly; or be able to re-generate them later, if
you want to get fancy (you could imagine a scheme
similar to how we handle CPU condition flags on
some guests, where instead of calculating them every
time we make a note of what the operation that should
have set them was, so that at the point where the
guest actually does read the fpscr or do something
else that demands the real flag value we can recreate
them, in this case by repeating the fp operation via
softfloat. Getting that working would be a non-trivial
project, though.)
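
A minimal sketch of such a lazy scheme for a single add, under loud
assumptions: it only tracks the flags of the most recent operation
(sticky bits would need every unresolved operation remembered or
flushed), and the slow path uses Knuth's TwoSum as a stand-in for
actually repeating the operation via softfloat. LazyFpState,
lazy_add(), lazy_read_flags() and TOY_FLAG_INEXACT are hypothetical
names, not QEMU or softfloat API:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FLAG_INEXACT 1u /* hypothetical flag bit, not a softfloat value */

typedef struct {
    bool pending;   /* last add ran on the fast path, flags not yet known */
    double a, b;    /* operands of that add, kept so it can be repeated */
    unsigned flags; /* flags of the most recent operation */
} LazyFpState;

/* Fast path: plain host addition, only the operands are noted. */
static double lazy_add(LazyFpState *s, double a, double b)
{
    s->pending = true;
    s->a = a;
    s->b = b;
    return a + b;
}

/*
 * Slow path, standing in for "repeat the operation via softfloat".
 * TwoSum recovers the rounding error of a + b; a non-zero error means
 * the sum was inexact. Assumes IEEE double arithmetic, round to nearest,
 * finite inputs and no overflow.
 */
static unsigned slow_add_flags(double a, double b)
{
    double sum = a + b;
    double b_virtual = sum - a;
    double a_virtual = sum - b_virtual;
    double err = (a - a_virtual) + (b - b_virtual);

    return err != 0.0 ? TOY_FLAG_INEXACT : 0u;
}

/* Called only where the guest can actually observe the flags. */
static unsigned lazy_read_flags(LazyFpState *s)
{
    if (s->pending) {
        s->flags = slow_add_flags(s->a, s->b);
        s->pending = false;
    }
    return s->flags;
}
```

The point of the shape is that lazy_add() costs two stores on top of
the host add, and the expensive part only runs if something reads the
flags.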

thanks
-- PMM


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: R: About hardfloat in ppc
  2020-04-29 14:31               ` R: " Dino Papararo
  2020-04-29 14:49                 ` Peter Maydell
@ 2020-04-29 18:25                 ` Alex Bennée
  2020-04-30  0:20                   ` 罗勇刚(Yonggang Luo)
  1 sibling, 1 reply; 40+ messages in thread
From: Alex Bennée @ 2020-04-29 18:25 UTC (permalink / raw)
  To: Dino Papararo
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, luoyonggang,
	qemu-ppc, Howard Spoelstra


Dino Papararo <skizzato73@msn.com> writes:

> Hi Alex,
<snip>
>
> I leave it to you TCG experts how it works and how to implement it, I'm
> only trying to explain a possible fast way to go (if at all possible) 😊

This is all a theoretical discussion unless someone cares enough to
improve the situation. While I have an interest in improving TCG
performance I'm afraid there are many easier wins before tackling a
target specific hack with which I'm not familiar. No doubt this thread
will be referred to next time someone wants something done about it.

> The large majority of software doesn't check for exceptions at all and if
> I really want to pursue max precision I'll go for a software
> multiprecision library like GMP or MPFR.

However for QEMU we regard failure to correctly emulate the architecture
as a bug - we don't code to common software patterns because there is
plenty of software out there that doesn't follow it.

> So hardfloat 'should' be the first choice, and only if an instruction
> requires a precision/error check should it be processed in softfloat.

Sure but someone will have to do the work to support that.

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: About hardfloat in ppc
  2020-04-29 11:57             ` Alex Bennée
  2020-04-29 12:33               ` 罗勇刚(Yonggang Luo)
  2020-04-29 14:31               ` R: " Dino Papararo
@ 2020-04-29 23:12               ` 罗勇刚(Yonggang Luo)
  2 siblings, 0 replies; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-29 23:12 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Dino Papararo

[-- Attachment #1: Type: text/plain, Size: 10879 bytes --]

On Wed, Apr 29, 2020 at 7:57 PM Alex Bennée <alex.bennee@linaro.org> wrote:

>
> Dino Papararo <skizzato73@msn.com> writes:
>
> > Hello,
> > about handling of PPC FPU exceptions and hardfloat support, we could
> > consider a different approach for different instructions.
> > i.e. not all FPU instructions care about the inexact or exception
> > bits: if I take a simple fadd f0,f1,f2 I'll copy the value derived from
> > adding f1+f2 into the f0 register and no one will check the inexact or
> > exception bits raised in the FPSCR register.
> > If instead I take fadd. f0,f1,f2 the dot following the add
> > instruction means I want to take the inexact or exception bits into
> > account.
> > So I could use hardfloat for the first case and softfloat for the
> > second case.
> > Could this be a fast way to start implementing hardfloat for PPC?
>
> While it may be true that normal software practice is not to read the
> exception registers for every operation we can't base our emulation on
> that. We must always be able to re-create the state of the exception
> registers whenever they may be read by the program. There are 3 cases
> this may happen:
>
>   - a direct read of the inexact register
>   - checking the sigcontext of a synchronous exception (e.g. fault)
>   - checking the sigcontext of an asynchronous exception (e.g. timer/IPI)
>
> Given the way the translator works we can simplify the asynchronous case
> because we know they are only ever delivered at the start of translated
> blocks. We must have a fully rectified system state at the end of every
> block. So lets consider some cases:
>
>   fpOpA
>   clear flags
>   fpOpB
>   clear flags
>   fpOpC
>   read flags
>
I am thinking about a new way to optimize this, if something like LLVM's
InstCombine is possible in TCG.
Suppose we have
clearFlagsFpOpA
clearFlagsFpOpB
clearFlagsFpOpC
clearFlagsFpOpD
Then we can InstCombine these into
FpOpA
FpOpB
FpOpC
clearFlagsFpOpD
Would this be a possible idea?
I think TCG has basic blocks, and we can optimize
TCG at the basic block level.



>
> Assuming we know the fpOps can't generate exceptions we can know that
> only fpOpC will ever generate a user visible floating point flags so we
> can indeed use hardfloat for fpOpA and fpOpB. However if we see the
> pattern:
>
>   fpOpA
>   ld/st
>   clear flags
>   fpOpB
>   read flags
>
> we must have the fully rectified version of the flags because the ld/st
> may fault. However it's not guaranteed it will fault so we could defer
> the flag calculation for fpOpA until such time as we need it. The
> easiest way would be to save the values going into the operation and
> then re-run it in softfloat when required (hopefully never ;-).
>
> A lot will depend on the behaviour of the architecture. For example:
>
>   fpOpA
>   fpOpB
>   read flags
>
> whether or not we need to be able to calculate the flags for fpOpA will
> depend on if fpOpB completely resets the flags visible or if the result
> is additive.
>
> So in short I think there may be scope for using hardfloat but it will
> require knowledge of front-end knowing if it is safe to skip flag
> calculation in particular cases. We might even need support within TCG
> for saving (and marking) temporaries over potentially faulting
> boundaries so these lazy evaluations can be done. We can certainly add a
> fp-status less set of primitives to softfloat which can use the
> hardfloat path when we know we are using normal numbers.
>
> >
> > A little of documentation here:
> http://mirror.informatimago.com/next/developer.apple.com/documentation/mac/PPCNumerics/PPCNumerics-154.html
> >
> > Regards,
> > Dino Papararo
> >
> > -----Original Message-----
> > From: Qemu-devel <qemu-devel-bounces+skizzato73=msn.com@nongnu.org> On
> > behalf of Alex Bennée
> > Sent: Tuesday, 28 April 2020 10:37
> > To: luoyonggang@gmail.com
> > Cc: qemu-ppc@nongnu.org; qemu-devel@nongnu.org
> > Subject: Re: About hardfloat in ppc
> >
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> I am confused about why we can use hard-float only when inexact is set.
> >
> > The inexact behaviour of the host hardware may be different from the
> > guest architecture we are trying to emulate and the host hardware may not
> > be configurable to emulate the guest mode.
> >
> > Have a look in softfloat.c and see all the places where
> > float_flag_inexact is set. Can you convince yourself that the host
> > hardware will do the same?
> >
> >> And PPC always clears the inexact flag before calling the soft-float
> >> functions, so we can not optimize it with hard-float.
> >> I need some resources about the inexact flag and why it is always
> >> cleared in the PPC FP simulation.
> >
> > Because that is the behaviour of the PPC floating point unit. The
> > inexact flag will represent the last operation done.
> >
> >> I am looking at two possible solutions:
> >> 1. do not clear the inexact flag in the PPC simulation
> >> 2. even when the inexact flag is cleared, still use an alternative
> >> hard-float path.
> >>
> >> But for now I am a beginner and have no clue about all these things.
> >
> > Well you'll need to learn about floating point because these are rather
> > fundamental aspects of its behaviour. In the old days QEMU used to use
> > the host floating point processor with its template based translation.
> > However this led to lots of weird bugs because the floating point
> > answers under qemu were different from the target it was trying to
> > emulate. It was for this reason softfloat was introduced. The hardfloat
> > optimisation can only be done when we are confident that we will get the
> > exact same answer as the target we are trying to emulate - a "faster but
> > incorrect" mode is just going to cause confusion as discussed in the
> > previous thread. Have you read that yet?
> >
> >>
> >> On Mon, Apr 27, 2020 at 7:10 PM Alex Bennée <alex.bennee@linaro.org>
> wrote:
> >>
> >>>
> >>> BALATON Zoltan <balaton@eik.bme.hu> writes:
> >>>
> >>> > On Mon, 27 Apr 2020, Alex Bennée wrote:
> >>> >> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >>> >>> Because the ppc fpu-helpers are always clearing float_flag_inexact,
> >>> >>> is it possible to optimize the performance when float_flag_inexact
> >>> >>> is cleared?
> >>> >>
> >>> >> There was some discussion about this in the last thread about
> >>> >> enabling hardfloat for PPC. See the thread:
> >>> >>
> >>> >>  Subject: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> >>  Date: Tue, 18 Feb 2020 18:10:16 +0100
> >>> >>  Message-Id: <20200218171702.979F074637D@zero.eik.bme.hu>
> >>> >
> >>> > I've answered this already with link to that thread here:
> >>> >
> >>> > On Fri, 10 Apr 2020, BALATON Zoltan wrote:
> >>> > : Date: Fri, 10 Apr 2020 20:04:53 +0200 (CEST)
> >>> > : From: BALATON Zoltan <balaton@eik.bme.hu>
> >>> > : To: "罗勇刚(Yonggang Luo)" <luoyonggang@gmail.com>
> >>> > : Cc: qemu-devel@nongnu.org, Mark Cave-Ayland, John Arbuckle,
> >>> qemu-ppc@nongnu.org, Paul Clarke, Howard Spoelstra, David Gibson
> >>> > : Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
> >>> > :
> >>> > : On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> >>> > :> Are this stable now? I'd like to see hard float to be landed:)
> >>> > :
> >>> > : If you want to see hardfloat for PPC then you should read the
> >>> > replies to : this patch which can be found here:
> >>> > :
> >>> > : http://patchwork.ozlabs.org/patch/1240235/
> >>> > :
> >>> > : to understand what's needed then try to implement the solution
> >>> > with FP : exceptions cached in a global that maybe could work. I
> >>> > won't be able to : do that as said here:
> >>> > :
> >>> > :
> >>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
> >>> > :
> >>> > : because I don't have time to learn all the details needed. I think
> :
> >>> > others are in the same situation so unless somebody puts in the :
> >>> > necessary effort this won't change.
> >>> >
> >>> > Which also had a proposed solution to the problem that you could
> >>> > try to implement, in particular see this message:
> >>> >
> >>> >
> >>> > http://patchwork.ozlabs.org/project/qemu-devel/patch/20200218171702.979F074637D@zero.eik.bme.hu/#2375124
> >>> >
> >>> > amd Richard's reply immediately below that. In short to optimise
> >>> > FPU emulation we would either find a way to compute inexact flag
> >>> > quickly without reading the FPU status (this may not be possible)
> >>> > or somehow get status from the FPU but the obvious way of claring
> >>> > the flag and reading them after each operation is too slow. So
> >>> > maybe using exceptions and only clearing when actually there's a
> >>> > change could be faster.
> >>> >
> >>> > As to how to use exceptions see this message in above thread:
> >>> >
> >>> > https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00005.html
> >>> >
> >>> > But that's only to show how to hook in an exception handler what it
> >>> > does needs to be implemented. Then tested and benchmarked.
> >>> >
> >>> > I still don't know where are the extensive PPC floating point tests
> >>> > to use for checking results though as that was never answered.
> >>>
> >>> Specifically for PPC we don't have them. We use the softfloat test
> >>> cases to exercise our softfloat/hardfloat code as part of "make
> >>> check-softfloat". You can also re-build fp-bench for each guest
> >>> target to measure raw throughput.
> >>>
> >>> >> However in short the problem is if the float_flag_inexact is clear
> >>> >> you must use softfloat so you can properly calculate the inexact
> >>> >> status. We can't take advantage of the inexact stickiness without
> >>> >> loosing the fidelity of the calculation.
> >>> >
> >>> > I still don't get why can't we use hardware via exception handler
> >>> > to detect flags for us and why do we only use hardfloat in some
> >>> > corner cases. If reading the status is too costly then we could
> >>> > mirror it in a global which is set by an FP exception handler.
> >>> > Shouldn't that be faster? Is there a reason that can't work?
> >>>
> >>> It would work but it would be slow. Almost every FP operation sets
> >>> the inexact flag so it would generate an exception and exceptions
> >>> take time to process.
> >>>
> >>> For the guests where we use hardfloat operations with inexact already
> >>> latched is not a corner case - it is the common case which is why it
> >>> helps.
> >>>
> >>> >
> >>> > Regards,
> >>> > BALATON Zoltan
> >>>
> >>>
> >>> --
> >>> Alex Bennée
> >>>
>
>
> --
> Alex Bennée
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo

[-- Attachment #2: Type: text/html, Size: 15134 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: R: About hardfloat in ppc
  2020-04-29 18:25                 ` R: " Alex Bennée
@ 2020-04-30  0:20                   ` 罗勇刚(Yonggang Luo)
  2020-04-30  2:18                     ` Richard Henderson
  0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-30  0:20 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Dino Papararo

[-- Attachment #1: Type: text/plain, Size: 1740 bytes --]

A question about hard-float, for the case where we don't want to read
the FP status register. For example, to compute c = a + b in fp32:
compute c = a + b in hard float,
then compute b1 = c - a in hard float.
If b1 != b at the bitwise level, we set the inexact bit to 1; otherwise
we set the inexact bit to 0. Is this valid?

We could also do this for a * b, a - b and a / b.
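
A minimal sketch of the check described above, in C and for addition only.
The function names are hypothetical and none of this is QEMU code. It
assumes round-to-nearest, finite operands, no overflow, and arithmetic
carried out in true fp32 (the volatile temporaries are there to discourage
excess precision). The single subtraction b1 = c - a can miss a rounding
error when |a| < |b|, so the sketch also shows Knuth's TwoSum, which
recovers the rounding error of the addition as a float; that error is zero
exactly when the sum was exact.

```c
#include <assert.h>
#include <stdbool.h>

/* True for finite values only: inf - inf and NaN - NaN are both NaN. */
static bool f32_finite(float x)
{
    return x - x == 0.0f;
}

/* The check as proposed: c = a + b, b1 = c - a, inexact if b1 != b. */
bool naive_add_inexact(float a, float b)
{
    volatile float c = a + b;
    volatile float b1 = c - a;
    return b1 != b;
}

/*
 * Knuth's TwoSum: err is the exact rounding error of s = a + b, so the
 * addition was inexact exactly when err is non-zero.
 */
bool twosum_add_inexact(float a, float b)
{
    assert(f32_finite(a) && f32_finite(b));
    volatile float s = a + b;
    volatile float bv = s - a;   /* the part of b that ended up in s */
    volatile float av = s - bv;  /* the part of a that ended up in s */
    volatile float db = b - bv;  /* what was lost from b */
    volatile float da = a - av;  /* what was lost from a */
    volatile float err = da + db;
    return err != 0.0f;
}
```

With a = 2^-30 and b = 1 the sum rounds to 1 and c - a also rounds to 1,
so the single-subtraction check reports "exact" although the addition
rounded; TwoSum reports it as inexact. The same trick does not carry over
directly to a * b or a / b.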


On Thu, Apr 30, 2020 at 2:25 AM Alex Bennée <alex.bennee@linaro.org> wrote:

>
> Dino Papararo <skizzato73@msn.com> writes:
>
> > Hi Alex,
> <snip>
> >
> > I leave to you TCG's experts how it works and how to implement it, I'm
> > only tryng to explain a possible fast way to go (if ever possible) 😊
>
> This is all a theoretical discussion unless someone cares enough to
> improve the situation. While I have an interest in improving TCG
> performance I'm afraid there are many more easier wins before tackling a
> target specific hack for which I'm not familiar. No doubt this thread
> will be referred to next time someone wants something done about it.
>
> > ..Large majority of software don't check for exceptions at all and if
> > I really want to pursue max precision I'll go for a software
> > multiprecision library like GMP or MPFR Libraries.
>
> However for QEMU we regard failure to correctly emulate the architecture
> as a bug - we don't code to common software patterns because there is
> plenty of software out there that doesn't follow it.
>
> > So the hardfloats 'should' be set as first choice and only if
> > instruction requires precision/error check process it in softfloats.
>
> Sure but someone will have to do the work to support that.
>
> --
> Alex Bennée
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo

[-- Attachment #2: Type: text/html, Size: 2405 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: R: About hardfloat in ppc
  2020-04-30  0:20                   ` 罗勇刚(Yonggang Luo)
@ 2020-04-30  2:18                     ` Richard Henderson
  2020-04-30  7:26                       ` 罗勇刚(Yonggang Luo)
  2020-04-30  8:13                       ` 罗勇刚(Yonggang Luo)
  0 siblings, 2 replies; 40+ messages in thread
From: Richard Henderson @ 2020-04-30  2:18 UTC (permalink / raw)
  To: luoyonggang, Alex Bennée
  Cc: Mark Cave-Ayland, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Dino Papararo

On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
> Question, in hard-float, if we don't want to read the fp register.
> for example: If we wanna compute c = a + b in fp32
> if c = a + b In hard float
> and if b1 = c - a in hard float
> if b1 != b at bitwise level, the we se the inexat to 1, otherwsie 
> we set inexat bit to 0? are this valid?
> 
> we can also do it for a * b, a - b, a / b. 
> 

That does seem plausible, for all of the normal values for which we would apply
the hard-float optimization anyway.  But we already check for the exceptional
cases:

    if (unlikely(f32_is_inf(ur))) {
        s->float_exception_flags |= float_flag_overflow;
    } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
        if (post == NULL || post(ua, ub)) {
            goto soft;
        }
    }


r~


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: R: About hardfloat in ppc
  2020-04-30  2:18                     ` Richard Henderson
@ 2020-04-30  7:26                       ` 罗勇刚(Yonggang Luo)
  2020-04-30  8:11                         ` Alex Bennée
  2020-04-30  8:13                       ` 罗勇刚(Yonggang Luo)
  1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-30  7:26 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Dino Papararo, Mark Cave-Ayland, qemu-devel, Programmingkid,
	qemu-ppc, Howard Spoelstra, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 1155 bytes --]

On Thu, Apr 30, 2020 at 10:18 AM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
> > Question, in hard-float, if we don't want to read the fp register.
> > for example: If we wanna compute c = a + b in fp32
> > if c = a + b In hard float
> > and if b1 = c - a in hard float
> > if b1 != b at bitwise level, the we se the inexat to 1, otherwsie
> > we set inexat bit to 0? are this valid?
> >
> > we can also do it for a * b, a - b, a / b.
> >
>
> That does seem plausible, for all of the normal values for which we would
> apply
> the hard-float optimization anyway.  But we already check for the
> exceptional
> cases:
>
>     if (unlikely(f32_is_inf(ur))) {
>         s->float_exception_flags |= float_flag_overflow;
>     } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
>         if (post == NULL || post(ua, ub)) {
>             goto soft;
>         }
>     }
>
I mean removing all of these exceptional-case checks and detecting
float exceptions with hard-float operations instead.

>
> r~
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo

[-- Attachment #2: Type: text/html, Size: 1820 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: R: About hardfloat in ppc
  2020-04-30  7:26                       ` 罗勇刚(Yonggang Luo)
@ 2020-04-30  8:11                         ` Alex Bennée
  0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-04-30  8:11 UTC (permalink / raw)
  To: luoyonggang
  Cc: Richard Henderson, Mark Cave-Ayland, qemu-devel, Programmingkid,
	qemu-ppc, Howard Spoelstra, Dino Papararo


罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:

> On Thu, Apr 30, 2020 at 10:18 AM Richard Henderson <
> richard.henderson@linaro.org> wrote:
>
>> On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
>> > Question, in hard-float, if we don't want to read the fp register.
>> > for example: If we wanna compute c = a + b in fp32
>> > if c = a + b In hard float
>> > and if b1 = c - a in hard float
>> > if b1 != b at bitwise level, the we se the inexat to 1, otherwsie
>> > we set inexat bit to 0? are this valid?
>> >
>> > we can also do it for a * b, a - b, a / b.
>> >
>>
>> That does seem plausible, for all of the normal values for which we would
>> apply
>> the hard-float optimization anyway.  But we already check for the
>> exceptional
>> cases:
>>
>>     if (unlikely(f32_is_inf(ur))) {
>>         s->float_exception_flags |= float_flag_overflow;
>>     } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
>>         if (post == NULL || post(ua, ub)) {
>>             goto soft;
>>         }
>>     }
>>
> I means remove of all thse exceptional cases, and detecting float
> exception by hard float operation.

When this was originally done it was found to be faster testing for the
float conditions in software (which are basically bitops) than reading
the FP exception register which can be a high latency operation.

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: R: About hardfloat in ppc
  2020-04-30  2:18                     ` Richard Henderson
  2020-04-30  7:26                       ` 罗勇刚(Yonggang Luo)
@ 2020-04-30  8:13                       ` 罗勇刚(Yonggang Luo)
  2020-04-30 15:35                         ` BALATON Zoltan
  1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-04-30  8:13 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Dino Papararo, Mark Cave-Ayland, qemu-devel, Programmingkid,
	qemu-ppc, Howard Spoelstra, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 1912 bytes --]

I propose a new way of computing the float flags.
We keep a cache of floating-point operations:

    typedef struct FpRecord {
        uint8_t op;
        float32 A;
        float32 B;
    } FpRecord;
    FpRecord fp_cache[1024];
    int fp_cache_length;
    uint32_t fp_exceptions;

1. Each new fp operation is pushed onto fp_cache.
2. When fp_exceptions is read, we recompute it by re-running the
   recorded FpRecord sequence, then reset fp_cache_length to 0.
3. When fp_exceptions is cleared, we set fp_cache_length to 0 and
   clear fp_exceptions.
4. When fp_cache is full, we recompute fp_exceptions by re-running
   the recorded FpRecord sequence.

Would this be a general method for using hard-float?
The time consumed should be 2 * hard_float.
Considering that reads of fp_exceptions are rare, the amortized time
would be 1 * hard_float.
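
The four steps above can be sketched as follows. This is only an
illustration under stated assumptions, not QEMU code: the FpFlagCache
type, the fp_cache_* functions and the FP_FLAG_INEXACT bit are
hypothetical names, only add and sub are handled, and the replay uses a
TwoSum test as a stand-in for re-running the operation in softfloat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { FP_OP_ADD, FP_OP_SUB };        /* only two ops in this sketch */
enum { FP_FLAG_INEXACT = 1 };         /* hypothetical flag bit */
enum { FP_CACHE_SIZE = 1024 };

typedef struct FpRecord {
    uint8_t op;
    float a;
    float b;
} FpRecord;

typedef struct FpFlagCache {
    FpRecord rec[FP_CACHE_SIZE];
    int len;                          /* number of pending records */
    uint32_t exceptions;              /* flags computed so far */
} FpFlagCache;

/* Stand-in for a softfloat re-run: did a + b round? (TwoSum) */
static bool add_is_inexact(float a, float b)
{
    volatile float s = a + b;
    volatile float bv = s - a;
    volatile float av = s - bv;
    volatile float err = (a - av) + (b - bv);
    return err != 0.0f;
}

/* Replay all pending records and fold their flags into exceptions. */
static void fp_cache_flush(FpFlagCache *c)
{
    for (int i = 0; i < c->len; i++) {
        const FpRecord *r = &c->rec[i];
        float b = (r->op == FP_OP_SUB) ? -r->b : r->b;
        if (add_is_inexact(r->a, b)) {
            c->exceptions |= FP_FLAG_INEXACT;
        }
    }
    c->len = 0;
}

/* Steps 1 and 4: record the op, flushing first if the cache is full. */
float fp_cache_exec(FpFlagCache *c, uint8_t op, float a, float b)
{
    if (c->len == FP_CACHE_SIZE) {
        fp_cache_flush(c);
    }
    assert(c->len < FP_CACHE_SIZE);
    c->rec[c->len++] = (FpRecord){ op, a, b };
    return (op == FP_OP_SUB) ? a - b : a + b;   /* host FPU result */
}

/* Step 2: a read of the flags forces the replay. */
uint32_t fp_cache_read_flags(FpFlagCache *c)
{
    fp_cache_flush(c);
    return c->exceptions;
}

/* Step 3: clearing the flags discards pending records. */
void fp_cache_clear_flags(FpFlagCache *c)
{
    c->len = 0;
    c->exceptions = 0;
}
```

Whether this pays off depends on how expensive the replay really is and
how often the guest reads or clears the flags, which would need to be
measured.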



On Thu, Apr 30, 2020 at 10:18 AM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 4/29/20 5:20 PM, 罗勇刚(Yonggang Luo) wrote:
> > Question, in hard-float, if we don't want to read the fp register.
> > for example: If we wanna compute c = a + b in fp32
> > if c = a + b In hard float
> > and if b1 = c - a in hard float
> > if b1 != b at bitwise level, the we se the inexat to 1, otherwsie
> > we set inexat bit to 0? are this valid?
> >
> > we can also do it for a * b, a - b, a / b.
> >
>
> That does seem plausible, for all of the normal values for which we would
> apply
> the hard-float optimization anyway.  But we already check for the
> exceptional
> cases:
>
>     if (unlikely(f32_is_inf(ur))) {
>         s->float_exception_flags |= float_flag_overflow;
>     } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) {
>         if (post == NULL || post(ua, ub)) {
>             goto soft;
>         }
>     }
>
>
> r~
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo

[-- Attachment #2: Type: text/html, Size: 2735 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-04-28  8:36         ` Alex Bennée
  2020-04-28 14:29           ` 罗勇刚(Yonggang Luo)
  2020-04-29 10:17           ` R: " Dino Papararo
@ 2020-04-30 15:16           ` BALATON Zoltan
  2020-04-30 18:59             ` Alex Bennée
  2 siblings, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-04-30 15:16 UTC (permalink / raw)
  To: Alex Bennée; +Cc: luoyonggang, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2539 bytes --]

On Tue, 28 Apr 2020, Alex Bennée wrote:
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>> I am confusing why only  inexact  are set then we can use hard-float.
>
> The inexact behaviour of the host hardware may be different from the
> guest architecture we are trying to emulate and the host hardware may
> not be configurable to emulate the guest mode.
>
> Have a look in softfloat.c and see all the places where
> float_flag_inexact is set. Can you convince yourself that the host
> hardware will do the same?

Can you convince me that it won't? This all seems to be guessing without 
evidence so I think what we need first is some tests to prove it either 
way. Such tests could then also be used at runtime to decide if the host 
and guest FPU are compatible enough to enable hardfloat. Are such tests 
available somewhere or what would need to be done to implement them?

This may not solve the problem with the PPC target's non-cumulative status
bits but could improve hardfloat performance at least for some host-guest
combinations. To see if it is worth the effort we should run such tests on
common combinations (say x86_64, ARM and PPC hosts with at least these
guests).

>> And PPC always clearing inexact  flag before calling to soft-float
>> funcitons. so we can not
>> optimize it with hard-float.
>> I need some resouces about ineact flag and why always clearing inexcat in
>> PPC FP simualtion.
>
> Because that is the behaviour of the PPC floating point unit. The
> inexact flag will represent the last operation done.

More precisely, in addition to the usual cumulative (or sticky) bits there
are two non-sticky bits for inexact and rounded (the latter of which is not
emulated), and these currently require clearing the FP status before every
FP op. I wonder if we can know when the guest reads these and rerun the
last FP op in softfloat to compute them only when they are read; then it's
enough to remember the last FP op. This could be relatively simple and may
be usable even if we can't detect access to the individual bits within
FPSCR, only access to FPSCR as a whole, since most guest code likely does
not check it and cross-platform code won't check PPC-specific non-sticky
bits, so I'd expect most guest code to be fine with hardfloat. But what
about FP exceptions? We also need to revert to softfloat if FP exceptions
are enabled, so maybe using host FP exceptions to manage the status bits
could be the way to go: let the hardware manage this so we don't need to
implement everything in software.
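
The "remember the last FP op and rerun it only when the status is read"
idea above could look roughly like this. It is a sketch under
assumptions, not QEMU code: LastFpOp, lazy_fmul and lazy_last_op_inexact
are hypothetical names, only multiplication is covered, and the exact
product computed in double stands in for a softfloat re-run. The product
of two fp32 values is exactly representable in double, so narrowing it
back to fp32 shows whether the fp32 multiply had to round.

```c
#include <assert.h>
#include <stdbool.h>

enum { LAST_OP_NONE, LAST_OP_MUL };   /* only multiply in this sketch */

typedef struct LastFpOp {
    int op;
    float a;
    float b;
} LastFpOp;

/* Fast path: run on the host FPU and only remember the operands. */
float lazy_fmul(LastFpOp *last, float a, float b)
{
    last->op = LAST_OP_MUL;
    last->a = a;
    last->b = b;
    return a * b;
}

/*
 * Slow path, meant to run only when the guest reads the status register.
 * Returns the non-sticky "last operation was inexact" state.
 */
bool lazy_last_op_inexact(const LastFpOp *last)
{
    assert(last->op == LAST_OP_NONE || last->op == LAST_OP_MUL);
    if (last->op == LAST_OP_NONE) {
        return false;
    }
    /* At most 48 significant bits, so this product is exact in double. */
    volatile double exact = (double)last->a * (double)last->b;
    if (exact != exact) {
        return false;                 /* NaN result: not an inexact case */
    }
    volatile float rounded = (float)exact;
    return (double)rounded != exact;
}
```

Because only the most recent operands are kept, the fast path costs two
stores per operation; the open question from the thread, whether FPSCR
reads are rare enough in real guest code, is not answered by this sketch.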

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: R: R: About hardfloat in ppc
  2020-04-30  8:13                       ` 罗勇刚(Yonggang Luo)
@ 2020-04-30 15:35                         ` BALATON Zoltan
  2020-04-30 16:34                           ` R: " Dino Papararo
  0 siblings, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-04-30 15:35 UTC (permalink / raw)
  To: 罗勇刚(Yonggang Luo)
  Cc: Alex Bennée, Richard Henderson, qemu-devel, Programmingkid,
	qemu-ppc, Howard Spoelstra, Dino Papararo

[-- Attachment #1: Type: text/plain, Size: 2328 bytes --]

On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> I propose a new way to computing the float flags,
> We preserve a  float computing cash
> typedef struct FpRecord {
>  uint8_t op;
>  float32 A;
>  float32 B;
> }  FpRecord;
> FpRecord fp_cache[1024];
> int fp_cache_length;
> uint32_t fp_exceptions;
>
> 1. For each new fp operation we push it to the  fp_cache,
> 2. Once we read the fp_exceptions , then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.
> and clear  fp_cache_length.
> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
> clear  fp_exceptions.
> 4. If the  fp_cache are full, then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.
>
> Would this be a general method to use hard-float?
> The consued time should be  2*hard_float.
> Considerating read fp_exceptions are rare, then the amortized time
> complexity
> would be 1 * hard_float.

It's hard to guess what the hit rate of such cache would be and if it's 
low then managing the cache is probably more expensive than running with 
softfloat. So to evaluate any proposed patch we also need some benchmarks 
which we can experiment with to tell if the results are good or not 
otherwise we're just guessing. Are there some existing tests and 
benchmarks that we can use? Alex mentioned fp-bench I think and to 
evaluate the correctness of the FP implementation I've seen this other 
conversation:

https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html

Is that something we can use for PPC as well to check the correctness?

So I think before implementing any potential solution that came up in this 
brainstorming the first step would be to get and compile (or write if not 
available) some tests and benchmarks:

1. testing host behaviour for inexact and compare that for different archs
2. some FP tests that can be used to compare results with QEMU and real 
CPU to check correctness of emulation (if these check for inexact 
differences then could be used instead of 1.)
3. some benchmarks to evaluate QEMU performance (these could be same as FP 
tests or some real world FP heavy applications).

Then we can see if the proposed solution is faster and still correct.

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* R: R: R: About hardfloat in ppc
  2020-04-30 15:35                         ` BALATON Zoltan
@ 2020-04-30 16:34                           ` Dino Papararo
  2020-05-01  1:59                             ` Programmingkid
  0 siblings, 1 reply; 40+ messages in thread
From: Dino Papararo @ 2020-04-30 16:34 UTC (permalink / raw)
  To: BALATON Zoltan, 罗勇刚(Yonggang Luo)
  Cc: Richard Henderson, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Alex Bennée

Maybe the fastest way to implement hardfloat for ppc would be to run FP instructions in hardfloat by default until some instruction reads the FPSCR register.
At that point the guest probably wants to check for an exception, so QEMU could go back to the last FPU instruction executed and re-execute it in softfloat, this time taking care of the FPSCR flags, then continue in hardfloat until another instruction reads the FPSCR, and so on.

Dino

-----Original Message-----
From: BALATON Zoltan <balaton@eik.bme.hu> 
Sent: Thursday, 30 April 2020 17:36
To: 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>; Dino Papararo <skizzato73@msn.com>; qemu-devel@nongnu.org; Programmingkid <programmingkidx@gmail.com>; qemu-ppc@nongnu.org; Howard Spoelstra <hsp.cat7@gmail.com>; Alex Bennée <alex.bennee@linaro.org>
Subject: Re: R: R: About hardfloat in ppc

On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> I propose a new way to computing the float flags,
> We preserve a  float computing cash
> typedef struct FpRecord {
>  uint8_t op;
>  float32 A;
>  float32 B;
> }  FpRecord;
> FpRecord fp_cache[1024];
> int fp_cache_length;
> uint32_t fp_exceptions;
>
> 1. For each new fp operation we push it to the  fp_cache,
> 2. Once we read the fp_exceptions , then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.
> and clear  fp_cache_length.
> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
> clear  fp_exceptions.
> 4. If the  fp_cache are full, then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.
>
> Would this be a general method to use hard-float?
> The consued time should be  2*hard_float.
> Considerating read fp_exceptions are rare, then the amortized time
> complexity
> would be 1 * hard_float.

It's hard to guess what the hit rate of such cache would be and if it's low then managing the cache is probably more expensive than running with softfloat. So to evaluate any proposed patch we also need some benchmarks which we can experiment with to tell if the results are good or not otherwise we're just guessing. Are there some existing tests and benchmarks that we can use? Alex mentioned fp-bench I think and to evaluate the correctness of the FP implementation I've seen this other
conversation:

https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html

Is that something we can use for PPC as well to check the correctness?

So I think before implementing any potential solution that came up in this brainstorming the first step would be to get and compile (or write if not available) some tests and benchmarks:

1. testing host behaviour for inexact and compare that for different archs
2. some FP tests that can be used to compare results with QEMU and real CPU to check correctness of emulation (if these check for inexact differences then could be used instead of 1.)
3. some benchmarks to evaluate QEMU performance (these could be same as FP tests or some real world FP heavy applications).

Then we can see if the proposed solution is faster and still correct.

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-04-30 15:16           ` BALATON Zoltan
@ 2020-04-30 18:59             ` Alex Bennée
  2020-04-30 20:17               ` BALATON Zoltan
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Bennée @ 2020-04-30 18:59 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: luoyonggang, Emilio G . Cota, qemu-ppc, qemu-devel


BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Tue, 28 Apr 2020, Alex Bennée wrote:
>> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>>> I am confusing why only  inexact  are set then we can use hard-float.
>>
>> The inexact behaviour of the host hardware may be different from the
>> guest architecture we are trying to emulate and the host hardware may
>> not be configurable to emulate the guest mode.
>>
>> Have a look in softfloat.c and see all the places where
>> float_flag_inexact is set. Can you convince yourself that the host
>> hardware will do the same?
>
> Can you convince me that it won't? This all seems to be guessing
> without evidence so I think what we need first is some tests to prove
> it either way. Such tests could then also be used at runtime to decide
> if the host and guest FPU are compatible enough to enable hardfloat.
> Are such tests available somewhere or what would need to be done to
> implement them?

I seem to recall it comes down to the various approaches that FPUs can
take when dealing with tiny numbers when rounding. Emilio did the
original work so I've CC'd him. The original paper is referenced in the
hardfloat commentary:

 Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a
 binary translator." Software: Practice and Experience 46.12 (2016):1591-1615.

which is worth a read if you can get hold of it.

Running tests on start up is not without precedent. We have a
softfloat_init which checks for a broken FMA implementation. However I'd
caution about adding too many checks in there.

> This may not solve the problem with PPC target with non-cumulative
> status bits but could improve hardfloat performance at least for some
> host-guest combinations. To see if it worth the effort we should run
> such test on common combinations (say x86_64. ARM and PPC hosts with
> at least these guests).

We already enable hardfloat for all hosts apart from PPC and FAST_MATHS.

>>> And PPC always clearing inexact  flag before calling to soft-float
>>> funcitons. so we can not
>>> optimize it with hard-float.
>>> I need some resouces about ineact flag and why always clearing inexcat in
>>> PPC FP simualtion.
>>
>> Because that is the behaviour of the PPC floating point unit. The
>> inexact flag will represent the last operation done.
>
> More precisely additional to the usual cumulative (or sticky) bits
> there are two non-sticky bits for inexact and rounded (latter of which
> is not emulated) that currently need clearing FP status before every
> FP op.

Thanks for the clarification.

> I wonder if we can know when the guest reads these and rerun
> the last FP op in softfloat to compute them only if these are read,
> then it's enough to remember the last FP op. This could be relatively
> simple and may be used even if we don't detect accessing the bits
> within FPSCR just accessing the FPSCR as likely most guest code does
> not check that and any cross-platform code won't check PPC specific
> non-sticky bits so I'd exepect most guest code to be fine with
> hardfloat.

You could go further: if you know nothing in a block can fault, you can
skip the calculation overhead of the per-op flags for all but the last
op in the block.

> Although what about FP exceptions? We also need to revert
> to softfloat it FP exceptions are enabled so maybe using host FP
> exception for managing status bits could be the way to go to let
> hardware manage this and we don't need to implement everything in 
> software.

Well, apart from inexact handling (which would fault as soon as it is
set), all other exception types are detected before we pass the operation
to hardfloat anyway. Given the range of NaN types we would have to
post-process any hardfloat operation anyway to give the right NaN.

>
> Regards,
> BALATON Zoltan


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-04-30 18:59             ` Alex Bennée
@ 2020-04-30 20:17               ` BALATON Zoltan
  0 siblings, 0 replies; 40+ messages in thread
From: BALATON Zoltan @ 2020-04-30 20:17 UTC (permalink / raw)
  To: Alex Bennée; +Cc: luoyonggang, Emilio G . Cota, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5540 bytes --]

On Thu, 30 Apr 2020, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> On Tue, 28 Apr 2020, Alex Bennée wrote:
>>> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>>>> I am confusing why only  inexact  are set then we can use hard-float.
>>>
>>> The inexact behaviour of the host hardware may be different from the
>>> guest architecture we are trying to emulate and the host hardware may
>>> not be configurable to emulate the guest mode.
>>>
>>> Have a look in softfloat.c and see all the places where
>>> float_flag_inexact is set. Can you convince yourself that the host
>>> hardware will do the same?
>>
>> Can you convince me that it won't? This all seems to be guessing
>> without evidence so I think what we need first is some tests to prove
>> it either way. Such tests could then also be used at runtime to decide
>> if the host and guest FPU are compatible enough to enable hardfloat.
>> Are such tests available somewhere or what would need to be done to
>> implement them?
>
> I seem to recall it comes down to the various approaches that FPUs can
> take when dealing with tiny numbers when rounding. Emilio did the
> original work so I've CC'd him. The original paper is referenced in the
> hardfloat commentary:
>
> Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a
> binary translator." Software: Practice and Experience 46.12 (2016):1591-1615.
>
> which is worth a read if you can get hold of it.
>
> Running tests on start up is not without precedent. We have a
> softfloat_init which checks for a broken FMA implementation. However I'd
> caution about adding too many checks in there.

Sure, the runtime check should be quick, so the likely approach would be
to write detailed tests to profile different FPU implementations and then
include only one quick check to tell at runtime whether we're running on a
known good host. Maybe someone who knows the different FPUs can tell this
without tests, but I don't know, and finding out from docs seems more work
than determining it empirically by testing. Does someone have some hints
on what operations should be tested to check for different inexact
handling in different FPUs?

>> This may not solve the problem with PPC target with non-cumulative
>> status bits but could improve hardfloat performance at least for some
>> host-guest combinations. To see if it worth the effort we should run
>> such test on common combinations (say x86_64. ARM and PPC hosts with
>> at least these guests).
>
> We already enable hardfloat for all hosts apart from PPC and FAST_MATHS.

Only if inexact is already set, which may be common, but not using
softfloat at all when the host's implementation is good enough for the
guest could be even faster.

>>>> And PPC always clearing inexact  flag before calling to soft-float
>>>> funcitons. so we can not
>>>> optimize it with hard-float.
>>>> I need some resouces about ineact flag and why always clearing inexcat in
>>>> PPC FP simualtion.
>>>
>>> Because that is the behaviour of the PPC floating point unit. The
>>> inexact flag will represent the last operation done.
>>
>> More precisely additional to the usual cumulative (or sticky) bits
>> there are two non-sticky bits for inexact and rounded (latter of which
>> is not emulated) that currently need clearing FP status before every
>> FP op.
>
> Thanks for the clarification.
>
>> I wonder if we can know when the guest reads these and rerun
>> the last FP op in softfloat to compute them only if these are read,
>> then it's enough to remember the last FP op. This could be relatively
>> simple and may be used even if we don't detect accessing the bits
>> within FPSCR just accessing the FPSCR as likely most guest code does
>> not check that and any cross-platform code won't check PPC specific
>> non-sticky bits so I'd exepect most guest code to be fine with
>> hardfloat.
>
> You could go further if you know nothing in a block can fault you can
> skip the calculation overhead of the per-op flags for all but the last
> op in the block.

I think that's an additional optimisation that could be done once the
simple case of just rerunning the last op when flags are accessed works:
keep complexity low first, then try a more complex solution. (Although I'm
not planning to do this myself, so whatever complexity whoever implements
it can handle is fine, but less complexity means fewer bugs so I'd go for
simple first.)

>> Although what about FP exceptions? We also need to revert
>> to softfloat it FP exceptions are enabled so maybe using host FP
>> exception for managing status bits could be the way to go to let
>> hardware manage this and we don't need to implement everything in
>> software.
>
> Well for all apart from inexact handling (which would fault as soon as
> set) all other exception types are detected before we pass them to
> hardfloat anyway. Given the range of NaN types we would have to post
> process any hardfloat operation anyway to give the right NaN.

Is checking for those exceptions beforehand really needed? Wouldn't it be 
easier to install an exception handler and let the hardware do those 
checks? If this is again done because of FPU implementation differences, 
but inexact is determined by looking at the FP status (that's why it's 
cleared on PPC), then that means we always use the host's inexact semantics 
and don't emulate the guest correctly anyway, so we can skip the tests 
above. Then why can't we install an exception handler and set guest bits 
whenever that's raised?

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-04-30 16:34                           ` R: " Dino Papararo
@ 2020-05-01  1:59                             ` Programmingkid
  2020-05-01  2:21                               ` 罗勇刚(Yonggang Luo)
  0 siblings, 1 reply; 40+ messages in thread
From: Programmingkid @ 2020-05-01  1:59 UTC (permalink / raw)
  To: Dino Papararo
  Cc: Richard Henderson, qemu-devel,
	"罗勇刚(Yonggang Luo)",
	qemu-ppc, Howard Spoelstra, Alex Bennée


> On Apr 30, 2020, at 12:34 PM, Dino Papararo <skizzato73@msn.com> wrote:
> 
> Maybe the fastest way to implement hardfloat for ppc would be to run it by default until some FPU instruction requests the FPSCR register.
> At that point we probably want to check for some exception, so QEMU could go back to the last FPU instruction executed and re-execute it in softfloat, this time taking care of the FPSCR flags, then continue in hardfloat until another instruction reads the FPSCR register, and so on.
> 
> Dino

That sounds like a good idea.

> -----Original Message-----
> From: BALATON Zoltan <balaton@eik.bme.hu>
> Sent: Thursday, 30 April 2020 17:36
> To: 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>; Dino Papararo <skizzato73@msn.com>; qemu-devel@nongnu.org; Programmingkid <programmingkidx@gmail.com>; qemu-ppc@nongnu.org; Howard Spoelstra <hsp.cat7@gmail.com>; Alex Bennée <alex.bennee@linaro.org>
> Subject: Re: R: R: About hardfloat in ppc
> 
> On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
>> I propose a new way of computing the float flags. We preserve a float
>> computing cache:
>> typedef struct FpRecord {
>>   uint8_t op;
>>   float32 A;
>>   float32 B;
>> } FpRecord;
>> FpRecord fp_cache[1024];
>> int fp_cache_length;
>> uint32_t fp_exceptions;
>> 
>> 1. For each new fp operation we push it to the fp_cache.
>> 2. Once we read fp_exceptions, we re-compute fp_exceptions by
>> re-running the FpRecord sequence, and clear fp_cache_length.
>> 3. If we clear fp_exceptions, we set fp_cache_length to 0
>> and clear fp_exceptions.
>> 4. If the fp_cache is full, we re-compute fp_exceptions by
>> re-running the FpRecord sequence.
>> 
>> Would this be a general method to use hard-float?
>> The consumed time should be 2 * hard_float.
>> Considering that reads of fp_exceptions are rare, the amortized time
>> complexity would be 1 * hard_float.
> 
> It's hard to guess what the hit rate of such cache would be and if it's low then managing the cache is probably more expensive than running with softfloat. So to evaluate any proposed patch we also need some benchmarks which we can experiment with to tell if the results are good or not otherwise we're just guessing. Are there some existing tests and benchmarks that we can use? Alex mentioned fp-bench I think and to evaluate the correctness of the FP implementation I've seen this other
> conversation:
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
> https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html
> 
> Is that something we can use for PPC as well to check the correctness?
> 
> So I think before implementing any potential solution that came up in this brainstorming the first step would be to get and compile (or write if not
> available) some tests and benchmarks:
> 
> 1. testing host behaviour for inexact and compare that for different archs
> 2. some FP tests that can be used to compare results with QEMU and real CPU to check correctness of emulation (if these check for inexact differences then could be used instead of 1.)
> 3. some benchmarks to evaluate QEMU performance (these could be same as FP tests or some real world FP heavy applications).
> 
> Then we can see if the proposed solution is faster and still correct.
> 
> Regards,
> BALATON Zoltan



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-05-01  1:59                             ` Programmingkid
@ 2020-05-01  2:21                               ` 罗勇刚(Yonggang Luo)
  2020-05-01 11:58                                 ` BALATON Zoltan
  0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01  2:21 UTC (permalink / raw)
  To: Programmingkid
  Cc: Alex Bennée, Richard Henderson, qemu-devel, qemu-ppc,
	Howard Spoelstra, Dino Papararo

[-- Attachment #1: Type: text/plain, Size: 4633 bytes --]

That's what I suggested.
We preserve a float computing cache:
typedef struct FpRecord {
  uint8_t op;
  float32 A;
  float32 B;
} FpRecord;
FpRecord fp_cache[1024];
int fp_cache_length;
uint32_t fp_exceptions;

1. For each new fp operation we push it to the fp_cache.
2. Once we read fp_exceptions, we re-compute fp_exceptions by
   re-running the FpRecord sequence, and clear fp_cache_length.
3. If we clear fp_exceptions, we set fp_cache_length to 0 and
   clear fp_exceptions.
4. If the fp_cache is full, we re-compute fp_exceptions by
   re-running the FpRecord sequence.

Now the key point is how to track reads and writes of the FPSCR register.
The current code is:
    cpu_fpscr = tcg_global_mem_new(cpu_env,
                                   offsetof(CPUPPCState, fpscr), "fpscr");
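
The four cache steps above can be sketched in standalone C roughly as
follows. This is only an illustration under assumptions, not QEMU code:
plain C float stands in for float32, only multiply is modelled, the
softfloat re-run is replaced by a stand-in that detects inexact for finite,
in-range values, and the names FP_OP_MUL, FLAG_INEXACT, soft_flags, fp_replay,
fp_mul, fp_read_exceptions and fp_clear_exceptions are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; none of these are QEMU definitions. */
enum { FP_OP_MUL = 1 };
#define FLAG_INEXACT 1u
#define FP_CACHE_SIZE 1024

typedef struct FpRecord {
    uint8_t op;
    float a;    /* plain C float stands in for QEMU's float32 */
    float b;
} FpRecord;

static FpRecord fp_cache[FP_CACHE_SIZE];
static int fp_cache_length;
static uint32_t fp_exceptions;

/* Stand-in for the softfloat re-run of one record. The product of two
 * finite floats is exact in double, so for in-range values the result is
 * inexact exactly when rounding that product to float changes it.
 * Overflow, underflow and NaN flags are not modelled here. */
static uint32_t soft_flags(const FpRecord *r)
{
    assert(r->op == FP_OP_MUL);
    double exact = (double)r->a * (double)r->b;
    float rounded = (float)exact;
    return ((double)rounded != exact) ? FLAG_INEXACT : 0;
}

/* Fold the flags of all cached ops into fp_exceptions, then empty the cache. */
static void fp_replay(void)
{
    for (int i = 0; i < fp_cache_length; i++) {
        fp_exceptions |= soft_flags(&fp_cache[i]);
    }
    fp_cache_length = 0;
}

/* Step 1 (and step 4 when the cache is full): record the op, then compute
 * the result with host arithmetic. */
float fp_mul(float a, float b)
{
    if (fp_cache_length == FP_CACHE_SIZE) {
        fp_replay();
    }
    fp_cache[fp_cache_length++] = (FpRecord){ FP_OP_MUL, a, b };
    return a * b;
}

/* Step 2: reading the flags forces the replay. */
uint32_t fp_read_exceptions(void)
{
    fp_replay();
    return fp_exceptions;
}

/* Step 3: clearing drops both the flags and the pending records. */
void fp_clear_exceptions(void)
{
    fp_cache_length = 0;
    fp_exceptions = 0;
}
```

In this model the sketch also empties the cache after the replay in step 4,
which the list above does not state explicitly but seems to be implied.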

On Fri, May 1, 2020 at 9:59 AM Programmingkid <programmingkidx@gmail.com>
wrote:

>
> > On Apr 30, 2020, at 12:34 PM, Dino Papararo <skizzato73@msn.com> wrote:
> >
> > Maybe the fastest way to implement hardfloats for ppc could be run them
> by default and until some fpu instruction request for FPSCR register.
> > At this time probably we want to check for some exception.. so QEMU
> could come back to last fpu instruction executed and re-execute it in
> softfloat taking care this time of FPSCR flags, then continue in hardfloats
> unitl another instruction looking for FPSCR register and so on..
> >
> > Dino
>
> That sounds like a good idea.
>
> > -----Original Message-----
> > From: BALATON Zoltan <balaton@eik.bme.hu>
> > Sent: Thursday, 30 April 2020 17:36
> > To: 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com>
> > Cc: Richard Henderson <richard.henderson@linaro.org>; Dino Papararo <
> skizzato73@msn.com>; qemu-devel@nongnu.org; Programmingkid <
> programmingkidx@gmail.com>; qemu-ppc@nongnu.org; Howard Spoelstra <
> hsp.cat7@gmail.com>; Alex Bennée <alex.bennee@linaro.org>
> > Subject: Re: R: R: About hardfloat in ppc
> >
> > On Thu, 30 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> >> I propose a new way of computing the float flags. We preserve a float
> >> computing cache:
> >> typedef struct FpRecord {
> >>   uint8_t op;
> >>   float32 A;
> >>   float32 B;
> >> } FpRecord;
> >> FpRecord fp_cache[1024];
> >> int fp_cache_length;
> >> uint32_t fp_exceptions;
> >>
> >> 1. For each new fp operation we push it to the fp_cache.
> >> 2. Once we read fp_exceptions, we re-compute fp_exceptions by
> >> re-running the FpRecord sequence, and clear fp_cache_length.
> >> 3. If we clear fp_exceptions, we set fp_cache_length to 0
> >> and clear fp_exceptions.
> >> 4. If the fp_cache is full, we re-compute fp_exceptions by
> >> re-running the FpRecord sequence.
> >>
> >> Would this be a general method to use hard-float?
> >> The consumed time should be 2 * hard_float.
> >> Considering that reads of fp_exceptions are rare, the amortized time
> >> complexity would be 1 * hard_float.
> >
> > It's hard to guess what the hit rate of such cache would be and if it's
> low then managing the cache is probably more expensive than running with
> softfloat. So to evaluate any proposed patch we also need some benchmarks
> which we can experiment with to tell if the results are good or not
> otherwise we're just guessing. Are there some existing tests and benchmarks
> that we can use? Alex mentioned fp-bench I think and to evaluate the
> correctness of the FP implementation I've seen this other
> > conversation:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05107.html
> > https://lists.nongnu.org/archive/html/qemu-devel/2020-04/msg05126.html
> >
> > Is that something we can use for PPC as well to check the correctness?
> >
> > So I think before implementing any potential solution that came up in
> this brainstorming the first step would be to get and compile (or write if
> not
> > available) some tests and benchmarks:
> >
> > 1. testing host behaviour for inexact and compare that for different
> > archs
> > 2. some FP tests that can be used to compare results with QEMU and
> > real CPU to check correctness of emulation (if these check for inexact
> > differences then could be used instead of 1.)
> > 3. some benchmarks to evaluate QEMU performance (these could be same
> > as FP tests or some real world FP heavy applications).
> >
> > Then we can see if the proposed solution is faster and still correct.
> >
> > Regards,
> > BALATON Zoltan
>
>

-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo

[-- Attachment #2: Type: text/html, Size: 6819 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-05-01  2:21                               ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 11:58                                 ` BALATON Zoltan
  2020-05-01 12:04                                   ` 罗勇刚(Yonggang Luo)
  0 siblings, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-05-01 11:58 UTC (permalink / raw)
  To: 罗勇刚(Yonggang Luo)
  Cc: Dino Papararo, Richard Henderson, qemu-devel, Programmingkid,
	qemu-ppc, Howard Spoelstra, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 1752 bytes --]

On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> That's what I suggested,
> We preserve a  float computing cache
> typedef struct FpRecord {
>  uint8_t op;
>  float32 A;
>  float32 B;
> }  FpRecord;
> FpRecord fp_cache[1024];
> int fp_cache_length;
> uint32_t fp_exceptions;
>
> 1. For each new fp operation we push it to the  fp_cache,
> 2. Once we read the fp_exceptions , then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.
> and clear  fp_cache_length.

Why do you need to store more than the last fp op? The cumulative bits can 
be tracked like it's done for other targets by not clearing fp_status then 
you can read it from there. Only the non-sticky FI bit needs to be 
computed but that's only determined by the last op so it's enough to 
remember that and run that with softfloat (or even hardfloat after 
clearing status but softfloat may be faster for this) to get the bits for 
last op when status is read.
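
A minimal standalone sketch of that single-entry variant, under the
assumption that the sticky bits are accumulated elsewhere and are not
modelled here. Plain C float stands in for float32, only multiply is
covered, and the names LastFpOp, fast_mul and read_fi are hypothetical,
not existing QEMU functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical op codes for illustration; not QEMU definitions. */
enum { OP_NONE = 0, OP_MUL };

/* Only the most recent operation is remembered. */
typedef struct {
    uint8_t op;
    float a;
    float b;
} LastFpOp;

static LastFpOp last_fp_op;

/* Fast path: overwrite the single record, then compute on the host. */
float fast_mul(float a, float b)
{
    last_fp_op = (LastFpOp){ OP_MUL, a, b };
    return a * b;
}

/* Slow path, taken only when the guest reads the status: recompute the
 * non-sticky FI bit for the last op. The product of two finite floats is
 * exact in double, so for in-range values FI is set exactly when rounding
 * that product to float changes it. */
bool read_fi(void)
{
    if (last_fp_op.op == OP_NONE) {
        return false;
    }
    assert(last_fp_op.op == OP_MUL);
    double exact = (double)last_fp_op.a * (double)last_fp_op.b;
    float rounded = (float)exact;
    return (double)rounded != exact;
}
```

Because each new op overwrites the record, an inexact multiply followed by
an exact one reads back FI clear, which is the non-sticky behaviour
described above.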

> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
> clear  fp_exceptions.
> 4. If the  fp_cache are full, then we re-compute
> the fp_exceptions by re-running the fp FpRecord sequence.

All this cache management and more than one element seems unnecessary to 
me although I may be missing something.

> Now the keypoint is how to tracking the read and write of FPSCR register,
> The current code are
>    cpu_fpscr = tcg_global_mem_new(cpu_env,
>                                   offsetof(CPUPPCState, fpscr), "fpscr");

Maybe you could search where the value is read which should be the places 
where we need to handle it but changes may be needed to make a clear API 
for this between target/ppc, TCG and softfloat which likely does not 
exist yet.

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-05-01 11:58                                 ` BALATON Zoltan
@ 2020-05-01 12:04                                   ` 罗勇刚(Yonggang Luo)
  2020-05-01 13:10                                     ` Alex Bennée
  0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 12:04 UTC (permalink / raw)
  To: BALATON Zoltan
  Cc: Dino Papararo, Richard Henderson, qemu-devel, Programmingkid,
	qemu-ppc, Howard Spoelstra, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 2240 bytes --]

On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:

> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> > That's what I suggested,
> > We preserve a  float computing cache
> > typedef struct FpRecord {
> >  uint8_t op;
> >  float32 A;
> >  float32 B;
> > }  FpRecord;
> > FpRecord fp_cache[1024];
> > int fp_cache_length;
> > uint32_t fp_exceptions;
> >
> > 1. For each new fp operation we push it to the  fp_cache,
> > 2. Once we read the fp_exceptions , then we re-compute
> > the fp_exceptions by re-running the fp FpRecord sequence.
> > and clear  fp_cache_length.
>
> Why do you need to store more than the last fp op? The cumulative bits can
> be tracked like it's done for other targets by not clearing fp_status then
> you can read it from there. Only the non-sticky FI bit needs to be
> computed but that's only determined by the last op so it's enough to
> remember that and run that with softfloat (or even hardfloat after
> clearing status but softfloat may be faster for this) to get the bits for
> last op when status is read.
>
Yes, storing only the last fp op is also an option. Do you mean that we
store the last fp op
and calculate its flags only when necessary?  I am thinking about a general
fp optimization method that suits
all targets.

>
> > 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
> > clear  fp_exceptions.
> > 4. If the  fp_cache are full, then we re-compute
> > the fp_exceptions by re-running the fp FpRecord sequence.
>
> All this cache management and more than one element seems unnecessary to
> me although I may be missing something.
>
> > Now the keypoint is how to tracking the read and write of FPSCR register,
> > The current code are
> >    cpu_fpscr = tcg_global_mem_new(cpu_env,
> >                                   offsetof(CPUPPCState, fpscr), "fpscr");
>
> Maybe you could search where the value is read which should be the places
> where we need to handle it but changes may be needed to make a clear API
> for this between target/ppc, TCG and softfloat which likely does not
> exist yet.
>
> Regards,
> BALATON Zoltan



-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo

[-- Attachment #2: Type: text/html, Size: 3016 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-05-01 12:04                                   ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 13:10                                     ` Alex Bennée
  2020-05-01 13:39                                       ` BALATON Zoltan
  2020-05-01 14:18                                       ` Richard Henderson
  0 siblings, 2 replies; 40+ messages in thread
From: Alex Bennée @ 2020-05-01 13:10 UTC (permalink / raw)
  To: luoyonggang
  Cc: Richard Henderson, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Dino Papararo


罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:

> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>
>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>> > That's what I suggested,
>> > We preserve a  float computing cache
>> > typedef struct FpRecord {
>> >  uint8_t op;
>> >  float32 A;
>> >  float32 B;
>> > }  FpRecord;
>> > FpRecord fp_cache[1024];
>> > int fp_cache_length;
>> > uint32_t fp_exceptions;
>> >
>> > 1. For each new fp operation we push it to the  fp_cache,
>> > 2. Once we read the fp_exceptions , then we re-compute
>> > the fp_exceptions by re-running the fp FpRecord sequence.
>> > and clear  fp_cache_length.
>>
>> Why do you need to store more than the last fp op? The cumulative bits can
>> be tracked like it's done for other targets by not clearing fp_status then
>> you can read it from there. Only the non-sticky FI bit needs to be
>> computed but that's only determined by the last op so it's enough to
>> remember that and run that with softfloat (or even hardfloat after
>> clearing status but softfloat may be faster for this) to get the bits for
>> last op when status is read.
>>
> Yeap, store only the last fp op is also an option. Do you means that store
> the last fp op,
> and calculate it when necessary?  I am thinking about a general fp
> optmize method that suite
> for all target.

I think that's getting a little ahead of yourself. Let's prove the
technique is valuable for PPC (given it has the most to gain). We can
always generalise later if it's worthwhile.

Rather than creating a new structure I would suggest creating 3 new tcg
globals (op, inA, inB) and re-factor the front-end code so each FP op
loaded the TCG globals. The TCG optimizer should pick up aliased loads
and automatically eliminate the dead ones. We might need some new
machinery for the TCG to avoid spilling the values over potentially
faulting loads/stores but that is likely a phase 2 problem. 

Next you will want to find places that care about the per-op bits of
cpu_fpscr and call a helper with the new globals to re-run the
computation and feed the values in.

That would give you a reasonable working prototype to start doing some
measurements of overhead and if it makes a difference.

>
>>
>> > 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
>> > clear  fp_exceptions.
>> > 4. If the  fp_cache are full, then we re-compute
>> > the fp_exceptions by re-running the fp FpRecord sequence.
>>
>> All this cache management and more than one element seems unnecessary to
>> me although I may be missing something.
>>
>> > Now the keypoint is how to tracking the read and write of FPSCR register,
>> > The current code are
>> >    cpu_fpscr = tcg_global_mem_new(cpu_env,
>> >                                   offsetof(CPUPPCState, fpscr), "fpscr");
>>
>> Maybe you could search where the value is read which should be the places
>> where we need to handle it but changes may be needed to make a clear API
>> for this between target/ppc, TCG and softfloat which likely does not
>> exist yet.

Once the per-op calculation is fixed in the PPC front-end I think the
only change needed is to remove the #if defined(TARGET_PPC) in
softfloat.c - it's only really there because it avoids the overhead of
checking flags which we always know to be clear in its case.

-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-05-01 13:10                                     ` Alex Bennée
@ 2020-05-01 13:39                                       ` BALATON Zoltan
  2020-05-01 14:01                                         ` Alex Bennée
  2020-05-01 14:18                                       ` Richard Henderson
  1 sibling, 1 reply; 40+ messages in thread
From: BALATON Zoltan @ 2020-05-01 13:39 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Richard Henderson, qemu-devel, Programmingkid, luoyonggang,
	qemu-ppc, Howard Spoelstra, Dino Papararo

[-- Attachment #1: Type: text/plain, Size: 4967 bytes --]

On Fri, 1 May 2020, Alex Bennée wrote:
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>>>> That's what I suggested,
>>>> We preserve a  float computing cache
>>>> typedef struct FpRecord {
>>>>  uint8_t op;
>>>>  float32 A;
>>>>  float32 B;
>>>> }  FpRecord;
>>>> FpRecord fp_cache[1024];
>>>> int fp_cache_length;
>>>> uint32_t fp_exceptions;
>>>>
>>>> 1. For each new fp operation we push it to the  fp_cache,
>>>> 2. Once we read the fp_exceptions , then we re-compute
>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>> and clear  fp_cache_length.
>>>
>>> Why do you need to store more than the last fp op? The cumulative bits can
>>> be tracked like it's done for other targets by not clearing fp_status then
>>> you can read it from there. Only the non-sticky FI bit needs to be
>>> computed but that's only determined by the last op so it's enough to
>>> remember that and run that with softfloat (or even hardfloat after
>>> clearing status but softfloat may be faster for this) to get the bits for
>>> last op when status is read.
>>>
>> Yeap, store only the last fp op is also an option. Do you means that store
>> the last fp op,
>> and calculate it when necessary?  I am thinking about a general fp
>> optmize method that suite
>> for all target.
>
> I think that's getting a little ahead of yourself. Let's prove the
> technique is valuable for PPC (given it has the most to gain). We can
> always generalise later if it's worthwhile.
>
> Rather than creating a new structure I would suggest creating 3 new tcg
> globals (op, inA, inB) and re-factor the front-end code so each FP op
> loaded the TCG globals.

So that's basically wherever you see helper_reset_fpstatus() in target/ppc 
we would need to replace it with saving op and args to globals? Or just 
repurpose this helper to do that. This is called before every fp op but 
not before sub ops within vector ops. Is that correct? Probably it is, as 
vector ops are a single op but how do we detect changes in flags by sub 
ops for those? These might have some existing bugs I think.

> The TCG optimizer should pick up aliased loads
> and automatically eliminate the dead ones. We might need some new
> machinery for the TCG to avoid spilling the values over potentially
> faulting loads/stores but that is likely a phase 2 problem.

I have no idea how to do this or even where to look. Some more detailed 
explanation may be needed here.

> Next you will want to find places that care about the per-op bits of
> cpu_fpscr and call a helper with the new globals to re-run the
> computation and feed the values in.

So the code that cares about these bits is in the guest, thus we would need 
to compute them if we detect the guest accessing them. Detecting when the 
individual bits are accessed might be difficult, so at first we could just 
check whether the fpscr is read and recompute the FI bit before returning 
the value. You previously said these might be needed when fpscr is read or 
when generating exceptions, but I'm not sure where exactly these are done 
for ppc. (I'd expect to have mffpscr but there seem to be different other 
ops instead accessing parts of fpscr which are found in 
target/ppc/fp-impl.inc.c:567 so this would need studying the PPC docs to 
understand how the guest can access the FI bit of fpscr reg.)

> That would give you a reasonable working prototype to start doing some
> measurements of overhead and if it makes a difference.
>
>>
>>>
>>>> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
>>>> clear  fp_exceptions.
>>>> 4. If the  fp_cache are full, then we re-compute
>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>
>>> All this cache management and more than one element seems unnecessary to
>>> me although I may be missing something.
>>>
>>>> Now the keypoint is how to tracking the read and write of FPSCR register,
>>>> The current code are
>>>>    cpu_fpscr = tcg_global_mem_new(cpu_env,
>>>>                                   offsetof(CPUPPCState, fpscr), "fpscr");
>>>
>>> Maybe you could search where the value is read which should be the places
>>> where we need to handle it but changes may be needed to make a clear API
>>> for this between target/ppc, TCG and softfloat which likely does not
>>> exist yet.
>
> Once the per-op calculation is fixed in the PPC front-end I thing the
> only change needed is to remove the #if defined(TARGET_PPC) in
> softfloat.c - it's only really there because it avoids the overhead of
> checking flags which we always know to be clear in it's case.

That's the theory but I've found that removing that define currently makes 
general fp ops slower but vector ops faster so I think there may be some 
bugs that would need to be found and fixed. So testing with some proper 
test suite might be needed.

Regards,
BALATON Zoltan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-05-01 13:39                                       ` BALATON Zoltan
@ 2020-05-01 14:01                                         ` Alex Bennée
  0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-05-01 14:01 UTC (permalink / raw)
  To: BALATON Zoltan
  Cc: Richard Henderson, qemu-devel, Programmingkid, luoyonggang,
	qemu-ppc, Howard Spoelstra, Dino Papararo


BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Fri, 1 May 2020, Alex Bennée wrote:
>> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>>> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>>>>> That's what I suggested,
>>>>> We preserve a  float computing cache
>>>>> typedef struct FpRecord {
>>>>>  uint8_t op;
>>>>>  float32 A;
>>>>>  float32 B;
>>>>> }  FpRecord;
>>>>> FpRecord fp_cache[1024];
>>>>> int fp_cache_length;
>>>>> uint32_t fp_exceptions;
>>>>>
>>>>> 1. For each new fp operation we push it to the  fp_cache,
>>>>> 2. Once we read the fp_exceptions , then we re-compute
>>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>>> and clear  fp_cache_length.
>>>>
>>>> Why do you need to store more than the last fp op? The cumulative bits can
>>>> be tracked like it's done for other targets by not clearing fp_status then
>>>> you can read it from there. Only the non-sticky FI bit needs to be
>>>> computed but that's only determined by the last op so it's enough to
>>>> remember that and run that with softfloat (or even hardfloat after
>>>> clearing status but softfloat may be faster for this) to get the bits for
>>>> last op when status is read.
>>>>
>>> Yeap, store only the last fp op is also an option. Do you means that store
>>> the last fp op,
>>> and calculate it when necessary?  I am thinking about a general fp
>>> optmize method that suite
>>> for all target.
>>
>> I think that's getting a little ahead of yourself. Let's prove the
>> technique is valuable for PPC (given it has the most to gain). We can
>> always generalise later if it's worthwhile.
>>
>> Rather than creating a new structure I would suggest creating 3 new tcg
>> globals (op, inA, inB) and re-factor the front-end code so each FP op
>> loaded the TCG globals.
>
> So that's basically wherever you see helper_reset_fpstatus() in
> target/ppc we would need to replace it with saving op and args to
> globals? Or just repurpose this helper to do that. This is called
> before every fp op but not before sub ops within vector ops. Is that
> correct? Probably it is, as vector ops are a single op but how do we
> detect changes in flags by sub ops for those? These might have some
> existing bugs I think.

I'll defer to the PPC front end experts on this. I'm not familiar with
how it all goes together at all.

>
>> The TCG optimizer should pick up aliased loads
>> and automatically eliminate the dead ones. We might need some new
>> machinery for the TCG to avoid spilling the values over potentially
>> faulting loads/stores but that is likely a phase 2 problem.
>
> I have no idea how to do this or even where to look. Some more
> detailed explanation may be needed here.

Don't worry about it now. Let's worry about it when we see how often
faulting instructions are interleaved with fp ops.

>
>> Next you will want to find places that care about the per-op bits of
>> cpu_fpscr and call a helper with the new globals to re-run the
>> computation and feed the values in.
>
> So the code that cares about these bits are in guest thus we would
> need to compute it if we detect the guest accessing these. Detecting
> when the individual bits are accessed might be difficult so at first
> we could go for checking if the fpscr is read and recompute FI bit
> then before returning value. You previously said these might be when
> fpscr is read or when generating exceptions but not sure where exactly
> are these done for ppc. (I'd expect to have mffpscr but there seem to
> be different other ops instead accessing parts of fpscr which are
> found in target/ppc/fp-impl.inc.c:567 so this would need studying the
> PPC docs to understand how the guest can access the FI bit of fpscr
> reg.)
>
>> That would give you a reasonable working prototype to start doing some
>> measurements of overhead and if it makes a difference.
>>
>>>
>>>>
>>>>> 3. If we clear the fp_exceptions , then we set fp_cache_length to 0 and
>>>>> clear  fp_exceptions.
>>>>> 4. If the  fp_cache are full, then we re-compute
>>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>>
>>>> All this cache management and more than one element seems unnecessary to
>>>> me although I may be missing something.
>>>>
>>>>> Now the keypoint is how to tracking the read and write of FPSCR register,
>>>>> The current code are
>>>>>    cpu_fpscr = tcg_global_mem_new(cpu_env,
>>>>>                                   offsetof(CPUPPCState, fpscr), "fpscr");
>>>>
>>>> Maybe you could search where the value is read which should be the places
>>>> where we need to handle it but changes may be needed to make a clear API
>>>> for this between target/ppc, TCG and softfloat which likely does not
>>>> exist yet.
>>
>> Once the per-op calculation is fixed in the PPC front-end I thing the
>> only change needed is to remove the #if defined(TARGET_PPC) in
>> softfloat.c - it's only really there because it avoids the overhead of
>> checking flags which we always know to be clear in it's case.
>
> That's the theory but I've found that removing that define currently
> makes general fp ops slower but vector ops faster so I think there may
> be some bugs that would need to be found and fixed. So testing with
> some proper test suite might be needed.

You might want to do what Laurent did and hack up a testfloat with
"system" implementations:

  https://github.com/vivier/m68k-testfloat/blob/master/testfloat/M68K-Linux-GCC/systfloat.c

It would be nice to plumb that sort of support into our existing
testfloat fork in the code base (tests/fp) but I suspect getting an
out-of-tree fork building and running first would be the quickest way
forward. 

>
> Regards,
> BALATON Zoltan


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: About hardfloat in ppc
  2020-05-01 13:10                                     ` Alex Bennée
  2020-05-01 13:39                                       ` BALATON Zoltan
@ 2020-05-01 14:18                                       ` Richard Henderson
  2020-05-01 16:25                                         ` 罗勇刚(Yonggang Luo)
  2020-05-01 16:29                                         ` 罗勇刚(Yonggang Luo)
  1 sibling, 2 replies; 40+ messages in thread
From: Richard Henderson @ 2020-05-01 14:18 UTC (permalink / raw)
  To: Alex Bennée, luoyonggang
  Cc: qemu-devel, Programmingkid, qemu-ppc, Howard Spoelstra, Dino Papararo

On 5/1/20 6:10 AM, Alex Bennée wrote:
> 
> 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> 
>> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>
>>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>>>> That's what I suggested,
>>>> We preserve a  float computing cache
>>>> typedef struct FpRecord {
>>>>  uint8_t op;
>>>>  float32 A;
>>>>  float32 B;
>>>> }  FpRecord;
>>>> FpRecord fp_cache[1024];
>>>> int fp_cache_length;
>>>> uint32_t fp_exceptions;
>>>>
>>>> 1. For each new fp operation we push it to the  fp_cache,
>>>> 2. Once we read the fp_exceptions , then we re-compute
>>>> the fp_exceptions by re-running the fp FpRecord sequence.
>>>> and clear  fp_cache_length.
>>>
>>> Why do you need to store more than the last fp op? The cumulative bits can
>>> be tracked like it's done for other targets by not clearing fp_status then
>>> you can read it from there. Only the non-sticky FI bit needs to be
>>> computed but that's only determined by the last op so it's enough to
>>> remember that and run that with softfloat (or even hardfloat after
>>> clearing status but softfloat may be faster for this) to get the bits for
>>> last op when status is read.
>>>
>> Yeap, store only the last fp op is also an option. Do you means that store
>> the last fp op,
>> and calculate it when necessary?  I am thinking about a general fp
>> optmize method that suite
>> for all target.
> 
> I think that's getting a little ahead of yourself. Let's prove the
> technique is valuable for PPC (given it has the most to gain). We can
> always generalise later if it's worthwhile.

Indeed.

> Rather than creating a new structure I would suggest creating 3 new tcg
> globals (op, inA, inB) and re-factor the front-end code so each FP op
> loaded the TCG globals. The TCG optimizer should pick up aliased loads
> and automatically eliminate the dead ones. We might need some new
> machinery for the TCG to avoid spilling the values over potentially
> faulting loads/stores but that is likely a phase 2 problem. 

There's no point in new tcg globals.

Every fp operation can raise an exception, and therefore every fp operation
will flush tcg globals to memory.  Therefore there is no optimization to be
done at the tcg opcode level.

However, every fp operation calls a helper function, and the quickest thing to
do is store the inputs to env->(op, inA, inB, inC) in the helper before
performing the operation.
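
A minimal sketch of that recording step, assuming hypothetical names
(`last_fp`, `record_fp_op`, `FP_OP_ADD`) and raw `uint64_t` operand bits
instead of QEMU's real `CPUPPCState` and `float64` types:

```c
#include <stdint.h>

/* Hypothetical opcode tags; real code would cover every fp helper. */
enum fp_op { FP_OP_NONE, FP_OP_ADD, FP_OP_MUL, FP_OP_MADD };

/* Hypothetical record of the most recent fp operation. */
struct last_fp_op {
    enum fp_op op;
    uint64_t in_a, in_b, in_c;   /* raw operand bits */
};

/* Stand-in for the CPU state; only the part this sketch needs. */
struct cpu_env {
    struct last_fp_op last_fp;
};

/* Called at the top of a helper, before the operation itself runs. */
static inline void record_fp_op(struct cpu_env *env, enum fp_op op,
                                uint64_t a, uint64_t b, uint64_t c)
{
    env->last_fp.op = op;
    env->last_fp.in_a = a;
    env->last_fp.in_b = b;
    env->last_fp.in_c = c;
}
```

Because the store happens inside the helper, nothing changes at the tcg
opcode level; a later read of the per-op status bits could replay
`last_fp` through softfloat.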


> Next you will want to find places that care about the per-op bits of
> cpu_fpscr and call a helper with the new globals to re-run the
> computation and feed the values in.

Before we even get to this deferred fp operation thing, there are several giant
improvements to ppc emulation that can be made:

Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
I've mentioned this before, that it's possible to leave the steady-state of
env->fp_status.exception_flags == 0, so there's no need for a separate function
call.  I suspect this is worth a decent speedup by itself.

Step 2 is to notice when all fp exceptions are masked, so that no exception can
be raised, and set a tb_flags bit.  This is the default fp environment that
libc enables and therefore extremely common.

Currently, ppc has 3 helpers called per fp operation.  If step 1 is handled
correctly, then we're down to 2 fp helpers per fp operation.  If no exceptions
need raising, then we can perform the entire operation with a single function call.

We would require a parallel set of fp helpers that (1) performs the operation
and (2) does any post-processing of the exception bits straight away, but (3)
without raising any exceptions.  Sort of like helper_fadd +
do_float_check_status, but less.  IIRC the only real extra work is categorizing
invalid exceptions.  We could even plausibly extend softfloat to do that while
it is recording the invalid exception.
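
As a rough sketch of the step 2 test, assuming the FPSCR enable bits sit at
the positions below (to be checked against target/ppc/cpu.h) and using a
hypothetical flag name `TB_FLAG_FP_NOTRAP`:

```c
#include <stdint.h>

/*
 * FPSCR exception-enable bits, counted from the least significant bit.
 * These positions are an assumption of this sketch.
 */
#define FPSCR_XE_BIT 3   /* inexact enable */
#define FPSCR_ZE_BIT 4   /* divide-by-zero enable */
#define FPSCR_UE_BIT 5   /* underflow enable */
#define FPSCR_OE_BIT 6   /* overflow enable */
#define FPSCR_VE_BIT 7   /* invalid-operation enable */

#define FPSCR_ENABLE_MASK                                   \
    ((1u << FPSCR_XE_BIT) | (1u << FPSCR_ZE_BIT) |          \
     (1u << FPSCR_UE_BIT) | (1u << FPSCR_OE_BIT) |          \
     (1u << FPSCR_VE_BIT))

/* Hypothetical tb_flags bit: no fp operation can raise an exception. */
#define TB_FLAG_FP_NOTRAP (1u << 0)

/*
 * Compute the flag from the current FPSCR value.  This ignores the
 * MSR[FE0]/MSR[FE1] bits, which a real implementation would also
 * have to consider.
 */
static inline uint32_t fp_tb_flags(uint32_t fpscr)
{
    return (fpscr & FPSCR_ENABLE_MASK) == 0 ? TB_FLAG_FP_NOTRAP : 0;
}
```

The translator could then select the non-raising set of helpers whenever
this bit is present in the translation block's flags.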

Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute inexact
from the inverse hardfloat operation.  This would let us relax the restriction
of only using hardfloat when we already have an accrued inexact exception.
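
To illustrate detecting inexact with host arithmetic only, here is a sketch
for addition.  It uses Knuth's TwoSum error-free transformation, which is a
swapped-in technique and not necessarily the exact inverse-operation scheme
proposed in this thread; the function name `add_is_inexact` is hypothetical.
It assumes round-to-nearest, no overflow, and no -ffast-math.

```c
#include <stdbool.h>

/*
 * Returns true if a + b is not exactly representable as a double.
 * After the TwoSum steps, err holds the exact rounding error of s,
 * so the addition was inexact precisely when err is non-zero.
 * The volatile qualifiers force each intermediate to be rounded to
 * double, guarding against excess precision on some hosts.
 */
static bool add_is_inexact(double a, double b)
{
    volatile double s = a + b;       /* rounded sum */
    volatile double bv = s - a;      /* part of s attributed to b */
    volatile double av = s - bv;     /* part of s attributed to a */
    volatile double err = (a - av) + (b - bv);
    return err != 0.0;
}
```

A hardfloat fast path could call such a check only while the accrued
inexact flag is still clear.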

Only after all of these are done is it worth experimenting with caching the
last fp operation.


r~



* Re: About hardfloat in ppc
  2020-05-01 14:18                                       ` Richard Henderson
@ 2020-05-01 16:25                                         ` 罗勇刚(Yonggang Luo)
  2020-05-01 19:33                                           ` Alex Bennée
  2020-05-01 16:29                                         ` 罗勇刚(Yonggang Luo)
  1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 16:25 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Alex Bennée


On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 5/1/20 6:10 AM, Alex Bennée wrote:
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu>
> wrote:
> >>
> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> >>>> That's what I suggested,
> >>>> We preserve a  float computing cache
> >>>> typedef struct FpRecord {
> >>>>  uint8_t op;
> >>>>  float32 A;
> >>>>  float32 B;
> >>>> }  FpRecord;
> >>>> FpRecord fp_cache[1024];
> >>>> int fp_cache_length;
> >>>> uint32_t fp_exceptions;
> >>>>
> >>>> 1. For each new fp operation we push it to the  fp_cache,
> >>>> 2. Once we read the fp_exceptions , then we re-compute
> >>>> the fp_exceptions by re-running the fp FpRecord sequence.
> >>>> and clear  fp_cache_length.
> >>>
> >>> Why do you need to store more than the last fp op? The cumulative bits
> can
> >>> be tracked like it's done for other targets by not clearing fp_status
> then
> >>> you can read it from there. Only the non-sticky FI bit needs to be
> >>> computed but that's only determined by the last op so it's enough to
> >>> remember that and run that with softfloat (or even hardfloat after
> >>> clearing status but softfloat may be faster for this) to get the bits
> for
> >>> last op when status is read.
> >>>
> >> Yeap, store only the last fp op is also an option. Do you means that
> store
> >> the last fp op,
> >> and calculate it when necessary?  I am thinking about a general fp
> >> optmize method that suite
> >> for all target.
> >
> > I think that's getting a little ahead of yourself. Let's prove the
> > technique is valuable for PPC (given it has the most to gain). We can
> > always generalise later if it's worthwhile.
>
> Indeed.
>
> > Rather than creating a new structure I would suggest creating 3 new tcg
> > globals (op, inA, inB) and re-factor the front-end code so each FP op
> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
> > and automatically eliminate the dead ones. We might need some new
> > machinery for the TCG to avoid spilling the values over potentially
> > faulting loads/stores but that is likely a phase 2 problem.
>
> There's no point in new tcg globals.
>
> Every fp operation can raise an exception, and therefore every fp operation
> will flush tcg globals to memory.  Therefore there is no optimization to be
> done at the tcg opcode level.
>
> However, every fp operation calls a helper function, and the quickest
> thing to
> do is store the inputs to env->(op, inA, inB, inC) in the helper before
> performing the operation.
>
I think it is possible to add TCG ops to optimize floating point. For
example, WebAssembly does not support floating-point exceptions or rounding
modes at all, and I suppose most fp code does not care about the rounding
mode or fp exceptions. For that path we can use a tcg-op to abstract the
operation, and in all other conditions we can fall back to soft-float. As a
final step in optimizing fp acceleration in QEMU, we can split the tcg-op
into two paths: one is hard-float with a result cache for lazy fp flag
calculation, and the other is a pure soft-float path.
For lazy fp flag calculation, because we have sticky flags
```
    float_flag_invalid   =  1,
    float_flag_divbyzero =  4,
    float_flag_overflow  =  8,
    float_flag_underflow = 16,
    float_flag_inexact   = 32,
```
we can skip the calculation of any of these flags that is already set to 1.
We can split the calculation into five functions, each checking only one of
the five flags. Once a flag is set to 1, we do not call its function again
until the flag is cleared. This avoids a lot of branches, and the functions
would only be called when the fp flags are requested.
This is my final goal for optimizing fp in QEMU; before that, we can do
simpler things to optimize fp in QEMU.
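
A small sketch of the bookkeeping this would need, using the flag values
listed above; the names `FLOAT_FLAG_ALL`, `flags_still_needed` and
`accrue_flags` are hypothetical:

```c
#include <stdint.h>

enum {
    float_flag_invalid   =  1,
    float_flag_divbyzero =  4,
    float_flag_overflow  =  8,
    float_flag_underflow = 16,
    float_flag_inexact   = 32,
};

#define FLOAT_FLAG_ALL                                          \
    (float_flag_invalid | float_flag_divbyzero |                \
     float_flag_overflow | float_flag_underflow |               \
     float_flag_inexact)

/* Flags that are not yet set and therefore still worth computing. */
static inline uint8_t flags_still_needed(uint8_t accrued)
{
    return (uint8_t)(FLOAT_FLAG_ALL & ~accrued);
}

/* Merge newly raised flags into the sticky set. */
static inline uint8_t accrue_flags(uint8_t accrued, uint8_t raised)
{
    return (uint8_t)(accrued | (raised & FLOAT_FLAG_ALL));
}
```

Only the per-flag functions whose bit is present in
`flags_still_needed()` would have to run.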

Besides these optimizations, we can also offload the fp exception
calculation to another CPU core to improve single-thread performance.
Single-core performance is hard to improve, but multi-core systems are
more and more common these days; with Ryzen 2 / Threadripper we even have
64 cores / 128 threads.



>
> > Next you will want to find places that care about the per-op bits of
> > cpu_fpscr and call a helper with the new globals to re-run the
> > computation and feed the values in.
>
> Before we even get to this deferred fp operation thing, there are several
> giant
> improvements to ppc emulation that can be made:
>
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate
> function
> call.  I suspect this is worth a decent speedup by itself.
>
I would like to start the fp optimization from here.


>
> Step 2 is to notice when all fp exceptions are masked, so that no
> exception can
> be raised, and set a tb_flags bit.  This is the default fp environment that
> libc enables and therefore extremely common.
>
> Currently, ppc has 3 helpers called per fp operation.  If step 1 is handled
> correctly, then we're down to 2 fp helpers per fp operation.  If no
> exceptions
> need raising, then we can perform the entire operation with a single
> function call.
>
> We would require a parallel set of fp helpers that (1) performs the
> operation
> and (2) does any post-processing of the exception bits straight away, but
> (3)
> without raising any exceptions.  Sort of like helper_fadd +
> do_float_check_status, but less.  IIRC the only real extra work is
> categorizing
> invalid exceptions.  We could even plausibly extend softfloat to do that
> while
> it is recording the invalid exception.
>
> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute
> inexact
> from the inverse hardfloat operation.  This would let us relax the
> restriction
> of only using hardfloat when we have already have an accrued inexact
> exception.
>
> Only after all of these are done is it worth experimenting with caching the
> last fp operation.
>
>
> r~
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo


* Re: About hardfloat in ppc
  2020-05-01 14:18                                       ` Richard Henderson
  2020-05-01 16:25                                         ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 16:29                                         ` 罗勇刚(Yonggang Luo)
  2020-05-01 16:51                                           ` Richard Henderson
  1 sibling, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 16:29 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Alex Bennée


On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 5/1/20 6:10 AM, Alex Bennée wrote:
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu>
> wrote:
> >>
> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> >>>> That's what I suggested,
> >>>> We preserve a  float computing cache
> >>>> typedef struct FpRecord {
> >>>>  uint8_t op;
> >>>>  float32 A;
> >>>>  float32 B;
> >>>> }  FpRecord;
> >>>> FpRecord fp_cache[1024];
> >>>> int fp_cache_length;
> >>>> uint32_t fp_exceptions;
> >>>>
> >>>> 1. For each new fp operation we push it to the  fp_cache,
> >>>> 2. Once we read the fp_exceptions , then we re-compute
> >>>> the fp_exceptions by re-running the fp FpRecord sequence.
> >>>> and clear  fp_cache_length.
> >>>
> >>> Why do you need to store more than the last fp op? The cumulative bits
> can
> >>> be tracked like it's done for other targets by not clearing fp_status
> then
> >>> you can read it from there. Only the non-sticky FI bit needs to be
> >>> computed but that's only determined by the last op so it's enough to
> >>> remember that and run that with softfloat (or even hardfloat after
> >>> clearing status but softfloat may be faster for this) to get the bits
> for
> >>> last op when status is read.
> >>>
> >> Yeap, store only the last fp op is also an option. Do you means that
> store
> >> the last fp op,
> >> and calculate it when necessary?  I am thinking about a general fp
> >> optmize method that suite
> >> for all target.
> >
> > I think that's getting a little ahead of yourself. Let's prove the
> > technique is valuable for PPC (given it has the most to gain). We can
> > always generalise later if it's worthwhile.
>
> Indeed.
>
> > Rather than creating a new structure I would suggest creating 3 new tcg
> > globals (op, inA, inB) and re-factor the front-end code so each FP op
> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
> > and automatically eliminate the dead ones. We might need some new
> > machinery for the TCG to avoid spilling the values over potentially
> > faulting loads/stores but that is likely a phase 2 problem.
>
> There's no point in new tcg globals.
>
> Every fp operation can raise an exception, and therefore every fp operation
> will flush tcg globals to memory.  Therefore there is no optimization to be
> done at the tcg opcode level.
>
> However, every fp operation calls a helper function, and the quickest
> thing to
> do is store the inputs to env->(op, inA, inB, inC) in the helper before
> performing the operation.
>
>
> > Next you will want to find places that care about the per-op bits of
> > cpu_fpscr and call a helper with the new globals to re-run the
> > computation and feed the values in.
>
> Before we even get to this deferred fp operation thing, there are several
> giant
> improvements to ppc emulation that can be made:
>
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate
> function
> call.  I suspect this is worth a decent speedup by itself.
>
Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
Can you give me a more detailed example? I still don't get the idea.

>
> Step 2 is to notice when all fp exceptions are masked, so that no
> exception can
> be raised, and set a tb_flags bit.  This is the default fp environment that
> libc enables and therefore extremely common.
>
> Currently, ppc has 3 helpers called per fp operation.  If step 1 is handled
> correctly, then we're down to 2 fp helpers per fp operation.  If no
> exceptions
> need raising, then we can perform the entire operation with a single
> function call.
>
> We would require a parallel set of fp helpers that (1) performs the
> operation
> and (2) does any post-processing of the exception bits straight away, but
> (3)
> without raising any exceptions.  Sort of like helper_fadd +
> do_float_check_status, but less.  IIRC the only real extra work is
> categorizing
> invalid exceptions.  We could even plausibly extend softfloat to do that
> while
> it is recording the invalid exception.
>
> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute
> inexact
> from the inverse hardfloat operation.  This would let us relax the
> restriction
> of only using hardfloat when we have already have an accrued inexact
> exception.
>
> Only after all of these are done is it worth experimenting with caching the
> last fp operation.
>
>
> r~
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo


* Re: About hardfloat in ppc
  2020-05-01 16:29                                         ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 16:51                                           ` Richard Henderson
  2020-05-01 17:49                                             ` 罗勇刚(Yonggang Luo)
  0 siblings, 1 reply; 40+ messages in thread
From: Richard Henderson @ 2020-05-01 16:51 UTC (permalink / raw)
  To: luoyonggang
  Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Alex Bennée

On 5/1/20 9:29 AM, 罗勇刚(Yonggang Luo) wrote:
> On Fri, May 1, 2020 at 10:18 PM Richard Henderson <richard.henderson@linaro.org
>     Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
>     I've mentioned this before, that it's possible to leave the steady-state of
>     env->fp_status.exception_flags == 0, so there's no need for a separate function
>     call.  I suspect this is worth a decent speedup by itself.
> 
> Hi Richard, what kinds of rearrange the fp need to be done? Can you give me a
> more detailed example? I am still not get the idea.

See target/openrisc, helper_update_fpcsr.

This is like target/ppc helper_float_check_status, in that it is called after
the primary fpu helper, after the fpu result is written back to the
architectural register, to process fpu exceptions.

Note that if get_float_exception_flags returns non-zero, we immediately reset
them to zero.  Thus the exception flags are only ever non-zero in between the
primary fpu operation and the update of the fpscr.

Thus, no need for a separate helper_reset_fpstatus.
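
In miniature, the pattern looks like this.  The structure and function
names are hypothetical stand-ins rather than the real QEMU types:

```c
#include <stdint.h>

/* Hypothetical model of the two pieces of state involved. */
struct fpu_model {
    uint8_t op_flags;   /* stands in for fp_status.exception_flags */
    uint32_t sticky;    /* stands in for the guest's accrued status bits */
};

/*
 * Called after the primary fp helper.  Consuming the flags and
 * clearing them in the same place restores the steady state
 * (op_flags == 0) before the next fp operation begins.
 */
static inline void update_status(struct fpu_model *m)
{
    uint8_t flags = m->op_flags;

    if (flags) {
        m->op_flags = 0;
        m->sticky |= flags;   /* a real target would translate the bits */
    }
}
```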


r~



* Re: About hardfloat in ppc
  2020-05-01 16:51                                           ` Richard Henderson
@ 2020-05-01 17:49                                             ` 罗勇刚(Yonggang Luo)
  2020-05-01 20:35                                               ` Richard Henderson
  0 siblings, 1 reply; 40+ messages in thread
From: 罗勇刚(Yonggang Luo) @ 2020-05-01 17:49 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Alex Bennée


On Sat, May 2, 2020 at 12:51 AM Richard Henderson <
richard.henderson@linaro.org> wrote:

> On 5/1/20 9:29 AM, 罗勇刚(Yonggang Luo) wrote:
> > On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
> richard.henderson@linaro.org
> >     Step 1 is to rearrange the fp helpers to eliminate
> helper_reset_fpstatus().
> >     I've mentioned this before, that it's possible to leave the
> steady-state of
> >     env->fp_status.exception_flags == 0, so there's no need for a
> separate function
> >     call.  I suspect this is worth a decent speedup by itself.
> >
> > Hi Richard, what kinds of rearrange the fp need to be done? Can you give
> me a
> > more detailed example? I am still not get the idea.
>
> See target/openrisc, helper_update_fpcsr.
>
> This is like target/ppc helper_float_check_status, in that it is called
> after
> the primary fpu helper, after the fpu result is written back to the
> architectural register, to process fpu exceptions.
>
> Note that if get_float_exception_flags returns non-zero, we immediately
> reset
> them to zero.  Thus the exception flags are only ever non-zero in between
> the
> primary fpu operation and the update of the fpscr.
>
According to
```
void HELPER(update_fpcsr)(CPUOpenRISCState *env)
{
    int tmp = get_float_exception_flags(&env->fp_status);

    if (tmp) {
        set_float_exception_flags(0, &env->fp_status);
        tmp = ieee_ex_to_openrisc(tmp);
        if (tmp) {
            env->fpcsr |= tmp;
            if (env->fpcsr & FPCSR_FPEE) {
                helper_exception(env, EXCP_FPE);
            }
        }
    }
}
```
So openrisc also clears the flags before each fp operation?

>
> Thus, no need for a separate helper_reset_fpstatus.
>
>
> r~
>


-- 
         此致
礼
罗勇刚
Yours
    sincerely,
Yonggang Luo


* Re: About hardfloat in ppc
  2020-05-01 16:25                                         ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 19:33                                           ` Alex Bennée
  0 siblings, 0 replies; 40+ messages in thread
From: Alex Bennée @ 2020-05-01 19:33 UTC (permalink / raw)
  To: luoyonggang
  Cc: Richard Henderson, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Dino Papararo


罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:

> On Fri, May 1, 2020 at 10:18 PM Richard Henderson <
> richard.henderson@linaro.org> wrote:
>
>> On 5/1/20 6:10 AM, Alex Bennée wrote:
>> >
>> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
>> >
>> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu>
>> wrote:
>> >>
>> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
>> >>>> That's what I suggested,
>> >>>> We preserve a  float computing cache
>> >>>> typedef struct FpRecord {
>> >>>>  uint8_t op;
>> >>>>  float32 A;
>> >>>>  float32 B;
>> >>>> }  FpRecord;
>> >>>> FpRecord fp_cache[1024];
>> >>>> int fp_cache_length;
>> >>>> uint32_t fp_exceptions;
>> >>>>
>> >>>> 1. For each new fp operation we push it to the  fp_cache,
>> >>>> 2. Once we read the fp_exceptions , then we re-compute
>> >>>> the fp_exceptions by re-running the fp FpRecord sequence.
>> >>>> and clear  fp_cache_length.
>> >>>
>> >>> Why do you need to store more than the last fp op? The cumulative bits
>> can
>> >>> be tracked like it's done for other targets by not clearing fp_status
>> then
>> >>> you can read it from there. Only the non-sticky FI bit needs to be
>> >>> computed but that's only determined by the last op so it's enough to
>> >>> remember that and run that with softfloat (or even hardfloat after
>> >>> clearing status but softfloat may be faster for this) to get the bits
>> for
>> >>> last op when status is read.
>> >>>
>> >> Yeap, store only the last fp op is also an option. Do you means that
>> store
>> >> the last fp op,
>> >> and calculate it when necessary?  I am thinking about a general fp
>> >> optmize method that suite
>> >> for all target.
>> >
>> > I think that's getting a little ahead of yourself. Let's prove the
>> > technique is valuable for PPC (given it has the most to gain). We can
>> > always generalise later if it's worthwhile.
>>
>> Indeed.
>>
>> > Rather than creating a new structure I would suggest creating 3 new tcg
>> > globals (op, inA, inB) and re-factor the front-end code so each FP op
>> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
>> > and automatically eliminate the dead ones. We might need some new
>> > machinery for the TCG to avoid spilling the values over potentially
>> > faulting loads/stores but that is likely a phase 2 problem.
>>
>> There's no point in new tcg globals.
>>
>> Every fp operation can raise an exception, and therefore every fp operation
>> will flush tcg globals to memory.  Therefore there is no optimization to be
>> done at the tcg opcode level.
>>
>> However, every fp operation calls a helper function, and the quickest
>> thing to
>> do is store the inputs to env->(op, inA, inB, inC) in the helper before
>> performing the operation.
>>
> I thinks there is a possibility to add the tcg ops to optimize the floating
> point; For example
> WebAssembly doesn't support for float point exception and fp round mode at
> all, I suppose most fp execution are no need care about
>  round mode  and fp expcetion, and for this path we can use tcg-op to
> abstract it,
> and for all other condition we can downgrading to soft-float. As a final
> path to optmize to fp accel of
> QEMU, we can split the tcg-op into two path. one is hard-float with result
> cache for lazy fp flags calculating
> And one is pure soft-float path.

We have talked about adding support for floating point TCG ops in the
past but I think we would need to be a fair bit farther down the road
before we can attempt that. The overhead of the helper call is
relatively minimal compared to that of executing the operation
itself. As you can see from all the various front end wrappings around
the softfloat code, there is a fair number of implementation details
you'd need to abstract away into the TCG generation code to make it
useful for all our guests.

> For lazy fp flags calculating, cause we have stick flags
> ```
>     float_flag_invalid   =  1,
>     float_flag_divbyzero =  4,
>     float_flag_overflow  =  8,
>     float_flag_underflow = 16,
>     float_flag_inexact   = 32,
> ```
> We can skip the calculation of these flags when these flags are already
> marked to 1.
> For these five flags, we can split to 5 calculating function, One function
> only check one of the flags.
> And once the flags are set to 1, then we won't call the functon any more,
> unless the flag are cleared.
> We will reduce a lot of branch prediction. And the function would only be
> called when the
> fp flags are requested.
> This is my final goal to optimize fp in QEMU, before that, we can do
> simpler things to optimize fp in QEMU
>
> And besides these type of optimization, we can also offloading the fp
> exception calculating to other CPU core, so
> we can making single threading performance be better, cause single core
> performance are hard to improve, but multiple core
> system are more and more used in these days, for Ryzen 2/ Threadripper we
> even have 64-core /128 threads.

I would take some convincing that offloading exception calculation to
another thread would make a difference - surely there would be
inter-thread syncing required? Our main approach to threading has been
trying to improve scalability for softmmu so we can emulate more vCPUs
in the system.

>
>
>
>>
>> > Next you will want to find places that care about the per-op bits of
>> > cpu_fpscr and call a helper with the new globals to re-run the
>> > computation and feed the values in.
>>
>> Before we even get to this deferred fp operation thing, there are several
>> giant
>> improvements to ppc emulation that can be made:
>>
>> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
>> I've mentioned this before, that it's possible to leave the steady-state of
>> env->fp_status.exception_flags == 0, so there's no need for a separate
>> function
>> call.  I suspect this is worth a decent speedup by itself.
>>
> I would like to start the fp optimize from here.
>
>
>>
>> Step 2 is to notice when all fp exceptions are masked, so that no
>> exception can
>> be raised, and set a tb_flags bit.  This is the default fp environment that
>> libc enables and therefore extremely common.
>>
>> Currently, ppc has 3 helpers called per fp operation.  If step 1 is handled
>> correctly, then we're down to 2 fp helpers per fp operation.  If no
>> exceptions
>> need raising, then we can perform the entire operation with a single
>> function call.
>>
>> We would require a parallel set of fp helpers that (1) performs the
>> operation
>> and (2) does any post-processing of the exception bits straight away, but
>> (3)
>> without raising any exceptions.  Sort of like helper_fadd +
>> do_float_check_status, but less.  IIRC the only real extra work is
>> categorizing
>> invalid exceptions.  We could even plausibly extend softfloat to do that
>> while
>> it is recording the invalid exception.
>>
>> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute
>> inexact
>> from the inverse hardfloat operation.  This would let us relax the
>> restriction
>> of only using hardfloat when we have already have an accrued inexact
>> exception.
>>
>> Only after all of these are done is it worth experimenting with caching the
>> last fp operation.
>>
>>
>> r~
>>


-- 
Alex Bennée



* Re: About hardfloat in ppc
  2020-05-01 17:49                                             ` 罗勇刚(Yonggang Luo)
@ 2020-05-01 20:35                                               ` Richard Henderson
  0 siblings, 0 replies; 40+ messages in thread
From: Richard Henderson @ 2020-05-01 20:35 UTC (permalink / raw)
  To: luoyonggang
  Cc: Dino Papararo, qemu-devel, Programmingkid, qemu-ppc,
	Howard Spoelstra, Alex Bennée

On 5/1/20 10:49 AM, 罗勇刚(Yonggang Luo) wrote:
> 
> 
> On Sat, May 2, 2020 at 12:51 AM Richard Henderson <richard.henderson@linaro.org
> <mailto:richard.henderson@linaro.org>> wrote:
> 
>     On 5/1/20 9:29 AM, 罗勇刚(Yonggang Luo) wrote:
>     > On Fri, May 1, 2020 at 10:18 PM Richard Henderson
>     <richard.henderson@linaro.org <mailto:richard.henderson@linaro.org>
>     >     Step 1 is to rearrange the fp helpers to eliminate
>     helper_reset_fpstatus().
>     >     I've mentioned this before, that it's possible to leave the
>     steady-state of
>     >     env->fp_status.exception_flags == 0, so there's no need for a
>     separate function
>     >     call.  I suspect this is worth a decent speedup by itself.
>     >
>     > Hi Richard, what kinds of rearrange the fp need to be done? Can you give me a
>     > more detailed example? I am still not get the idea.
> 
>     See target/openrisc, helper_update_fpcsr.
> 
>     This is like target/ppc helper_float_check_status, in that it is called after
>     the primary fpu helper, after the fpu result is written back to the
>     architectural register, to process fpu exceptions.
> 
>     Note that if get_float_exception_flags returns non-zero, we immediately reset
>     them to zero.  Thus the exception flags are only ever non-zero in between the
>     primary fpu operation and the update of the fpscr.
> 
> According to 
> ```
> void HELPER(update_fpcsr)(CPUOpenRISCState *env)
> {
>     int tmp = get_float_exception_flags(&env->fp_status);
> 
>     if (tmp) {
>         set_float_exception_flags(0, &env->fp_status);
>         tmp = ieee_ex_to_openrisc(tmp);
>         if (tmp) {
>             env->fpcsr |= tmp;
>             if (env->fpcsr & FPCSR_FPEE) {
>                 helper_exception(env, EXCP_FPE);
>             }
>         }
>     }
> }
> ```
> The openrisc also clearing the flags before each fp operation?

No.  Please re-read my description above.

OpenRISC is clearing the flags *after* each fp operation, at the same time that
it processes the flags from the current fp operation.

There are two calls at runtime for openrisc, e.g. do_fp2:

    fn(cpu_R(dc, a->d), cpu_env, cpu_R(dc, a->a));
    gen_helper_update_fpcsr(cpu_env);

Whereas for ppc there are between 2 and 5 calls at runtime, e.g. in _GEN_FLOAT_ACB:

>     gen_reset_fpstatus();                           [1]
>     get_fpr(t0, rA(ctx->opcode));                   
>     get_fpr(t1, rC(ctx->opcode));                   
>     get_fpr(t2, rB(ctx->opcode));                   
>     gen_helper_f##op(t3, cpu_env, t0, t1, t2);      [2]
>     if (isfloat) {                                  
>         gen_helper_frsp(t3, cpu_env, t3);           [3]
>     }                                               
>     set_fpr(rD(ctx->opcode), t3);                   
>     if (set_fprf) {                                 
>         gen_compute_fprf_float64(t3);               [4]
>     }                                               
>     if (unlikely(Rc(ctx->opcode) != 0)) {           
>         gen_set_cr1_from_fpscr(ctx);                [5]
>     }                                               

For step 1, we're talking about removing the call to gen_reset_fpstatus.

It might be worth adding a debugging check to the beginning of each helper of
the form [2] to assert that the exception flags are in fact zero.  This check
might be removed later, in relation to future improvements, but it can help
ensure that the value of set_fprf is correct, and validate that step 1 isn't
breaking anything.
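
A debugging check of that kind could look roughly like the sketch below. This is a standalone model under stated assumptions, not the real ppc helper code: dbg_float_status, dbg_flags_clear and dbg_helper_fadd are hypothetical names, and the 'raised' argument stands in for whatever flags the real arithmetic would set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for float_status; only the flags are modelled. */
typedef struct {
    uint8_t exception_flags;
} dbg_float_status;

/* True when no exception flags are left over from an earlier operation. */
static inline bool dbg_flags_clear(const dbg_float_status *s)
{
    return s->exception_flags == 0;
}

/*
 * Hypothetical helper of the form [2].  The assert documents and checks
 * the invariant that the previous helper consumed and cleared its flags.
 */
static inline double dbg_helper_fadd(dbg_float_status *s, double a, double b,
                                     uint8_t raised)
{
    assert(dbg_flags_clear(s));
    s->exception_flags |= raised;
    return a + b;
}
```

If a helper is ever entered with stale flags, the assert fires at the point where the invariant is broken, rather than showing up later as a wrong guest-visible status bit.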


r~


Thread overview: 40+ messages
2020-04-27  6:39 About hardfloat in ppc 罗勇刚(Yonggang Luo)
2020-04-27  9:42 ` Alex Bennée
2020-04-27 10:34   ` BALATON Zoltan
2020-04-27 11:10     ` Alex Bennée
2020-04-27 21:18       ` 罗勇刚(Yonggang Luo)
2020-04-28  8:36         ` Alex Bennée
2020-04-28 14:29           ` 罗勇刚(Yonggang Luo)
2020-04-29 10:17           ` R: " Dino Papararo
2020-04-29 10:31             ` Dino Papararo
2020-04-29 11:57             ` Alex Bennée
2020-04-29 12:33               ` 罗勇刚(Yonggang Luo)
2020-04-29 13:38                 ` Alex Bennée
2020-04-29 14:31               ` R: " Dino Papararo
2020-04-29 14:49                 ` Peter Maydell
2020-04-29 18:25                 ` R: " Alex Bennée
2020-04-30  0:20                   ` 罗勇刚(Yonggang Luo)
2020-04-30  2:18                     ` Richard Henderson
2020-04-30  7:26                       ` 罗勇刚(Yonggang Luo)
2020-04-30  8:11                         ` Alex Bennée
2020-04-30  8:13                       ` 罗勇刚(Yonggang Luo)
2020-04-30 15:35                         ` BALATON Zoltan
2020-04-30 16:34                           ` R: " Dino Papararo
2020-05-01  1:59                             ` Programmingkid
2020-05-01  2:21                               ` 罗勇刚(Yonggang Luo)
2020-05-01 11:58                                 ` BALATON Zoltan
2020-05-01 12:04                                   ` 罗勇刚(Yonggang Luo)
2020-05-01 13:10                                     ` Alex Bennée
2020-05-01 13:39                                       ` BALATON Zoltan
2020-05-01 14:01                                         ` Alex Bennée
2020-05-01 14:18                                       ` Richard Henderson
2020-05-01 16:25                                         ` 罗勇刚(Yonggang Luo)
2020-05-01 19:33                                           ` Alex Bennée
2020-05-01 16:29                                         ` 罗勇刚(Yonggang Luo)
2020-05-01 16:51                                           ` Richard Henderson
2020-05-01 17:49                                             ` 罗勇刚(Yonggang Luo)
2020-05-01 20:35                                               ` Richard Henderson
2020-04-29 23:12               ` R: " 罗勇刚(Yonggang Luo)
2020-04-30 15:16           ` BALATON Zoltan
2020-04-30 18:59             ` Alex Bennée
2020-04-30 20:17               ` BALATON Zoltan