qemu-devel.nongnu.org archive mirror
* TCG Floating Point Support (Work in Progress)
@ 2021-09-30  5:39 Matt
  2021-09-30  7:30 ` Matt
  2021-09-30  9:13 ` Alex Bennée
  0 siblings, 2 replies; 7+ messages in thread
From: Matt @ 2021-09-30  5:39 UTC
  To: qemu-devel; +Cc: peter.maydell, alex.bennee, aurelien

Hello--

I'm excited to share that I have been developing support for TCG
floating point operations; specifically, to accelerate emulation of
x86 guest code which heavily exercises the x87 FPU for a game console
emulator project based on QEMU. So far, this work has shown great
promise, demonstrating some dramatic performance improvements in
emulation of x87-heavy code.

The feature works in concert with unaccelerated x87 FPU helpers, and
also allows total soft float helper fallback if the user discovers
some issue with the hard float implementation. For the TCG target,
I've opted to implement it for x86-64 hosts using SSE2, although this
could be extended to support full 80-bit double extended precision with
host x87 support. I'm also in early development of an implementation
for AArch64 hosts.

There are still some significant tasks to be done, like proper
handling of exception flags, edge cases, and testing, to name a few.
Once in a slightly more mature state, I do think this feature would
make a natural addition to upstream QEMU and plan to submit it for
consideration.

I'm writing to the mailing list now to inform FPU maintainers and any
other interested parties that this work is happening, to solicit any
early feedback, and to extend an invitation to anyone interested in
collaborating to expedite its upstreaming.

My initial TCG FP work can be found here:
https://github.com/mborgerson/xemu/pull/464/commits

Thanks,
Matt



* Re: TCG Floating Point Support (Work in Progress)
  2021-09-30  5:39 TCG Floating Point Support (Work in Progress) Matt
@ 2021-09-30  7:30 ` Matt
  2021-09-30  9:13 ` Alex Bennée
  1 sibling, 0 replies; 7+ messages in thread
From: Matt @ 2021-09-30  7:30 UTC
  To: qemu-devel; +Cc: Peter Maydell, Alex Bennée, aurelien

Clarification: In my previous message, I talked a lot about x87
emulation; I want to make clear that x87 is merely my motivator. The
eventual goal of this TCG FP support is not only to enable fast x87
emulation, but to be generic and robust enough that other QEMU targets
could be modified to use it for accelerated floating point
emulation, and that it can be implemented for various TCG targets where
possible, with x86-64 and AArch64 being my priorities.

Thanks,
Matt


* Re: TCG Floating Point Support (Work in Progress)
  2021-09-30  5:39 TCG Floating Point Support (Work in Progress) Matt
  2021-09-30  7:30 ` Matt
@ 2021-09-30  9:13 ` Alex Bennée
  2021-10-01  2:47   ` Matt
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2021-09-30  9:13 UTC
  To: Matt; +Cc: peter.maydell, Richard Henderson, qemu-devel, aurelien


Matt <mborgerson@gmail.com> writes:

> Hello--
>
> I'm excited to share that I have been developing support for TCG
> floating point operations; specifically, to accelerate emulation of
> x86 guest code which heavily exercises the x87 FPU for a game console
> emulator project based on QEMU. So far, this work has shown great
> promise, demonstrating some dramatic performance improvements in
> emulation of x87 heavy code.

I've not reviewed the code as it is a rather large diff. For your proper
submission could you please ensure that your patch series is broken up
into discrete commits to aid review. It also aids bisection if
regressions are identified.

> The feature works in concert with unaccelerated x87 FPU helpers, and
> also allows total soft float helper fallback if the user discovers
> some issue with the hard float implementation.

The phrase "if the user discovers some issues" is a bit of a red flag.
We should always be striving for correct emulation of floating point.
Indeed we have recently got the code base to the point we pass all of
the Berkeley softfloat test suite. This can be checked by running:

  make check-softfloat

However the test code links directly to the softfloat code so won't work
with direct code execution. The existing 32/64 bit hardfloat
optimisations work within the helpers. While generating direct code is
appealing to avoid the cost of helper calls, it's fairly well cached and
predicted code. Experience with the initial hardfloat code showed there
was still a performance win even with the cost of the helper call.

> For the TCG target,
> I've opted to implement it for x86-64 hosts using SSE2, although this
> could be extended to support full 80b double extended precision with
> host x87 support. I'm also in early development of an implementation
> for AArch64 hosts.

I don't think you'll see the same behaviour emulating an x87 on anything
that isn't an x87 because the boundaries for things like inexact
calculation will be different. Indeed if you look at the existing
hardfloat code function can_use_fpu() you will see we only call the host
processor function if the inexact bit is already set. Other wrappers
have even more checks for normal numbers. Anything that needs NaN
handling will fall back to the correct softfloat code.
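
As a rough illustration of the gating described above, here is a toy
sketch in C. It is not QEMU's code: fp_status, guarded_add(), soft_add()
and the flag value are hypothetical names made up for the example, and
soft_add() is only a stand-in for a real software implementation. The
point is the shape of the checks: the host FPU is used only when the
inexact flag is already recorded, the rounding mode matches the host
default, and both inputs and the result are zero or normal.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Hypothetical names throughout; a toy model, not QEMU's softfloat code. */
enum { FLAG_INEXACT = 1u << 0 };

typedef struct {
    unsigned flags;           /* accumulated guest exception flags */
    bool round_nearest_even;  /* guest rounding mode equals the host default */
    unsigned soft_calls;      /* counts slow-path calls, for demonstration */
} fp_status;

/* The host FPU cannot report a *new* inexact cheaply, so it is only
 * trusted once inexact is already set and rounding matches the host. */
static bool can_use_host_fpu(const fp_status *s)
{
    return (s->flags & FLAG_INEXACT) != 0 && s->round_nearest_even;
}

/* True for +-0 and normal numbers; false for subnormals, infinities, NaNs. */
static bool zero_or_normal(double x)
{
    double m = x < 0 ? -x : x;
    return x == 0.0 || (m >= DBL_MIN && m <= DBL_MAX);
}

/* Stand-in for the software path; a real one would also compute flags. */
static double soft_add(double a, double b, fp_status *s)
{
    s->soft_calls++;
    return a + b;
}

static double guarded_add(double a, double b, fp_status *s)
{
    assert(s);
    if (can_use_host_fpu(s) && zero_or_normal(a) && zero_or_normal(b)) {
        double r = a + b;
        /* Overflow or a tiny result could need other flags: redo in software. */
        if (zero_or_normal(r)) {
            return r;
        }
    }
    return soft_add(a, b, s);
}
```

With a fresh status (no flags set) every call takes the software path;
only after inexact has been recorded does the host addition get used.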

I think there will be a wariness to merge anything that only works for a
single frontend/backend combination. Running translated x86 on x86 is
not the common case for TCG ;-)

> There are still some significant tasks to be done, like proper
> handling of exception flags, edge cases, and testing, to name a few.

These are the things that make correct handling of floating point hard. 

> Once in a slightly more mature state, I do think this feature would
> make a natural addition to upstream QEMU and plan to submit it for
> consideration.
>
> I'm writing to the mailing list now to inform FPU maintainers and any
> other interested parties that this work is happening, to solicit any
> early feedback, and to extend an invitation to anyone interested in
> collaborating to expedite its upstreaming.

I'll happily review patches on the list that provide for an accelerated
FPU experience as long as the correctness is maintained.

> My initial TCG FP work can be found here:
> https://github.com/mborgerson/xemu/pull/464/commits
>
> Thanks,
> Matt


-- 
Alex Bennée



* Re: TCG Floating Point Support (Work in Progress)
  2021-09-30  9:13 ` Alex Bennée
@ 2021-10-01  2:47   ` Matt
  2021-10-01  8:03     ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Matt @ 2021-10-01  2:47 UTC
  To: Alex Bennée; +Cc: Peter Maydell, Richard Henderson, qemu-devel, aurelien

Thank you, Alex, for your quick and thoughtful response.

> I've not reviewed the code as it is a rather large diff. For your proper
> submission could you please ensure that your patch series is broken up
> into discrete commits to aid review.

Of course.

> The phrase "if the user discovers some issues" is a bit of a red flag.
> We should always be striving for correct emulation of floating point.

I agree. This is an option that I added for use during feature
development. Ultimately I would like not to have such an option, and
for it to always *just work*.

> Indeed we have recently got the code base to the point we pass all of
> the Berkeley softfloat test suite. This can be checked by running:
>
>   make check-softfloat
>
> However the test code links directly to the softfloat code so won't work
> with direct code execution.

I had planned to leverage the existing softfloat test suite, and I
think this can be done with the right harnessing. It would be nice to
have a mechanism to test translation of individual TCG ops, e.g. to be
able to run translated blocks from test code and evaluate their
output. I'm not sure whether any op-level testing like that is being
done. There are also guest tests in tests/tcg, which could be
expanded to include more FP tests.

> The existing 32/64 bit hardfloat
> optimisations work within the helpers. While generating direct code is
> appealing to avoid the cost of helper calls it's fairly well cached and
> predicted code. Experience with the initial hardfloat code showed there
> was still a performance win even with the cost of the helper call.

Unfortunately, even with the existing hardfloat support, the overhead
of the helper calls is still too costly for my particular application.

> I don't think you'll see the same behaviour emulating an x87 on anything
> that isn't an x87 because the boundaries for things like inexact
> calculation will be different. Indeed if you look at the existing
> hardfloat code function can_use_fpu() you will see we only call the host
> processor function if the inexact bit is already set. Other wrappers
> have even more checks for normal numbers. Anything that needs NaN
> handling will fall back to the correct softfloat code.

Fair points. Bit-perfect x87 emulation with this approach may
ultimately be unachievable, and I'm still evaluating the cases where it
will not work. This has been a learning experience for me, and I
gladly welcome expert input on this matter.
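
To make one such case concrete, here is a small sketch in C (function
names are hypothetical, chosen for the example). The x87 extended format
carries a 64-bit significand while an SSE2 double carries 53 bits, so
1 + 2^-60 is exactly representable in the former but must be rounded in
the latter. Assuming the guest runs with x87 precision control set to
extended, the x87 would not raise inexact for this sum, whereas a double
addition would. The long double check only applies where long double has
at least a 64-bit significand, hence the LDBL_MANT_DIG guard.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* True if adding b to a changed the value at double precision. */
static bool double_add_keeps(double a, double b)
{
    volatile double s = a + b;   /* volatile forces rounding to double */
    return s != a;
}

/* Same question at long double precision (80-bit x87 on x86-64 Linux). */
static bool extended_add_keeps(long double a, long double b)
{
    volatile long double s = a + b;
    return s != a;
}

static void precision_demo(void)
{
    /* 2^-60 is below half an ulp of 1.0 in double: the addend is lost. */
    assert(!double_add_keeps(1.0, 0x1p-60));
    /* With a significand of 64 bits or more, 1 + 2^-60 is exact. */
    if (LDBL_MANT_DIG >= 64) {
        assert(extended_add_keeps(1.0L, 0x1p-60L));
    }
}
```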

Personally, I would accept minor accuracy differences in exchange for a
significant performance advantage in emulation of game code, but not
for scientific applications, which I understand may diminish the upstream
appeal of this x87 translation work.

> I think there will be a wariness to merge anything that only works for a
> single frontend/backend combination. Running translated x86 on x86 is
> not the common case for TCG ;-)

Understood; initially this works on a single frontend/backend
combination, with support for other guests and hosts to follow. It
will be a long road, but my next step is to produce a translation for
AArch64 hosts.

> These are the things that make correct handling of floating point hard.

Agreed!

> I'll happily review patches on the list that provide for an accelerated
> FPU experience as long as the correctness is maintained.

Thank you!

Matt


* Re: TCG Floating Point Support (Work in Progress)
  2021-10-01  2:47   ` Matt
@ 2021-10-01  8:03     ` Alex Bennée
  2021-10-02  2:07       ` Matt
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2021-10-01  8:03 UTC
  To: Matt; +Cc: Peter Maydell, Richard Henderson, qemu-devel, aurelien


Matt <mborgerson@gmail.com> writes:

> Thank you Alex, for your quick and thoughtful response.
>
>> I've not reviewed the code as it is a rather large diff. For your proper
>> submission could you please ensure that your patch series is broken up
>> into discrete commits to aid review.
>
> Of course.
>
>> The phrase "if the user discovers some issues" is a bit of a red flag.
>> We should always be striving for correct emulation of floating point.
>
> I agree. This is an option that I added for use during feature
> development. Ultimately I would like not to have such an option, and
> for it to always *just work*.

The closest I can think of is the --accel tcg,thread=single|multi option
which allowed for verifying if an issue was related to MTTCG. However
the default would always do the right thing.

>
>> Indeed we have recently got the code base to the point we pass all of
>> the Berkeley softfloat test suite. This can be checked by running:
>>
>>   make check-softfloat
>>
>> However the test code links directly to the softfloat code so won't work
>> with direct code execution.
>
> I had planned to leverage the existing soft float test suite, and I
> think this can be done with the right harnessing. It would be nice to
> have a mechanism to test translation of individual TCG ops, e.g. be
> able to run translated blocks from test code and evaluate their
> output. I'm not sure if any such op level testing like that is being
> done.

Not at the moment but it would certainly be a useful addition for the
unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
how much test harness would be needed to exercise that though.
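
For what it's worth, the comparison half of such a harness is small. The
sketch below is plain C and does not touch TCG at all: it runs a
reference operation and a candidate over a table of operand pairs and
counts results that differ bit for bit. All names are hypothetical, and
narrow_add() is a deliberately wrong candidate included only to show a
mismatch being caught. Producing the candidate from translated TCG ops
is the part that would need real harness work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef double (*binop_fn)(double, double);

/* Raw bit pattern, so that NaN payloads and signed zeros are compared too. */
static uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Count operand pairs where candidate and reference disagree bitwise. */
static size_t count_mismatches(binop_fn ref, binop_fn cand,
                               const double (*pairs)[2], size_t n)
{
    assert(ref && cand && pairs);
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        double want = ref(pairs[i][0], pairs[i][1]);
        double got = cand(pairs[i][0], pairs[i][1]);
        if (bits_of(want) != bits_of(got)) {
            bad++;
        }
    }
    return bad;
}

/* Reference: ordinary double addition. */
static double ref_add(double a, double b)
{
    return a + b;
}

/* Deliberately wrong candidate: adds at single precision. */
static double narrow_add(double a, double b)
{
    volatile float s = (float)a + (float)b;
    return s;
}
```

Feeding it the pairs (1, 2), (1, 2^-40) and (0.5, 0.25) reports one
mismatch for narrow_add(), on the second pair, and none when the
reference is compared against itself.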

> There are also guest tests in tests/tcg, which could also be
> expanded to include more FP tests.

We have a number of multiarch tcg tests for fused multiply-add and the
various fconv operations. There is also quite an exhaustive set of i386
specific tests (test-i386-fprem) but it doesn't get run by default as
the "reference" output is too big to include in the tree and has to be
generated in-situ. You get it by adding SPEED=slow to your make
invocation.

>> The existing 32/64 bit hardfloat
>> optimisations work within the helpers. While generating direct code is
>> appealing to avoid the cost of helper calls it's fairly well cached and
>> predicted code. Experience with the initial hardfloat code showed there
>> was still a performance win even with the cost of the helper call.
>
> Unfortunately, even with the existing hardfloat support, the overhead
> of the helper calls is still too costly for my particular application.

Once you start dealing with flag generation you might find that equation
changes somewhat if you have to mess around with bit masking and checks
using TCG ops. However providing benchmark results with your patch would
be required to argue the point. You can run tests/fp/fp-bench -t host
under translation to exercise that.
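
To give a feel for the extra work, here is a sketch in C of deriving just
the inexact flag around a native addition without reading the host status
register. It uses Knuth's TwoSum error-free transformation, which is a
technique swapped in for illustration and not something taken from the
patch. add_track_inexact() is a hypothetical name; 0x20 is the precision
(PE) bit of the x87 status word. The sketch assumes round-to-nearest, no
overflow, and strict double evaluation (as with SSE2). Note that one
addition has become six floating point operations plus a compare and a
masked store, and that covers only one of the flags.

```c
#include <assert.h>
#include <stdint.h>

#define X87_SW_PE 0x20u  /* precision (inexact) flag, bit 5 of the status word */

/* Add natively, then recover the rounding error exactly with TwoSum.
 * A nonzero error means the sum was rounded, i.e. the result is inexact. */
static double add_track_inexact(double a, double b, uint16_t *status_word)
{
    assert(status_word);
    double s = a + b;
    double bv = s - a;                       /* the part of b that reached s */
    double err = (a - (s - bv)) + (b - bv);  /* exact value of (a + b) - s */
    if (err != 0.0) {
        *status_word |= X87_SW_PE;
    }
    return s;
}
```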

>
>> I don't think you'll see the same behaviour emulating an x87 on anything
>> that isn't an x87 because the boundaries for things like inexact
>> calculation will be different. Indeed if you look at the existing
>> hardfloat code function can_use_fpu() you will see we only call the host
>> processor function if the inexact bit is already set. Other wrappers
>> have even more checks for normal numbers. Anything that needs NaN
>> handling will fall back to the correct softfloat code.
>
> Fair points. Bit-perfect x87 emulation with this approach may be
> ultimately unachievable; and I'm still evaluating the cases when this
> will not work. This has been a learning experience for me, and I
> gladly welcome expert input in this matter.
>
> Personally, I would accept minor accuracy differences in trade for
> significant performance advantage in emulation of game code, but not
> for scientific applications, which I understand may diminish upstream
> appeal of this x87 translation work.

Out of interest what game code still uses x87? I know the classic Doom
and Quake benchmarks showed a performance regression when we switched to
softfloat:

  https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b

however I kinda assumed more modern games would be taking advantage of
SSE and later features. There are however some gaps in the x86
emulation that might mean code falls back to the x87. Maybe that would
be another area to look at.

>> I think there will be a wariness to merge anything that only works for a
>> single frontend/backend combination. Running translated x86 on x86 is
>> not the common case for TCG ;-)
>
> Understood; initially this works on a single frontend/backend
> combination, with eventual translation to other guests and hosts. It
> will be a long road, but my plan next is to produce a translation for
> AArch64 systems.
>
>> These are the things that make correct handling of floating point hard.
>
> Agreed!
>
>> I'll happily review patches on the list that provide for an accelerated
>> FPU experience as long as the correctness is maintained.
>
> Thank you!
>
> Matt


-- 
Alex Bennée



* Re: TCG Floating Point Support (Work in Progress)
  2021-10-01  8:03     ` Alex Bennée
@ 2021-10-02  2:07       ` Matt
  2022-03-09  3:48         ` gaosong
  0 siblings, 1 reply; 7+ messages in thread
From: Matt @ 2021-10-02  2:07 UTC
  To: Alex Bennée; +Cc: Peter Maydell, Richard Henderson, qemu-devel, aurelien

> Not at the moment but it would certainly be a useful addition for the
> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
> how much test harness would be needed to exercise that though.

On a related note, in addition to testing TCG->Host translation, it
would be nice to also have a way to make sure TCG->TCG optimization
passes are working as expected. Is there existing work in this area?


> We have a number of multiarch tcg tests for fused multiply-add and the
> various fconv operations. There is also quite an exhaustive set of i386
> specific tests (test-i386-fprem) but it doesn't get run by default as
> the "reference" output is too big to include in the tree and has to be
> generated in-situ. You get it by adding SPEED=slow to your make
> invocation. [...]
> You can run tests/fp/fp-bench -t host under translation to exercise that.

Thanks for the info! This will be useful.


> I know the classic Doom and Quake benchmarks showed a performance
> regression when we switched to softfloat:
>
>   https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b

That post was an interesting read, thanks for sharing!


> Out of interest what game code still uses x87? [...]
> however I kinda assumed more modern games would be taking advantage of
> SSE and later features. There are however some gaps in the x86
> emulation that might mean code falls back to the x87. Maybe that would
> be another area to look at.

This project is an emulator of the original Xbox game console, which
is now...twenty years old (time flies). The Xbox CPU (P3) does feature
SSE (not SSE2+), however most of the games I've tested for this
generation still make heavy use of x87.

I have seen at least one game make noticeable use of MMX/SSE features
though, which I also need to look at accelerating. The profiler indicates
they are also very costly. I have seen the TCG vector ops, which are a
very cool addition.

Matt


On Fri, Oct 1, 2021 at 1:24 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Matt <mborgerson@gmail.com> writes:
>
> > Thank you Alex, for your quick and thoughtful response.
> >
> >> I've not reviewed the code as it is a rather large diff. For your proper
> >> submission could you please ensure that your patch series is broken up
> >> into discreet commits to aid review.
> >
> > Of course.
> >
> >> The phrase "if the user discovers some issues" is a bit of a red flag.
> >> We should always be striving for correct emulation of floating point.
> >
> > I agree. This is an option that I added for use during feature
> > development. Ultimately I would like not to have such an option, and
> > for it to always *just work*.
>
> The closest I can think of is the --accel thread=single|multi option
> which allowed for verifying if an issue was related to MTTCG. However
> the default would always do the right thing.
>
> >
> >> Indeed we have recently got the code base to the point we pass all of
> >> the Berkey softfloat test suite. This can be checked by running:
> >>
> >>   make check-softfloat
> >>
> >> However the test code links directly to the softfloat code so won't work
> >> with direct code execution.
> >
> > I had planned to leverage the existing soft float test suite, and I
> > think this can be done with the right harnessing. It would be nice to
> > have a mechanism to test translation of individual TCG ops, e.g. be
> > able to run translated blocks from test code and evaluate their
> > output. I'm not sure if any such op level testing like that is being
> > done.
>
> Not at the moment but it would certainly be a useful addition for the
> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
> how much test harness would be needed to exercise that though.
>
> > There are also guest tests in tests/tcg, which could also be
> > expanded to include more FP tests.
>
> We have a number of multiarch tcg tests for fused multiply-add and the
> various fconv operations. There is also quite an exhaustive set of i386
> specific tests (test-i386-fprem) but it doesn't get run by default as
> the "reference" output is too big to include in the tree and has to be
> generated in-situ. You get it by adding SPEED=slow to your make
> invocation.
>
> >> The existing 32/64 bit hardfloat
> >> optimisations work within the helpers. While generating direct code is
> >> appealing to avoid the cost of helper calls it's fairly well cached and
> >> predicted code. Experience with the initial hardfloat code showed the
> >> was still a performance win even with the cost of the helper call.
> >
> > Unfortunately, even with the existing hardfloat support, the overhead
> > of the helper calls is still too costly for my particular application.
>
> Once you start dealing with flag generation you might find that equation
> changes somewhat if you have to mess around with bit masking and checks
> using TCG ops. However providing benchmark results with your patch would
> be required to argue the point. You can run tests/fp/fp-bench -t host
> under translation to exercise that.
>
> >
> >> I don't think you'll see the same behaviour emulating an x87 on anything
> >> that isn't an x87 because the boundaries for things like inexact
> >> calculation will be different. Indeed if you look at the existing
> >> hardfloat code function can_use_fpu() you will see we only call the host
> >> processor function if the inexact bit is already set. Other wrappers
> >> have even more checks for normal numbers. Anything that needs NaN
> >> handling will fallback to the correct softfloat code.
> >
> > Fair points. Bit-perfect x87 emulation with this approach may be
> > ultimately unachievable; and I'm still evaluating the cases when this
> > will not work. This has been a learning experience for me, and I
> > gladly welcome expert input in this matter.
> >
> > Personally, I would accept minor accuracy differences in trade for
> > significant performance advantage in emulation of game code, but not
> > for scientific applications, which I understand may diminish upstream
> > appeal of this x87 translation work.
>
> Out of interest what game code still uses x87? I know the classic Doom
> and Quake benchmarks showed a performance regression when we switched to
> softfloat:
>
>   https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b
>
> however I kinda assumed more modern games would be taking advantaged of
> SSE and later features. There is however some missing gaps in the x86
> emulation that might mean code falls back to the x87. Maybe that would
> be another area to look at.
>
> >> I think there will be a wariness to merge anything that only works for a
> >> single frontend/backend combination. Running translated x86 on x86 is
> >> not the common case for TCG ;-)
> >
> > Understood; initially this works on a single frontend/backend
> > combination, with eventual translation to other guests and hosts. It
> > will be a long road, but my plan next is to produce a translation for
> > AArch64 systems.
> >
> >> These are the things that make correct handling of floating point hard.
> >
> > Agreed!
> >
> >> I'll happily review patches on the list that provide for an accelerated
> >> FPU experience as long as the correctness is maintained.
> >
> > Thank you!
> >
> > Matt
> >
> > On Thu, Sep 30, 2021 at 2:38 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> >>
> >>
> >> Matt <mborgerson@gmail.com> writes:
> >>
> >> > Hello--
> >> >
> >> > I'm excited to share that I have been developing support for TCG
> >> > floating point operations; specifically, to accelerate emulation of
> >> > x86 guest code which heavily exercises the x87 FPU for a game console
> >> > emulator project based on QEMU. So far, this work has shown great
> >> > promise, demonstrating some dramatic performance improvements in
> >> > emulation of x87 heavy code.
> >>
> >> I've not reviewed the code as it is a rather large diff. For your proper
> >> submission could you please ensure that your patch series is broken up
> >> into discrete commits to aid review. It also aids bisection if
> >> regressions are identified.
> >>
> >> > The feature works in concert with unaccelerated x87 FPU helpers, and
> >> > also allows total soft float helper fallback if the user discovers
> >> > some issue with the hard float implementation.
> >>
> >> The phrase "if the user discovers some issues" is a bit of a red flag.
> >> We should always be striving for correct emulation of floating point.
> >> Indeed we have recently got the code base to the point we pass all of
> >> the Berkeley softfloat test suite. This can be checked by running:
> >>
> >>   make check-softfloat
> >>
> >> However the test code links directly to the softfloat code so won't work
> >> with direct code execution. The existing 32/64 bit hardfloat
> >> optimisations work within the helpers. While generating direct code is
> >> appealing to avoid the cost of helper calls it's fairly well cached and
> >> predicted code. Experience with the initial hardfloat code showed there
> >> was still a performance win even with the cost of the helper call.
> >>
> >> > For the TCG target,
> >> > I've opted to implement it for x86-64 hosts using SSE2, although this
> >> > could be extended to support full 80b double extended precision with
> >> > host x87 support. I'm also in early development of an implementation
> >> > for AArch64 hosts.
> >>
> >> I don't think you'll see the same behaviour emulating an x87 on anything
> >> that isn't an x87 because the boundaries for things like inexact
> >> calculation will be different. Indeed if you look at the existing
> >> hardfloat code function can_use_fpu() you will see we only call the host
> >> processor function if the inexact bit is already set. Other wrappers
> >> have even more checks for normal numbers. Anything that needs NaN
> >> handling will fall back to the correct softfloat code.
> >>
> >> I think there will be a wariness to merge anything that only works for a
> >> single frontend/backend combination. Running translated x86 on x86 is
> >> not the common case for TCG ;-)
> >>
> >> > There are still some significant tasks to be done, like proper
> >> > handling of exception flags, edge cases, and testing, to name a few.
> >>
> >> These are the things that make correct handling of floating point hard.
> >>
> >> > Once in a slightly more mature state, I do think this feature would
> >> > make a natural addition to upstream QEMU and plan to submit it for
> >> > consideration.
> >> >
> >> > I'm writing to the mailing list now to inform FPU maintainers and any
> >> > other interested parties that this work is happening, to solicit any
> >> > early feedback, and to extend an invitation to anyone interested in
> >> > collaborating to expedite its upstreaming.
> >>
> >> I'll happily review patches on the list that provide for an accelerated
> >> FPU experience as long as the correctness is maintained.
> >>
> >> > My initial TCG FP work can be found here:
> >> > https://github.com/mborgerson/xemu/pull/464/commits
> >> >
> >> > Thanks,
> >> > Matt
> >>
> >>
> >> --
> >> Alex Bennée
>
>
> --
> Alex Bennée
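
On the inexact-boundary point quoted above, one way to see why an x87 guest and an SSE2 host can disagree about the inexact flag is to compare significand widths. What follows is only a minimal sketch and not QEMU code: the helper name `fits_in_significand` and the two macros are hypothetical, and the widths assumed are the usual 64 bits for x87 double extended and 53 bits for IEEE double.

```c
#include <stdbool.h>
#include <stdint.h>

/* Significand widths in bits, counting the leading 1 (assumed values). */
#define X87_EXTENDED_BITS 64
#define IEEE_DOUBLE_BITS  53

/*
 * Hypothetical helper: returns true if the integer `value` can be
 * represented exactly with a significand of `bits` bits, i.e. the span
 * from its highest set bit to its lowest set bit fits in that width.
 */
static bool fits_in_significand(uint64_t value, int bits)
{
    if (value == 0) {
        return true;
    }
    /* Trailing zeros are absorbed by the exponent, so drop them. */
    while ((value & 1) == 0) {
        value >>= 1;
    }
    /* Count the remaining significant bits. */
    int width = 0;
    while (value != 0) {
        width++;
        value >>= 1;
    }
    return width <= bits;
}
```

For instance, 2^53 + 1 needs 54 significant bits. It is exact in double extended precision but has to be rounded in double, so only the double computation would raise inexact for that result.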



* Re: TCG Floating Point Support (Work in Progress)
  2021-10-02  2:07       ` Matt
@ 2022-03-09  3:48         ` gaosong
  0 siblings, 0 replies; 7+ messages in thread
From: gaosong @ 2022-03-09  3:48 UTC (permalink / raw)
  To: Matt, Alex Bennée
  Cc: Peter Maydell, maobibo, Richard Henderson, qemu-devel, aurelien


  On 2021/10/2 10:07 AM, Matt wrote:

>> Not at the moment but it would certainly be a useful addition for the
>> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
>> how much test harness would be needed to exercise that though.
> On a related note, in addition to testing TCG->Host translation, it
> would be nice to also have a way to make sure TCG->TCG optimization
> passes are working as expected. Is there existing work in this area?
>
>
>> We have a number of multiarch tcg tests for fused multiply-add and the
>> various fconv operations. There is also quite an exhaustive set of i386
>> specific tests (test-i386-fprem) but it doesn't get run by default as
>> the "reference" output is too big to include in the tree and has to be
>> generated in-situ. You get it by adding SPEED=slow to your make
>> invocation. [...]
>> You can run tests/fp/fp-bench -t host under translation to exercise that.
> Thanks for the info! This will be useful.
>
>
>> I know the classic Doom and Quake benchmarks showed a performance
>> regression when we switched to softfloat:
>>
>>    https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b
> That post was an interesting read, thanks for sharing!
>
>
>> Out of interest what game code still uses x87? [...]
>> however I kinda assumed more modern games would be taking advantage of
>> SSE and later features. There are however some gaps in the x86
>> emulation that might mean code falls back to the x87. Maybe that would
>> be another area to look at.
> This project is an emulator of the original Xbox game console, which
> is now...twenty years old (time flies). The Xbox CPU (P3) does feature
> SSE (not SSE2+), however most of the games I've tested for this
> generation still make heavy use of x87.
>
> I have seen at least one game make noticeable use of MMX/SSE features
> though, which I also need to look at accelerating. Profiler indicates
> they are also very costly. I have seen the TCG vector ops, which are a
> very cool addition.
>
> Matt
>
>
> On Fri, Oct 1, 2021 at 1:24 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Matt <mborgerson@gmail.com> writes:
>>
>>> Thank you Alex, for your quick and thoughtful response.
>>>
>>>> I've not reviewed the code as it is a rather large diff. For your proper
>>>> submission could you please ensure that your patch series is broken up
>>>> into discrete commits to aid review.
>>> Of course.
>>>
>>>> The phrase "if the user discovers some issues" is a bit of a red flag.
>>>> We should always be striving for correct emulation of floating point.
>>> I agree. This is an option that I added for use during feature
>>> development. Ultimately I would like not to have such an option, and
>>> for it to always *just work*.
>> The closest I can think of is the --accel thread=single|multi option
>> which allowed for verifying if an issue was related to MTTCG. However
>> the default would always do the right thing.
>>
>>>> Indeed we have recently got the code base to the point we pass all of
>>>> the Berkeley softfloat test suite. This can be checked by running:
>>>>
>>>>    make check-softfloat
>>>>
>>>> However the test code links directly to the softfloat code so won't work
>>>> with direct code execution.
>>> I had planned to leverage the existing soft float test suite, and I
>>> think this can be done with the right harnessing. It would be nice to
>>> have a mechanism to test translation of individual TCG ops, e.g. be
>>> able to run translated blocks from test code and evaluate their
>>> output. I'm not sure if any such op level testing like that is being
>>> done.
>> Not at the moment but it would certainly be a useful addition for the
>> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
>> how much test harness would be needed to exercise that though.
>>
>>> There are also guest tests in tests/tcg, which could also be
>>> expanded to include more FP tests.
>> We have a number of multiarch tcg tests for fused multiply-add and the
>> various fconv operations. There is also quite an exhaustive set of i386
>> specific tests (test-i386-fprem) but it doesn't get run by default as
>> the "reference" output is too big to include in the tree and has to be
>> generated in-situ. You get it by adding SPEED=slow to your make
>> invocation.
>>
>>>> The existing 32/64 bit hardfloat
>>>> optimisations work within the helpers. While generating direct code is
>>>> appealing to avoid the cost of helper calls it's fairly well cached and
>>>> predicted code. Experience with the initial hardfloat code showed there
>>>> was still a performance win even with the cost of the helper call.
>>> Unfortunately, even with the existing hardfloat support, the overhead
>>> of the helper calls is still too costly for my particular application.
>> Once you start dealing with flag generation you might find that equation
>> changes somewhat if you have to mess around with bit masking and checks
>> using TCG ops. However providing benchmark results with your patch would
>> be required to argue the point. You can run tests/fp/fp-bench -t host
>> under translation to exercise that.
>>
>>>> I don't think you'll see the same behaviour emulating an x87 on anything
>>>> that isn't an x87 because the boundaries for things like inexact
>>>> calculation will be different. Indeed if you look at the existing
>>>> hardfloat code function can_use_fpu() you will see we only call the host
>>>> processor function if the inexact bit is already set. Other wrappers
>>>> have even more checks for normal numbers. Anything that needs NaN
>>>> handling will fall back to the correct softfloat code.
>>> Fair points. Bit-perfect x87 emulation with this approach may be
>>> ultimately unachievable; and I'm still evaluating the cases when this
>>> will not work. This has been a learning experience for me, and I
>>> gladly welcome expert input in this matter.
>>>
>>> Personally, I would accept minor accuracy differences in exchange for a
>>> significant performance advantage in emulation of game code, but not
>>> for scientific applications, which I understand may diminish upstream
>>> appeal of this x87 translation work.
>> Out of interest what game code still uses x87? I know the classic Doom
>> and Quake benchmarks showed a performance regression when we switched to
>> softfloat:
>>
>>    https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b
>>
>> however I kinda assumed more modern games would be taking advantage of
>> SSE and later features. There are however some gaps in the x86
>> emulation that might mean code falls back to the x87. Maybe that would
>> be another area to look at.
>>
>>>> I think there will be a wariness to merge anything that only works for a
>>>> single frontend/backend combination. Running translated x86 on x86 is
>>>> not the common case for TCG ;-)
>>> Understood; initially this works on a single frontend/backend
>>> combination, with eventual translation to other guests and hosts. It
>>> will be a long road, but my plan next is to produce a translation for
>>> AArch64 systems.

Hi, Matt

We have read the patch you shared, and we are interested in the work you are doing.
Have you already added support for AArch64 systems?

>>>> These are the things that make correct handling of floating point hard.
>>> Agreed!

Hi, Alex

Are there plans for TCG to support hardware floating point?

Thanks
Song

>>>> I'll happily review patches on the list that provide for an accelerated
>>>> FPU experience as long as the correctness is maintained.
>>> Thank you!
>>>
>>> Matt
>>>
>>> On Thu, Sep 30, 2021 at 2:38 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>>>
>>>> Matt <mborgerson@gmail.com> writes:
>>>>
>>>>> Hello--
>>>>>
>>>>> I'm excited to share that I have been developing support for TCG
>>>>> floating point operations; specifically, to accelerate emulation of
>>>>> x86 guest code which heavily exercises the x87 FPU for a game console
>>>>> emulator project based on QEMU. So far, this work has shown great
>>>>> promise, demonstrating some dramatic performance improvements in
>>>>> emulation of x87 heavy code.
>>>> I've not reviewed the code as it is a rather large diff. For your proper
>>>> submission could you please ensure that your patch series is broken up
>>>> into discrete commits to aid review. It also aids bisection if
>>>> regressions are identified.
>>>>
>>>>> The feature works in concert with unaccelerated x87 FPU helpers, and
>>>>> also allows total soft float helper fallback if the user discovers
>>>>> some issue with the hard float implementation.
>>>> The phrase "if the user discovers some issues" is a bit of a red flag.
>>>> We should always be striving for correct emulation of floating point.
>>>> Indeed we have recently got the code base to the point we pass all of
>>>> the Berkeley softfloat test suite. This can be checked by running:
>>>>
>>>>    make check-softfloat
>>>>
>>>> However the test code links directly to the softfloat code so won't work
>>>> with direct code execution. The existing 32/64 bit hardfloat
>>>> optimisations work within the helpers. While generating direct code is
>>>> appealing to avoid the cost of helper calls it's fairly well cached and
>>>> predicted code. Experience with the initial hardfloat code showed there
>>>> was still a performance win even with the cost of the helper call.
>>>>
>>>>> For the TCG target,
>>>>> I've opted to implement it for x86-64 hosts using SSE2, although this
>>>>> could be extended to support full 80b double extended precision with
>>>>> host x87 support. I'm also in early development of an implementation
>>>>> for AArch64 hosts.
>>>> I don't think you'll see the same behaviour emulating an x87 on anything
>>>> that isn't an x87 because the boundaries for things like inexact
>>>> calculation will be different. Indeed if you look at the existing
>>>> hardfloat code function can_use_fpu() you will see we only call the host
>>>> processor function if the inexact bit is already set. Other wrappers
>>>> have even more checks for normal numbers. Anything that needs NaN
>>>> handling will fall back to the correct softfloat code.
>>>>
>>>> I think there will be a wariness to merge anything that only works for a
>>>> single frontend/backend combination. Running translated x86 on x86 is
>>>> not the common case for TCG ;-)
>>>>
>>>>> There are still some significant tasks to be done, like proper
>>>>> handling of exception flags, edge cases, and testing, to name a few.
>>>> These are the things that make correct handling of floating point hard.
>>>>
>>>>> Once in a slightly more mature state, I do think this feature would
>>>>> make a natural addition to upstream QEMU and plan to submit it for
>>>>> consideration.
>>>>>
>>>>> I'm writing to the mailing list now to inform FPU maintainers and any
>>>>> other interested parties that this work is happening, to solicit any
>>>>> early feedback, and to extend an invitation to anyone interested in
>>>>> collaborating to expedite its upstreaming.
>>>> I'll happily review patches on the list that provide for an accelerated
>>>> FPU experience as long as the correctness is maintained.
>>>>
>>>>> My initial TCG FP work can be found here:
>>>>> https://github.com/mborgerson/xemu/pull/464/commits
>>>>>
>>>>> Thanks,
>>>>> Matt
>>>>
>>>> --
>>>> Alex Bennée
>>
>> --
>> Alex Bennée
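
The gating described in the quoted text (only call the host FPU when the inexact bit is already set, check for normal numbers, otherwise fall back to softfloat) can be sketched roughly as below. This is a simplified illustration under stated assumptions and not QEMU's actual code: every `demo_` name and type is hypothetical, the softfloat path is a stub, and any further conditions the real `can_use_fpu()` may check are left out.

```c
#include <float.h>
#include <stdbool.h>

/* Illustrative stand-ins for guest FP state; not QEMU's definitions. */
enum { DEMO_FLAG_INEXACT = 1 << 0 };

struct demo_fp_status {
    unsigned exception_flags;   /* sticky, accumulated exception flags */
};

/*
 * The host FPU is usable only if inexact is already set: a host result
 * that happens to be inexact then cannot change the visible flags.
 */
static bool demo_can_use_fpu(const struct demo_fp_status *s)
{
    return (s->exception_flags & DEMO_FLAG_INEXACT) != 0;
}

/* True for zero and normal numbers; false for NaN, Inf and denormals. */
static bool demo_zero_or_normal(double x)
{
    double mag = x < 0 ? -x : x;
    return x == 0.0 || (mag >= DBL_MIN && mag <= DBL_MAX);
}

/*
 * Stub for the slow path. A real implementation would compute both the
 * result and the exception flags in software; this stub does not.
 */
static double demo_soft_add(double a, double b, struct demo_fp_status *s)
{
    (void)s;
    return a + b;
}

/* Add with a guarded fast path; *used_host reports which path ran. */
static double demo_add(double a, double b, struct demo_fp_status *s,
                       bool *used_host)
{
    if (demo_can_use_fpu(s) &&
        demo_zero_or_normal(a) && demo_zero_or_normal(b)) {
        double r = a + b;
        /* Overflow or underflow would need flag updates: fall back. */
        if (demo_zero_or_normal(r)) {
            *used_host = true;
            return r;
        }
    }
    *used_host = false;
    return demo_soft_add(a, b, s);
}
```

With a fresh status (no flags set) every operation takes the slow path. The fast path only becomes eligible once some earlier operation has set inexact, which is also why emitting the checks inline as TCG ops may eat into the gain from avoiding the helper call.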



end of thread, other threads:[~2022-03-09  3:50 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
2021-09-30  5:39 TCG Floating Point Support (Work in Progress) Matt
2021-09-30  7:30 ` Matt
2021-09-30  9:13 ` Alex Bennée
2021-10-01  2:47   ` Matt
2021-10-01  8:03     ` Alex Bennée
2021-10-02  2:07       ` Matt
2022-03-09  3:48         ` gaosong
