On 2021/10/2 10:07 AM, Matt wrote:

>> Not at the moment but it would certainly be a useful addition for the
>> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
>> how much test harness would be needed to exercise that though.
> On a related note, in addition to testing TCG->Host translation, it
> would be nice to also have a way to make sure TCG->TCG optimization
> passes are working as expected. Is there existing work in this area?
>
>
>> We have a number of multiarch tcg tests for fused multiply-add and the
>> various fconv operations. There is also quite an exhaustive set of i386
>> specific tests (test-i386-fprem) but it doesn't get run by default as
>> the "reference" output is too big to include in the tree and has to be
>> generated in-situ. You get it by adding SPEED=slow to your make
>> invocation. [...]
>> You can run tests/fp/fp-bench -t host under translation to exercise that.
> Thanks for the info! This will be useful.
>
>
>> I know the classic Doom and Quake benchmarks showed a performance
>> regression when we switched to softfloat:
>>
>>    https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b
> That post was an interesting read, thanks for sharing!
>
>
>> Out of interest what game code still uses x87? [...]
>> however I kinda assumed more modern games would be taking advantage of
>> SSE and later features. There are however some gaps in the x86
>> emulation that might mean code falls back to the x87. Maybe that would
>> be another area to look at.
> This project is an emulator of the original Xbox game console, which
> is now...twenty years old (time flies). The Xbox CPU (P3) does feature
> SSE (not SSE2+), however most of the games I've tested for this
> generation still make heavy use of x87.
>
> I have seen at least one game make noticeable use of MMX/SSE features
> though, which I also need to look at accelerating. Profiler indicates
> they are also very costly. I have seen the TCG vector ops, which are a
> very cool addition.
>
> Matt
>
>
> On Fri, Oct 1, 2021 at 1:24 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Matt <mborgerson@gmail.com> writes:
>>
>>> Thank you Alex, for your quick and thoughtful response.
>>>
>>>> I've not reviewed the code as it is a rather large diff. For your proper
>>>> submission could you please ensure that your patch series is broken up
>>>> into discrete commits to aid review.
>>> Of course.
>>>
>>>> The phrase "if the user discovers some issues" is a bit of a red flag.
>>>> We should always be striving for correct emulation of floating point.
>>> I agree. This is an option that I added for use during feature
>>> development. Ultimately I would like not to have such an option, and
>>> for it to always *just work*.
>> The closest I can think of is the --accel thread=single|multi option
>> which allowed for verifying if an issue was related to MTTCG. However
>> the default would always do the right thing.
>>
>>>> Indeed we have recently got the code base to the point we pass all of
>>>> the Berkeley softfloat test suite. This can be checked by running:
>>>>
>>>>    make check-softfloat
>>>>
>>>> However the test code links directly to the softfloat code so won't work
>>>> with direct code execution.
>>> I had planned to leverage the existing soft float test suite, and I
>>> think this can be done with the right harnessing. It would be nice to
>>> have a mechanism to test translation of individual TCG ops, e.g. be
>>> able to run translated blocks from test code and evaluate their
>>> output. I'm not sure if any such op level testing like that is being
>>> done.
>> Not at the moment but it would certainly be a useful addition for the
>> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
>> how much test harness would be needed to exercise that though.
>>
>>> There are also guest tests in tests/tcg, which could also be
>>> expanded to include more FP tests.
>> We have a number of multiarch tcg tests for fused multiply-add and the
>> various fconv operations. There is also quite an exhaustive set of i386
>> specific tests (test-i386-fprem) but it doesn't get run by default as
>> the "reference" output is too big to include in the tree and has to be
>> generated in-situ. You get it by adding SPEED=slow to your make
>> invocation.
>>
>>>> The existing 32/64 bit hardfloat
>>>> optimisations work within the helpers. While generating direct code is
>>>> appealing to avoid the cost of helper calls it's fairly well cached and
>>>> predicted code. Experience with the initial hardfloat code showed there
>>>> was still a performance win even with the cost of the helper call.
>>> Unfortunately, even with the existing hardfloat support, the overhead
>>> of the helper calls is still too costly for my particular application.
>> Once you start dealing with flag generation you might find that equation
>> changes somewhat if you have to mess around with bit masking and checks
>> using TCG ops. However providing benchmark results with your patch would
>> be required to argue the point. You can run tests/fp/fp-bench -t host
>> under translation to exercise that.
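
To make the point about flag generation concrete: when the arithmetic runs on the host's SSE unit, the exception state comes back in MXCSR and has to be translated into whatever encoding the emulator keeps for the guest. Below is a minimal sketch in C, not code from QEMU or from the patch. The MXCSR bit positions are the architectural ones; the SKETCH_FLAG_* values and the function name are hypothetical, invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* MXCSR exception-status bits (architectural x86 layout, bits 0..5). */
#define MXCSR_IE 0x01u /* invalid operation */
#define MXCSR_DE 0x02u /* denormal operand */
#define MXCSR_ZE 0x04u /* divide by zero */
#define MXCSR_OE 0x08u /* overflow */
#define MXCSR_UE 0x10u /* underflow */
#define MXCSR_PE 0x20u /* precision (inexact) */

/* Hypothetical guest-side flag encoding, made up for this sketch. */
enum {
    SKETCH_FLAG_INVALID        = 1 << 0,
    SKETCH_FLAG_DIVBYZERO      = 1 << 1,
    SKETCH_FLAG_OVERFLOW       = 1 << 2,
    SKETCH_FLAG_UNDERFLOW      = 1 << 3,
    SKETCH_FLAG_INEXACT        = 1 << 4,
    SKETCH_FLAG_INPUT_DENORMAL = 1 << 5,
};

/* Translate host MXCSR status bits into the sketch's flag encoding. */
static unsigned sketch_flags_from_mxcsr(uint32_t mxcsr)
{
    unsigned flags = 0;

    /* Bits 16..31 of MXCSR are reserved and read as zero. */
    assert((mxcsr >> 16) == 0);

    if (mxcsr & MXCSR_IE) { flags |= SKETCH_FLAG_INVALID; }
    if (mxcsr & MXCSR_DE) { flags |= SKETCH_FLAG_INPUT_DENORMAL; }
    if (mxcsr & MXCSR_ZE) { flags |= SKETCH_FLAG_DIVBYZERO; }
    if (mxcsr & MXCSR_OE) { flags |= SKETCH_FLAG_OVERFLOW; }
    if (mxcsr & MXCSR_UE) { flags |= SKETCH_FLAG_UNDERFLOW; }
    if (mxcsr & MXCSR_PE) { flags |= SKETCH_FLAG_INEXACT; }
    return flags;
}
```

In generated code, each guest FP instruction that can raise exceptions would need a read of MXCSR plus some form of this masking, and that extra work has to be weighed against the helper call it replaces.
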
>>
>>>> I don't think you'll see the same behaviour emulating an x87 on anything
>>>> that isn't an x87 because the boundaries for things like inexact
>>>> calculation will be different. Indeed if you look at the existing
>>>> hardfloat code function can_use_fpu() you will see we only call the host
>>>> processor function if the inexact bit is already set. Other wrappers
>>>> have even more checks for normal numbers. Anything that needs NaN
>>>> handling will fall back to the correct softfloat code.
>>> Fair points. Bit-perfect x87 emulation with this approach may be
>>> ultimately unachievable; and I'm still evaluating the cases when this
>>> will not work. This has been a learning experience for me, and I
>>> gladly welcome expert input in this matter.
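
A small illustration of why the results can differ: the x87 extended format has a 64-bit significand, while a double computed with SSE2 has 53 bits, so a sum that is exact on the x87 can be rounded (and therefore inexact) on the host. The sketch below is plain C, not code from the patch; the function names are made up for this example, and it assumes `long double` is at least as wide as the x87 format, which is checked via LDBL_MANT_DIG.

```c
#include <assert.h>
#include <float.h>

/* Add in binary64; the volatile store forces rounding to double. */
static double add_as_double(double a, double b)
{
    volatile double r = a + b;
    return r;
}

/* Add in the platform's long double (80-bit extended on x86 Linux). */
static long double add_as_extended(long double a, long double b)
{
    volatile long double r = a + b;
    return r;
}

/* 1 + 2^-60 needs 61 significand bits: exact in extended, rounded in double. */
static void precision_demo(void)
{
    assert(add_as_double(1.0, 0x1p-60) == 1.0);
    if (LDBL_MANT_DIG >= 64) {
        assert(add_as_extended(1.0L, 0x1p-60L) != 1.0L);
    }
}
```

A guest that computes this sum on a real x87 in extended-precision mode would see no inexact exception, whereas a host double addition would round the result and raise one.
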
>>>
>>> Personally, I would accept minor accuracy differences in trade for
>>> significant performance advantage in emulation of game code, but not
>>> for scientific applications, which I understand may diminish upstream
>>> appeal of this x87 translation work.
>> Out of interest what game code still uses x87? I know the classic Doom
>> and Quake benchmarks showed a performance regression when we switched to
>> softfloat:
>>
>>    https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b
>>
>> however I kinda assumed more modern games would be taking advantage of
>> SSE and later features. There are however some gaps in the x86
>> emulation that might mean code falls back to the x87. Maybe that would
>> be another area to look at.
>>
>>>> I think there will be a wariness to merge anything that only works for a
>>>> single frontend/backend combination. Running translated x86 on x86 is
>>>> not the common case for TCG ;-)
>>> Understood; initially this works on a single frontend/backend
>>> combination, with eventual translation to other guests and hosts. It
>>> will be a long road, but my plan next is to produce a translation for
>>> AArch64 systems.

Hi, Matt

We have read the patch you shared, and we are interested in your work.
Do you already support AArch64 systems?

>>>> These are the things that make correct handling of floating point hard.
>>> Agreed!

Hi, Alex

Are there plans for TCG to support hardware floating point?

Thanks
Song

>>>> I'll happily review patches on the list that provide for an accelerated
>>>> FPU experience as long as the correctness is maintained.
>>> Thank you!
>>>
>>> Matt
>>>
>>> On Thu, Sep 30, 2021 at 2:38 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>>>
>>>> Matt <mborgerson@gmail.com> writes:
>>>>
>>>>> Hello--
>>>>>
>>>>> I'm excited to share that I have been developing support for TCG
>>>>> floating point operations; specifically, to accelerate emulation of
>>>>> x86 guest code which heavily exercises the x87 FPU for a game console
>>>>> emulator project based on QEMU. So far, this work has shown great
>>>>> promise, demonstrating some dramatic performance improvements in
>>>>> emulation of x87 heavy code.
>>>> I've not reviewed the code as it is a rather large diff. For your proper
>>>> submission could you please ensure that your patch series is broken up
>>>> into discrete commits to aid review. It also aids bisection if
>>>> regressions are identified.
>>>>
>>>>> The feature works in concert with unaccelerated x87 FPU helpers, and
>>>>> also allows total soft float helper fallback if the user discovers
>>>>> some issue with the hard float implementation.
>>>> The phrase "if the user discovers some issues" is a bit of a red flag.
>>>> We should always be striving for correct emulation of floating point.
>>>> Indeed we have recently got the code base to the point we pass all of
>>>> the Berkeley softfloat test suite. This can be checked by running:
>>>>
>>>>    make check-softfloat
>>>>
>>>> However the test code links directly to the softfloat code so won't work
>>>> with direct code execution. The existing 32/64 bit hardfloat
>>>> optimisations work within the helpers. While generating direct code is
>>>> appealing to avoid the cost of helper calls it's fairly well cached and
>>>> predicted code. Experience with the initial hardfloat code showed there
>>>> was still a performance win even with the cost of the helper call.
>>>>
>>>>> For the TCG target,
>>>>> I've opted to implement it for x86-64 hosts using SSE2, although this
>>>>> could be extended to support full 80b double extended precision with
>>>>> host x87 support. I'm also in early development of an implementation
>>>>> for AArch64 hosts.
>>>> I don't think you'll see the same behaviour emulating an x87 on anything
>>>> that isn't an x87 because the boundaries for things like inexact
>>>> calculation will be different. Indeed if you look at the existing
>>>> hardfloat code function can_use_fpu() you will see we only call the host
>>>> processor function if the inexact bit is already set. Other wrappers
>>>> have even more checks for normal numbers. Anything that needs NaN
>>>> handling will fall back to the correct softfloat code.
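
The guard described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the QEMU source: every `sk_` name is hypothetical, the rounding-mode condition is an extra assumption of this sketch, and `sk_soft_add` is only a placeholder standing in for a real softfloat routine. The idea is that the host result cannot cheaply say whether it was rounded, so the fast path is only safe once the sticky inexact flag is already set.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

enum { SK_FLAG_INEXACT = 1 << 5 };              /* hypothetical encoding */
enum sk_round { SK_ROUND_NEAREST_EVEN, SK_ROUND_TO_ZERO };

struct sk_status {
    unsigned exception_flags;                   /* sticky guest flags */
    enum sk_round rounding;
};

static int sk_soft_calls;                       /* counts slow-path uses */

/* Placeholder for a real softfloat add; only the control flow matters. */
static double sk_soft_add(double a, double b, struct sk_status *s)
{
    (void)s;
    sk_soft_calls++;
    return a + b;
}

/* True for zero and normal numbers; false for NaN, infinity, subnormals. */
static bool sk_is_zero_or_normal(double x)
{
    double ax = x < 0 ? -x : x;
    return x == 0.0 || (ax >= DBL_MIN && ax <= DBL_MAX);
}

/* Fast path is allowed only if inexact is already set (sketch of the guard). */
static bool sk_can_use_fpu(const struct sk_status *s)
{
    assert(s);
    return (s->exception_flags & SK_FLAG_INEXACT) &&
           s->rounding == SK_ROUND_NEAREST_EVEN;
}

static double sk_add(double a, double b, struct sk_status *s)
{
    if (!sk_can_use_fpu(s) ||
        !sk_is_zero_or_normal(a) || !sk_is_zero_or_normal(b)) {
        return sk_soft_add(a, b, s);            /* NaN, denormal, etc. */
    }
    double r = a + b;                           /* host FPU does the work */
    if (!sk_is_zero_or_normal(r)) {
        return sk_soft_add(a, b, s);            /* let soft path set flags */
    }
    return r;
}
```

With a fresh status (no flags set), the first inexact-capable operation always takes the slow path; only once the guest has accumulated the inexact flag do later operations on well-behaved inputs reach the host FPU.
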
>>>>
>>>> I think there will be a wariness to merge anything that only works for a
>>>> single frontend/backend combination. Running translated x86 on x86 is
>>>> not the common case for TCG ;-)
>>>>
>>>>> There are still some significant tasks to be done, like proper
>>>>> handling of exception flags, edge cases, and testing, to name a few.
>>>> These are the things that make correct handling of floating point hard.
>>>>
>>>>> Once in a slightly more mature state, I do think this feature would
>>>>> make a natural addition to upstream QEMU and plan to submit it for
>>>>> consideration.
>>>>>
>>>>> I'm writing to the mailing list now to inform FPU maintainers and any
>>>>> other interested parties that this work is happening, to solicit any
>>>>> early feedback, and to extend an invitation to anyone interested in
>>>>> collaborating to expedite its upstreaming.
>>>> I'll happily review patches on the list that provide for an accelerated
>>>> FPU experience as long as the correctness is maintained.
>>>>
>>>>> My initial TCG FP work can be found here:
>>>>> https://github.com/mborgerson/xemu/pull/464/commits
>>>>>
>>>>> Thanks,
>>>>> Matt
>>>>
>>>> --
>>>> Alex Bennée
>>
>> --
>> Alex Bennée