On 2021/10/2 10:07 AM, Matt wrote:
>> Not at the moment but it would certainly be a useful addition for the
>> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
>> how much test harness would be needed to exercise that though.
> On a related note, in addition to testing TCG->Host translation, it
> would be nice to also have a way to make sure TCG->TCG optimization
> passes are working as expected. Is there existing work in this area?
>
>
>> We have a number of multiarch tcg tests for fused multiply-add and the
>> various fconv operations. There is also quite an exhaustive set of i386
>> specific tests (test-i386-fprem) but it doesn't get run by default as
>> the "reference" output is too big to include in the tree and has to be
>> generated in-situ. You get it by adding SPEED=slow to your make
>> invocation. [...]
>> You can run tests/fp/fp-bench -t host under translation to exercise that.
> Thanks for the info! This will be useful.
>
>
>> I know the classic Doom and Quake benchmarks showed a performance
>> regression when we switched to softfloat:
>>
>> https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b
> That post was an interesting read, thanks for sharing!
>
>
>> Out of interest what game code still uses x87? [...]
>> however I kinda assumed more modern games would be taking advantage of
>> SSE and later features. There are however some gaps in the x86
>> emulation that might mean code falls back to the x87. Maybe that would
>> be another area to look at.
> This project is an emulator of the original Xbox game console, which
> is now...twenty years old (time flies). The Xbox CPU (P3) does feature
> SSE (not SSE2+), however most of the games I've tested for this
> generation still make heavy use of x87.
>
> I have seen at least one game make noticeable use of MMX/SSE features
> though, which I also need to look at accelerating. Profiler indicates
> they are also very costly.
> I have seen the TCG vector ops, which are a
> very cool addition.
>
> Matt
>
>
> On Fri, Oct 1, 2021 at 1:24 AM Alex Bennée wrote:
>>
>> Matt writes:
>>
>>> Thank you Alex, for your quick and thoughtful response.
>>>
>>>> I've not reviewed the code as it is a rather large diff. For your proper
>>>> submission could you please ensure that your patch series is broken up
>>>> into discrete commits to aid review.
>>> Of course.
>>>
>>>> The phrase "if the user discovers some issues" is a bit of a red flag.
>>>> We should always be striving for correct emulation of floating point.
>>> I agree. This is an option that I added for use during feature
>>> development. Ultimately I would like not to have such an option, and
>>> for it to always *just work*.
>> The closest I can think of is the --accel thread=single|multi option
>> which allowed for verifying if an issue was related to MTTCG. However
>> the default would always do the right thing.
>>
>>>> Indeed we have recently got the code base to the point we pass all of
>>>> the Berkeley softfloat test suite. This can be checked by running:
>>>>
>>>> make check-softfloat
>>>>
>>>> However the test code links directly to the softfloat code so won't work
>>>> with direct code execution.
>>> I had planned to leverage the existing soft float test suite, and I
>>> think this can be done with the right harnessing. It would be nice to
>>> have a mechanism to test translation of individual TCG ops, e.g. be
>>> able to run translated blocks from test code and evaluate their
>>> output. I'm not sure if any such op level testing like that is being
>>> done.
>> Not at the moment but it would certainly be a useful addition for the
>> unit tests if we could test arbitrary sequences of TCG ops. I'm not sure
>> how much test harness would be needed to exercise that though.
>>
>>> There are also guest tests in tests/tcg, which could also be
>>> expanded to include more FP tests.
>> We have a number of multiarch tcg tests for fused multiply-add and the
>> various fconv operations. There is also quite an exhaustive set of i386
>> specific tests (test-i386-fprem) but it doesn't get run by default as
>> the "reference" output is too big to include in the tree and has to be
>> generated in-situ. You get it by adding SPEED=slow to your make
>> invocation.
>>
>>>> The existing 32/64 bit hardfloat
>>>> optimisations work within the helpers. While generating direct code is
>>>> appealing to avoid the cost of helper calls it's fairly well cached and
>>>> predicted code. Experience with the initial hardfloat code showed there
>>>> was still a performance win even with the cost of the helper call.
>>> Unfortunately, even with the existing hardfloat support, the overhead
>>> of the helper calls is still too costly for my particular application.
>> Once you start dealing with flag generation you might find that equation
>> changes somewhat if you have to mess around with bit masking and checks
>> using TCG ops. However providing benchmark results with your patch would
>> be required to argue the point. You can run tests/fp/fp-bench -t host
>> under translation to exercise that.
>>
>>>> I don't think you'll see the same behaviour emulating an x87 on anything
>>>> that isn't an x87 because the boundaries for things like inexact
>>>> calculation will be different. Indeed if you look at the existing
>>>> hardfloat code function can_use_fpu() you will see we only call the host
>>>> processor function if the inexact bit is already set. Other wrappers
>>>> have even more checks for normal numbers. Anything that needs NaN
>>>> handling will fall back to the correct softfloat code.
>>> Fair points. Bit-perfect x87 emulation with this approach may be
>>> ultimately unachievable; and I'm still evaluating the cases when this
>>> will not work. This has been a learning experience for me, and I
>>> gladly welcome expert input in this matter.
>>>
>>> Personally, I would accept minor accuracy differences in trade for
>>> significant performance advantage in emulation of game code, but not
>>> for scientific applications, which I understand may diminish upstream
>>> appeal of this x87 translation work.
>> Out of interest what game code still uses x87? I know the classic Doom
>> and Quake benchmarks showed a performance regression when we switched to
>> softfloat:
>>
>> https://diasp.eu/posts/ec86de10240e01376f734061862b8e7b
>>
>> however I kinda assumed more modern games would be taking advantage of
>> SSE and later features. There are however some gaps in the x86
>> emulation that might mean code falls back to the x87. Maybe that would
>> be another area to look at.
>>
>>>> I think there will be a wariness to merge anything that only works for a
>>>> single frontend/backend combination. Running translated x86 on x86 is
>>>> not the common case for TCG ;-)
>>> Understood; initially this works on a single frontend/backend
>>> combination, with eventual translation to other guests and hosts. It
>>> will be a long road, but my plan next is to produce a translation for
>>> AArch64 systems.

Hi, Matt

We have read your shared patch and are interested in the work you are
doing. Have you already added support for AArch64 systems?

>>>> These are the things that make correct handling of floating point hard.
>>> Agreed!

Hi, Alex

Does TCG plan to support hardware floating point?

Thanks
Song

>>>> I'll happily review patches on the list that provide for an accelerated
>>>> FPU experience as long as the correctness is maintained.
>>> Thank you!
>>>
>>> Matt
>>>
>>> On Thu, Sep 30, 2021 at 2:38 AM Alex Bennée wrote:
>>>>
>>>> Matt writes:
>>>>
>>>>> Hello--
>>>>>
>>>>> I'm excited to share that I have been developing support for TCG
>>>>> floating point operations; specifically, to accelerate emulation of
>>>>> x86 guest code which heavily exercises the x87 FPU for a game console
>>>>> emulator project based on QEMU.
>>>>> So far, this work has shown great
>>>>> promise, demonstrating some dramatic performance improvements in
>>>>> emulation of x87 heavy code.
>>>> I've not reviewed the code as it is a rather large diff. For your proper
>>>> submission could you please ensure that your patch series is broken up
>>>> into discrete commits to aid review. It also aids bisection if
>>>> regressions are identified.
>>>>
>>>>> The feature works in concert with unaccelerated x87 FPU helpers, and
>>>>> also allows total soft float helper fallback if the user discovers
>>>>> some issue with the hard float implementation.
>>>> The phrase "if the user discovers some issues" is a bit of a red flag.
>>>> We should always be striving for correct emulation of floating point.
>>>> Indeed we have recently got the code base to the point we pass all of
>>>> the Berkeley softfloat test suite. This can be checked by running:
>>>>
>>>> make check-softfloat
>>>>
>>>> However the test code links directly to the softfloat code so won't work
>>>> with direct code execution. The existing 32/64 bit hardfloat
>>>> optimisations work within the helpers. While generating direct code is
>>>> appealing to avoid the cost of helper calls it's fairly well cached and
>>>> predicted code. Experience with the initial hardfloat code showed there
>>>> was still a performance win even with the cost of the helper call.
>>>>
>>>>> For the TCG target,
>>>>> I've opted to implement it for x86-64 hosts using SSE2, although this
>>>>> could be extended to support full 80-bit double extended precision with
>>>>> host x87 support. I'm also in early development of an implementation
>>>>> for AArch64 hosts.
>>>> I don't think you'll see the same behaviour emulating an x87 on anything
>>>> that isn't an x87 because the boundaries for things like inexact
>>>> calculation will be different.
>>>> Indeed if you look at the existing
>>>> hardfloat code function can_use_fpu() you will see we only call the host
>>>> processor function if the inexact bit is already set. Other wrappers
>>>> have even more checks for normal numbers. Anything that needs NaN
>>>> handling will fall back to the correct softfloat code.
>>>>
>>>> I think there will be a wariness to merge anything that only works for a
>>>> single frontend/backend combination. Running translated x86 on x86 is
>>>> not the common case for TCG ;-)
>>>>
>>>>> There are still some significant tasks to be done, like proper
>>>>> handling of exception flags, edge cases, and testing, to name a few.
>>>> These are the things that make correct handling of floating point hard.
>>>>
>>>>> Once in a slightly more mature state, I do think this feature would
>>>>> make a natural addition to upstream QEMU and plan to submit it for
>>>>> consideration.
>>>>>
>>>>> I'm writing to the mailing list now to inform FPU maintainers and any
>>>>> other interested parties that this work is happening, to solicit any
>>>>> early feedback, and to extend an invitation to anyone interested in
>>>>> collaborating to expedite its upstreaming.
>>>> I'll happily review patches on the list that provide for an accelerated
>>>> FPU experience as long as the correctness is maintained.
>>>>
>>>>> My initial TCG FP work can be found here:
>>>>> https://github.com/mborgerson/xemu/pull/464/commits
>>>>>
>>>>> Thanks,
>>>>> Matt
>>>>
>>>> --
>>>> Alex Bennée
>>
>> --
>> Alex Bennée
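
For readers following the thread, the gating Alex describes for
can_use_fpu() (use the host FPU only when the inexact flag is already
set, add checks for normal numbers, and send everything else to
softfloat) can be illustrated with a small model. This is only a rough
sketch in Python, not QEMU code: FpStatus, FLAG_INEXACT, FLAG_INVALID,
soft_mul and mul are hypothetical names, the slow path is a simplified
stand-in that tracks just the inexact and invalid flags, and it models a
64-bit host double rather than x87 80-bit precision.

```python
import math
import sys
from dataclasses import dataclass
from fractions import Fraction

# Hypothetical flag bits for this sketch; these are not QEMU's constants.
FLAG_INEXACT = 1 << 0
FLAG_INVALID = 1 << 1


@dataclass
class FpStatus:
    """Sticky exception flags of the emulated (guest) FPU."""
    flags: int = 0


def is_zero_or_normal(x: float) -> bool:
    """True for zero and normal numbers; False for NaN, infinity, subnormals."""
    return x == 0.0 or (math.isfinite(x) and abs(x) >= sys.float_info.min)


def can_use_host_fpu(status: FpStatus) -> bool:
    """Allow the fast path only once the sticky inexact flag is already set,
    so an inexact host result cannot change the guest-visible flags."""
    return bool(status.flags & FLAG_INEXACT)


def soft_mul(a: float, b: float, status: FpStatus) -> float:
    """Simplified stand-in for the slow, flag-accurate path."""
    r = a * b
    if math.isnan(r):
        if not (math.isnan(a) or math.isnan(b)):
            status.flags |= FLAG_INVALID  # e.g. infinity * 0
        return r
    if math.isfinite(a) and math.isfinite(b):
        # Overflow to infinity, or a rounded finite result, is inexact.
        if not math.isfinite(r) or Fraction(a) * Fraction(b) != Fraction(r):
            status.flags |= FLAG_INEXACT
    return r


def mul(a: float, b: float, status: FpStatus) -> float:
    """Multiply via the host FPU when that is safe, else fall back."""
    if can_use_host_fpu(status) and is_zero_or_normal(a) and is_zero_or_normal(b):
        r = a * b
        exact_zero = r == 0.0 and (a == 0.0 or b == 0.0)
        if exact_zero or (math.isfinite(r) and abs(r) >= sys.float_info.min):
            return r  # normal result: inexact is the only possible flag
    return soft_mul(a, b, status)
```

In this model, a fresh FpStatus sends every operation through soft_mul
until some operation raises inexact; after that, multiplications of
normal inputs with normal results take the host path, while NaN,
infinity, subnormal, overflow and underflow cases still go to the slow
path. That is why the fast path cannot lose flag updates, and also why
it only pays off for guest code that leaves the inexact flag set.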