* Display update issue on M1 Macs
@ 2023-01-04 23:24 BALATON Zoltan
  2023-01-13 13:43 ` BALATON Zoltan
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-01-04 23:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Gerd Hoffmann, Akihiko Odaki, Joelle van Dyne

Hello,

I got reports from several users trying to run AmigaOS4 on sam460ex on 
Apple silicon Macs that they get missing graphics that I can't reproduce 
on x86_64. With help from the users who get the problem we've narrowed it 
down to the following:

It looks like data written to the sm501's RAM in
qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen from
sm501_update_display() in the same file. The sm501_2d_operation() function
is called when the guest accesses the emulated card, so it may run in a
different thread than sm501_update_display(), which is called by the ui
backend, but I'm not sure how QEMU calls these. Is device code running in
the iothread and display update in the main thread? The problem is also
independent of the display backend and was reproduced with both -display
cocoa and -display sdl.

We have confirmed it's not the pixman routines that sm501_2d_operation()
uses, as the same issue is also seen with QEMU 4.x where pixman wasn't
used, and with all versions up to 7.2, so it's not some bisectable change
in QEMU either. It also happens with --enable-debug, so it doesn't seem to
be related to optimisation. I don't get it on x86_64, but even x86_64 QEMU
builds run on Apple M1 with Rosetta 2 show the problem. It also only seems
to affect graphics written from sm501_2d_operation(), which AmigaOS4 uses
extensively. Other OSes don't use it and just render graphics with the
vcpu, which works without problems even on the M1 Macs that show this
problem with AmigaOS4. Theoretically this could be some missing
synchronisation, which is something ARM and PPC may need while x86
doesn't, but I don't know if this is really the reason and, if so, where
and how to fix it. Any idea what may cause this and what could be a fix
to try?
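
A minimal sketch of the kind of synchronisation in question, assuming C11
atomics and pthreads. This is not QEMU code: the Mailbox type and the
producer(), consume() and run_handoff() names are hypothetical. On a weakly
ordered CPU such as ARM, a plain store to data followed by a plain store to
a flag may become visible to another thread in a different order, so the
writer publishes with a release store and the reader waits with an acquire
load. The stronger hardware ordering on x86 can hide a missing barrier of
this kind.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared state: one "pixel" plus a flag that publishes it. */
typedef struct {
    uint32_t pixel;    /* plain data written by the producer */
    atomic_int ready;  /* set to 1 once pixel is valid */
} Mailbox;

/* Writer side: store the data, then publish it with release semantics. */
static void *producer(void *opaque)
{
    Mailbox *m = opaque;

    m->pixel = 0x00ff00ffu;
    /* Release: the pixel store above cannot be reordered after this store. */
    atomic_store_explicit(&m->ready, 1, memory_order_release);
    return NULL;
}

/* Reader side: wait for the flag with acquire semantics, then read data. */
static uint32_t consume(Mailbox *m)
{
    /* Acquire: once ready reads as 1, the pixel store is visible too. */
    while (!atomic_load_explicit(&m->ready, memory_order_acquire)) {
        /* spin until the producer publishes */
    }
    return m->pixel;
}

/* Run one hand-off between two threads and return what the reader saw. */
static uint32_t run_handoff(void)
{
    Mailbox m = { .pixel = 0 };
    pthread_t t;
    int rc;

    atomic_init(&m.ready, 0);
    rc = pthread_create(&t, NULL, producer, &m);
    assert(rc == 0);
    (void)rc;

    uint32_t seen = consume(&m);
    pthread_join(t, NULL);
    return seen;
}
```

With the release/acquire pair in place, run_handoff() returns the value
written by the producer on any conforming platform. Taking and releasing a
mutex around both accesses gives the same guarantee.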

(Info on how to run it is here:
http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
but AmigaOS4 is not freely distributable so it's a bit hard to reproduce. 
Some Linux X servers that support sm501/sm502 may also use the card's 2d 
engine but I don't know about any live CDs that readily run on sam460ex.)

Thank you,
BALATON Zoltan


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Display update issue on M1 Macs
  2023-01-04 23:24 Display update issue on M1 Macs BALATON Zoltan
@ 2023-01-13 13:43 ` BALATON Zoltan
  2023-01-14  2:41   ` Akihiko Odaki
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-01-13 13:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Peter Maydell, Gerd Hoffmann, Akihiko Odaki, Joelle van Dyne

On Thu, 5 Jan 2023, BALATON Zoltan wrote:
> Hello,
>
> I got reports from several users trying to run AmigaOS4 on sam460ex on Apple 
> silicon Macs that they get missing graphics that I can't reproduce on x86_64. 
> With help from the users who get the problem we've narrowed it down to the 
> following:
>
> It looks like that data written to the sm501's ram in 
> qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen from 
> sm501_update_display() in the same file. The sm501_2d_operation() function is 
> called when the guest accesses the emulated card so it may run in a different 
> thread than sm501_update_display() which is called by the ui backend but I'm 
> not sure how QEMU calls these. Is device code running in iothread and display 
> update in main thread? The problem is also independent of the display backend 
> and was reproduced with both -display cocoa and -display sdl.
>
> We have confirmed it's not the pixman routines that sm501_2d_operation() uses 
> as the same issue is seen also with QEMU 4.x where pixman wasn't used and 
> with all versions up to 7.2 so it's also not some bisectable change in QEMU. 
> It also happens with --enable-debug so it doesn't seem to be related to 
> optimisation either and I don't get it on x86_64 but even x86_64 QEMU builds 
> run on Apple M1 with Rosetta 2 show the problem. It also only seems to affect 
> graphics written from sm501_2d_operation() which AmigaOS4 uses extensively 
> but other OSes don't and just render graphics with the vcpu which work 
> without problem also on the M1 Macs that show this problem with AmigaOS4. 
> Theoretically this could be some missing synchronisation which is something 
> ARM and PPC may need while x86 doesn't but I don't know if this is really the 
> reason and if so where and how to fix it. Any idea what may cause this and 
> what could be a fix to try?

Any idea anyone? At least some explanation of whether the above is
plausible, or whether there's an option to disable the iothread and run
everything in a single thread to verify the theory, could help. I've got
reports from at least 3 people getting this problem but I can't do much to
fix it without some help.

> (Info on how to run it is here:
> http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
> but AmigaOS4 is not freely distributable so it's a bit hard to reproduce. 
> Some Linux X servers that support sm501/sm502 may also use the card's 2d 
> engine but I don't know about any live CDs that readily run on sam460ex.)
>
> Thank you,
> BALATON Zoltan



* Re: Display update issue on M1 Macs
  2023-01-13 13:43 ` BALATON Zoltan
@ 2023-01-14  2:41   ` Akihiko Odaki
  2023-01-14 18:11     ` BALATON Zoltan
  0 siblings, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2023-01-14  2:41 UTC (permalink / raw)
  To: Peter Maydell, BALATON Zoltan, qemu-devel; +Cc: Gerd Hoffmann, Joelle van Dyne

On 2023/01/13 22:43, BALATON Zoltan wrote:
> Any idea anyone? At least some explanation if the above is plausible or 
> if there's an option to disable the iothread and run everything in a 
> single thread to verify the theory could help. I've got reports from at 
> least 3 people getting this problem but I can't do much to fix it 
> without some help.

Sorry, I missed the email.

Indeed the ui backend should call sm501_update_display() in the main 
thread, which should be different from the thread calling 
sm501_2d_operation(). However, if I understand it correctly, both of the 
functions should be called with iothread lock held so there should be no 
race condition in theory.

But there is an exception: memory_region_snapshot_and_clear_dirty()
releases the iothread lock, and that broke the raspi3b display device:
https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/

It is unexpected for the gfx_update() callback to release the iothread
lock, so it may break things in peculiar ways.

Peter, is there any change in the situation regarding the race 
introduced by memory_region_snapshot_and_clear_dirty()?

For now, to work around the issue, I think you can create another mutex
and make the entire sm501_2d_engine_write() and sm501_update_display()
critical sections.
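
The workaround can be sketched roughly as follows. This is a standalone
pthread illustration under stated assumptions, not a patch against sm501.c:
FakeDisplay, fake_2d_fill(), fake_update_display() and fake_guest_thread()
are hypothetical stand-ins for the device state and the two functions
named above, and real QEMU code would more likely use its own QemuMutex
wrappers than raw pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define FB_PIXELS 64

/* Hypothetical device state; not the real SM501State. */
typedef struct {
    pthread_mutex_t lock;        /* the additional mutex */
    uint32_t vram[FB_PIXELS];    /* emulated video memory */
    uint32_t screen[FB_PIXELS];  /* what the ui backend would show */
} FakeDisplay;

static void fake_display_init(FakeDisplay *d)
{
    int rc = pthread_mutex_init(&d->lock, NULL);

    assert(rc == 0);
    (void)rc;
    memset(d->vram, 0, sizeof(d->vram));
    memset(d->screen, 0, sizeof(d->screen));
}

/*
 * Stand-in for the 2D engine write path: fill a span with one colour.
 * The caller is assumed to pass start >= 0; the end is clamped.
 */
static void fake_2d_fill(FakeDisplay *d, int start, int len, uint32_t colour)
{
    pthread_mutex_lock(&d->lock);
    for (int i = start; i < start + len && i < FB_PIXELS; i++) {
        d->vram[i] = colour;
    }
    pthread_mutex_unlock(&d->lock);
}

/* Stand-in for the display update callback: copy vram to the screen. */
static void fake_update_display(FakeDisplay *d)
{
    pthread_mutex_lock(&d->lock);
    memcpy(d->screen, d->vram, sizeof(d->screen));
    pthread_mutex_unlock(&d->lock);
}

/* Plays the guest-facing thread that triggers a 2D operation. */
static void *fake_guest_thread(void *opaque)
{
    fake_2d_fill(opaque, 0, FB_PIXELS, 0xff00ff00u);
    return NULL;
}
```

Because both functions hold the same mutex for their whole body, an update
can never observe a half-finished fill, and the unlock/lock pair also
orders the writer's stores before the reader's loads.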

Regards,
Akihiko Odaki



* Re: Display update issue on M1 Macs
  2023-01-14  2:41   ` Akihiko Odaki
@ 2023-01-14 18:11     ` BALATON Zoltan
  2023-01-19 13:10       ` Akihiko Odaki
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-01-14 18:11 UTC (permalink / raw)
  To: Akihiko Odaki; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On Sat, 14 Jan 2023, Akihiko Odaki wrote:
> Sorry, I missed the email.
>
> Indeed the ui backend should call sm501_update_display() in the main thread, 
> which should be different from the thread calling sm501_2d_operation(). 
> However, if I understand it correctly, both of the functions should be called 
> with iothread lock held so there should be no race condition in theory.
>
> But there is an exception: memory_region_snapshot_and_clear_dirty() releases 
> iothread lock, and that broke raspi3b display device:
> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>
> It is unexpected that gfx_update() callback releases iothread lock so it may 
> break things in peculiar ways.
>
> Peter, is there any change in the situation regarding the race introduced by 
> memory_region_snapshot_and_clear_dirty()?
>
> For now, to work around the issue, I think you can create another mutex and 
> make the entire sm501_2d_engine_write() and sm501_update_display() critical 
> sections.

Interesting thread, but I'm not sure it's the same problem, so this
workaround may not be enough to fix my issue. Here's a video posted by one
of the people who reported it, showing the problem on an M1 Mac:

https://www.youtube.com/watch?v=FDqoNbp6PQs

and here's what it looks like on other machines:

https://www.youtube.com/watch?v=ML7-F4HNFKQ

There are also videos showing it running on RPi 4 and G5 Mac without this
issue, so it seems to only happen on Apple Silicon M1 Macs. What's strange
is that graphics elements are not just delayed, which is what I think
should happen with missing thread synchronisation: the update callback
would miss some pixels rendered while it is running, but subsequent update
callbacks would eventually draw those, wouldn't they? Also, setting
full_update to 1 in the sm501_update_display() callback to disable dirty
tracking does not fix the problem. So it looks as if sm501_2d_operation()
running on one CPU core only writes data to the local cache of that core,
which sm501_update_display() running on another core can't see. Maybe some
cache synchronisation is needed in memory_region_set_dirty(), or if that's
already there, maybe I should call it for all changes, not only those in
the visible display area? I'm still not sure I understand the problem and
don't know what could be a fix for it, so anything to test that identifies
the issue better might also bring us closer to a solution.
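
To make the full_update reasoning concrete, here is a toy model of
per-line dirty tracking. It is a sketch under assumptions, not the sm501
code: ToyFb, toy_write() and toy_update() are hypothetical names, and the
real device tracks dirty state through the memory API rather than a bool
array. It shows why a write that was never marked dirty is skipped by a
normal update but picked up by a full update, so a problem that survives
full_update = 1 is unlikely to be a dirty-tracking miss.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINES 8
#define WIDTH 4

/* Hypothetical framebuffer with one dirty flag per scanline. */
typedef struct {
    uint32_t vram[LINES][WIDTH];    /* what the guest has drawn */
    uint32_t screen[LINES][WIDTH];  /* what the host window shows */
    bool dirty[LINES];              /* lines changed since last update */
} ToyFb;

/* Fill one line; mark_dirty = false models a write that tracking missed. */
static void toy_write(ToyFb *fb, int line, uint32_t colour, bool mark_dirty)
{
    for (int x = 0; x < WIDTH; x++) {
        fb->vram[line][x] = colour;
    }
    if (mark_dirty) {
        fb->dirty[line] = true;
    }
}

/*
 * Copy lines to the screen. With full_update set, every line is copied
 * regardless of its dirty flag. Returns the number of lines copied.
 */
static int toy_update(ToyFb *fb, bool full_update)
{
    int copied = 0;

    for (int y = 0; y < LINES; y++) {
        if (full_update || fb->dirty[y]) {
            memcpy(fb->screen[y], fb->vram[y], sizeof(fb->screen[y]));
            fb->dirty[y] = false;
            copied++;
        }
    }
    return copied;
}
```

In this model an untracked write stays invisible after toy_update(fb,
false) and appears after toy_update(fb, true), which is the behaviour the
full_update experiment was testing for.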

Regards,
BALATON Zoltan



* Re: Display update issue on M1 Macs
  2023-01-14 18:11     ` BALATON Zoltan
@ 2023-01-19 13:10       ` Akihiko Odaki
  2023-01-22 23:28         ` BALATON Zoltan
  0 siblings, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2023-01-19 13:10 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On 2023/01/15 3:11, BALATON Zoltan wrote:
> Interesting thread but not sure it's the same problem so this workaround 
> may not be enough to fix my issue. Here's a video posted by one of the 
> people who reported it showing the problem on M1 Mac:
> 
> https://www.youtube.com/watch?v=FDqoNbp6PQs
> 
> and here's what it looks like on other machines:
> 
> https://www.youtube.com/watch?v=ML7-F4HNFKQ
> 
> There are also videos showing it running on RPi 4 and G5 Mac without 
> this issue so it seems to only happen on Apple Silicon M1 Macs. What's 
> strange is that graphics elements are not just delayed which I think 
> should happen with missing thread synchronisation where the update 
> callback would miss some pixels rendered during its running but 
> subsequent update callbacks would eventually draw those, wouldn't they? 
> Also setting full_update to 1 in sm501_update_display() callback to 
> disable dirty tracking does not fix the problem. So it looks as if 
> sm501_2d_operation() running on one CPU core only writes data to the 
> local cache of that core which sm501_update_display() running on other 
> core can't see, so maybe some cache synchronisation is needed in 
> memory_region_set_dirty() or if that's already there maybe I should call 
> it for all changes not only those in the visible display area? I'm still 
> not sure I understand the problem and don't know what could be a fix for 
> it so anything to test to identify the issue better might also bring us 
> closer to a solution.
> 
> Regards,
> BALATON Zoltan

If you set full_update to 1, you may also comment out 
memory_region_snapshot_and_clear_dirty() and 
memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
unlocked. The iothread mutex should ensure cache coherency as well.

But as you say, it's weird that the rendered result is not just delayed
but missing. That may imply other possibilities (e.g., the results are
overwritten by something else). If the problem persists after commenting
out memory_region_snapshot_and_clear_dirty() and
memory_region_snapshot_get_dirty(), I think you can assume that
inter-thread coherency between sm501_2d_operation() and
sm501_update_display() is not causing the problem.

Regards,
Akihiko Odaki



* Re: Display update issue on M1 Macs
  2023-01-19 13:10       ` Akihiko Odaki
@ 2023-01-22 23:28         ` BALATON Zoltan
  2023-01-28  4:01           ` Akihiko Odaki
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-01-22 23:28 UTC (permalink / raw)
  To: Akihiko Odaki; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On Thu, 19 Jan 2023, Akihiko Odaki wrote:
> If you set full_update to 1, you may also comment out 
> memory_region_snapshot_and_clear_dirty() and 
> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
> unlocked. The iothread mutex should ensure cache coherency as well.
>
> But as you say, it's weird that the rendered result is not just delayed but 
> missed. That may imply other possibilities (e.g., the results are overwritten 
> by someone else). If the problem persists after commenting out 
> memory_region_snapshot_and_clear_dirty() and 
> memory_region_snapshot_get_dirty(), I think you can assume the inter-thread 
> coherency between sm501_2d_operation() and sm501_update_display() is not 
> causing the problem.

I've asked the people who reported it and can reproduce it to test this,
but it did not change anything. So that confirms it's not that race
condition; it looks more like some cache inconsistency, maybe. Any other
ideas?

Regards,
BALATON Zoltan



* Re: Display update issue on M1 Macs
  2023-01-22 23:28         ` BALATON Zoltan
@ 2023-01-28  4:01           ` Akihiko Odaki
  2023-01-30 23:58             ` BALATON Zoltan
  0 siblings, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2023-01-28  4:01 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne



On 2023/01/23 8:28, BALATON Zoltan wrote:
> On Thu, 19 Jan 2023, Akihiko Odaki wrote:
>> On 2023/01/15 3:11, BALATON Zoltan wrote:
>>> On Sat, 14 Jan 2023, Akihiko Odaki wrote:
>>>> On 2023/01/13 22:43, BALATON Zoltan wrote:
>>>>> On Thu, 5 Jan 2023, BALATON Zoltan wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I got reports from several users trying to run AmigaOS4 on 
>>>>>> sam460ex on Apple silicon Macs that they get missing graphics that 
>>>>>> I can't reproduce on x86_64. With help from the users who get the 
>>>>>> problem we've narrowed it down to the following:
>>>>>>
>>>>>> It looks like that data written to the sm501's ram in 
>>>>>> qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen 
>>>>>> from sm501_update_display() in the same file. The 
>>>>>> sm501_2d_operation() function is called when the guest accesses 
>>>>>> the emulated card so it may run in a different thread than 
>>>>>> sm501_update_display() which is called by the ui backend but I'm 
>>>>>> not sure how QEMU calls these. Is device code running in iothread 
>>>>>> and display update in main thread? The problem is also independent 
>>>>>> of the display backend and was reproduced with both -display cocoa 
>>>>>> and -display sdl.
>>>>>>
>>>>>> We have confirmed it's not the pixman routines that 
>>>>>> sm501_2d_operation() uses as the same issue is seen also with QEMU 
>>>>>> 4.x where pixman wasn't used and with all versions up to 7.2 so 
>>>>>> it's also not some bisectable change in QEMU. It also happens with 
>>>>>> --enable-debug so it doesn't seem to be related to optimisation 
>>>>>> either and I don't get it on x86_64 but even x86_64 QEMU builds 
>>>>>> run on Apple M1 with Rosetta 2 show the problem. It also only 
>>>>>> seems to affect graphics written from sm501_2d_operation() which 
>>>>>> AmigaOS4 uses extensively but other OSes don't and just render 
>>>>>> graphics with the vcpu which work without problem also on the M1 
>>>>>> Macs that show this problem with AmigaOS4. Theoretically this 
>>>>>> could be some missing syncronisation which is something ARM and 
>>>>>> PPC may need while x86 doesn't but I don't know if this is really 
>>>>>> the reason and if so where and how to fix it). Any idea what may 
>>>>>> cause this and what could be a fix to try?
>>>>>
>>>>> Any idea anyone? At least some explanation if the above is 
>>>>> plausible or if there's an option to disable the iothread and run 
>>>>> everyting in a single thread to verify the theory could help. I've 
>>>>> got reports from at least 3 people getting this problem but I can't 
>>>>> do much to fix it without some help.
>>>>>
>>>>>> (Info on how to run it is here:
>>>>>> http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
>>>>>> but AmigaOS4 is not freely distributable so it's a bit hard to 
>>>>>> reproduce. Some Linux X servers that support sm501/sm502 may also 
>>>>>> use the card's 2d engine but I don't know about any live CDs that 
>>>>>> readily run on sam460ex.)
>>>>>>
>>>>>> Thank you,
>>>>>> BALATON Zoltan
>>>>
>>>> Sorry, I missed the email.
>>>>
>>>> Indeed the ui backend should call sm501_update_display() in the main 
>>>> thread, which should be different from the thread calling 
>>>> sm501_2d_operation(). However, if I understand it correctly, both of 
>>>> the functions should be called with iothread lock held so there 
>>>> should be no race condition in theory.
>>>>
>>>> But there is an exception: memory_region_snapshot_and_clear_dirty() 
>>>> releases iothread lock, and that broke raspi3b display device:
>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>
>>>> It is unexpected that gfx_update() callback releases iothread lock 
>>>> so it may break things in peculiar ways.
>>>>
>>>> Peter, is there any change in the situation regarding the race 
>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>
>>>> For now, to workaround the issue, I think you can create another 
>>>> mutex and make the entire sm501_2d_engine_write() and 
>>>> sm501_update_display() critical sections.
>>>
>>> Interesting thread but not sure it's the same problem so this 
>>> workaround may not be enough to fix my issue. Here's a video posted 
>>> by one of the people who reported it showing the problem on M1 Mac:
>>>
>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>
>>> and here's how it looks like on other machines:
>>>
>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>
>>> There are also videos showing it running on RPi 4 and G5 Mac without 
>>> this issue so it seems to only happen on Apple Silicon M1 Macs. 
>>> What's strange is that graphics elements are not just delayed which I 
>>> think should happen with missing thread synchronisation where the 
>>> update callback would miss some pixels rendered during its running 
>>> but subsequent update callbacks would eventually draw those, wouldn't 
>>> they? Also setting full_update to 1 in sm501_update_display() 
>>> callback to disable dirty tracking does not fix the problem. So it 
>>> looks like as if sm501_2d_operation() running on one CPU core only 
>>> writes data to the local cache of that core which 
>>> sm501_update_display() running on other core can't see, so maybe some 
>>> cache synchronisation is needed in memory_region_set_dirty() or if 
>>> that's already there maybe I should call it for all changes not only 
>>> those in the visible display area? I'm still not sure I understand 
>>> the problem and don't know what could be a fix for it so anything to 
>>> test to identify the issue better might also bring us closer to a 
>>> solution.
>>>
>>> Regards,
>>> BALATON Zoltan
>>
>> If you set full_update to 1, you may also comment out 
>> memory_region_snapshot_and_clear_dirty() and 
>> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
>> unlocked. The iothread mutex should ensure cache coherency as well.
>>
>> But as you say, it's weird that the rendered result is not just 
>> delayed but missed. That may imply other possibilities (e.g., the 
>> results are overwritten by someone else). If the problem persists 
>> after commenting out memory_region_snapshot_and_clear_dirty() and 
>> memory_region_snapshot_get_dirty(), I think you can assume the 
>> inter-thread coherency between sm501_2d_operation() and 
>> sm501_update_display() is not causing the problem.
> 
> I've asked people who reported and can reproduce it to test this but it 
> did not change anything so confirmed it's not that race condition but 
> looks more like some cache inconsistency maybe. Any other ideas?
> 
> Regards,
> BALATON Zoltan

I can come up with two important differences between x86 and Arm which 
can affect the execution of QEMU:
1. Memory model. Arm uses a memory model more relaxed than x86 so it is 
more sensitive to synchronization failures among threads.
2. Different instructions. TCG uses JIT so differences in instructions 
matter.

We should be able to exclude 1) as a potential cause of the problem. 
The iothread mutex should take care of race conditions and even cache 
coherency problems, since a mutex includes memory barrier functionality.
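
To illustrate point 1, here is a minimal sketch (not QEMU code) of the
ordering that a mutex provides implicitly, written with C11 atomics. The
names vram, ready and run_message_passing are made up for illustration.
The release store and the acquire load are what guarantee that the
buffer contents are visible to the reader once it sees the flag. With
relaxed ordering, a weakly ordered CPU such as Arm would be allowed to
make the flag visible before the data, which x86 hardware does not do
for plain stores (the compiler could still reorder them).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins: "vram" for device memory, "ready" for a flag. */
static uint32_t vram[256];
static atomic_int ready;

/* Writer: fill the buffer, then publish it with a release store. */
static void *writer(void *arg)
{
    uint32_t colour = *(const uint32_t *)arg;

    for (int i = 0; i < 256; i++) {
        vram[i] = colour;
    }
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Reader: wait with acquire loads, then count pixels of that colour. */
static int read_after_publish(uint32_t colour)
{
    int seen = 0;

    while (!atomic_load_explicit(&ready, memory_order_acquire)) {
        /* spin until the writer has published */
    }
    for (int i = 0; i < 256; i++) {
        seen += (vram[i] == colour);
    }
    return seen;
}

/* Start one writer thread, read in the caller; returns pixels seen. */
int run_message_passing(uint32_t colour)
{
    pthread_t t;
    int seen;

    atomic_store(&ready, 0);
    if (pthread_create(&t, NULL, writer, &colour) != 0) {
        return -1;
    }
    seen = read_after_publish(colour);
    pthread_join(t, NULL);
    return seen;
}
```

Calling run_message_passing() with any colour should return 256 on every
architecture, because the acquire load that observes the flag also
orders the later reads of vram after the writer's stores.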

For difference 2), you may try to use TCI. You can find details of TCI 
in tcg/tci/README.

Common sense says, however, that the memory model is usually the cause 
of the problem when you see behavioral differences between x86 and Arm, 
and TCG should work fine on both x86 and Arm as both should have been 
tested well.

Regards,
Akihiko Odaki



* Re: Display update issue on M1 Macs
  2023-01-28  4:01           ` Akihiko Odaki
@ 2023-01-30 23:58             ` BALATON Zoltan
  2023-01-31  7:37               ` Akihiko Odaki
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-01-30 23:58 UTC (permalink / raw)
  To: Akihiko Odaki; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On Sat, 28 Jan 2023, Akihiko Odaki wrote:
> On 2023/01/23 8:28, BALATON Zoltan wrote:
>> On Thu, 19 Jan 2023, Akihiko Odaki wrote:
>>> On 2023/01/15 3:11, BALATON Zoltan wrote:
>>>> On Sat, 14 Jan 2023, Akihiko Odaki wrote:
>>>>> On 2023/01/13 22:43, BALATON Zoltan wrote:
>>>>>> On Thu, 5 Jan 2023, BALATON Zoltan wrote:
>>>>>>> Hello,
>>>>>>> 
>>>>>>> I got reports from several users trying to run AmigaOS4 on sam460ex on 
>>>>>>> Apple silicon Macs that they get missing graphics that I can't 
>>>>>>> reproduce on x86_64. With help from the users who get the problem 
>>>>>>> we've narrowed it down to the following:
>>>>>>> 
>>>>>>> It looks like that data written to the sm501's ram in 
>>>>>>> qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen from 
>>>>>>> sm501_update_display() in the same file. The sm501_2d_operation() 
>>>>>>> function is called when the guest accesses the emulated card so it may 
>>>>>>> run in a different thread than sm501_update_display() which is called 
>>>>>>> by the ui backend but I'm not sure how QEMU calls these. Is device 
>>>>>>> code running in iothread and display update in main thread? The 
>>>>>>> problem is also independent of the display backend and was reproduced 
>>>>>>> with both -display cocoa and -display sdl.
>>>>>>> 
>>>>>>> We have confirmed it's not the pixman routines that 
>>>>>>> sm501_2d_operation() uses as the same issue is seen also with QEMU 4.x 
>>>>>>> where pixman wasn't used and with all versions up to 7.2 so it's also 
>>>>>>> not some bisectable change in QEMU. It also happens with 
>>>>>>> --enable-debug so it doesn't seem to be related to optimisation either 
>>>>>>> and I don't get it on x86_64 but even x86_64 QEMU builds run on Apple 
>>>>>>> M1 with Rosetta 2 show the problem. It also only seems to affect 
>>>>>>> graphics written from sm501_2d_operation() which AmigaOS4 uses 
>>>>>>> extensively but other OSes don't and just render graphics with the 
>>>>>>> vcpu which work without problem also on the M1 Macs that show this 
>>>>>>> problem with AmigaOS4. Theoretically this could be some missing 
>>>>>>> synchronisation which is something ARM and PPC may need while x86 
>>>>>>> doesn't but I don't know if this is really the reason and if so where 
>>>>>>> and how to fix it). Any idea what may cause this and what could be a 
>>>>>>> fix to try?
>>>>>> 
>>>>>> Any idea anyone? At least some explanation if the above is plausible or 
>>>>>> if there's an option to disable the iothread and run everything in a 
>>>>>> single thread to verify the theory could help. I've got reports from at 
>>>>>> least 3 people getting this problem but I can't do much to fix it 
>>>>>> without some help.
>>>>>> 
>>>>>>> (Info on how to run it is here:
>>>>>>> http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
>>>>>>> but AmigaOS4 is not freely distributable so it's a bit hard to 
>>>>>>> reproduce. Some Linux X servers that support sm501/sm502 may also use 
>>>>>>> the card's 2d engine but I don't know about any live CDs that readily 
>>>>>>> run on sam460ex.)
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> BALATON Zoltan
>>>>> 
>>>>> Sorry, I missed the email.
>>>>> 
>>>>> Indeed the ui backend should call sm501_update_display() in the main 
>>>>> thread, which should be different from the thread calling 
>>>>> sm501_2d_operation(). However, if I understand it correctly, both of the 
>>>>> functions should be called with iothread lock held so there should be no 
>>>>> race condition in theory.
>>>>> 
>>>>> But there is an exception: memory_region_snapshot_and_clear_dirty() 
>>>>> releases iothread lock, and that broke raspi3b display device:
>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>> 
>>>>> It is unexpected that gfx_update() callback releases iothread lock so it 
>>>>> may break things in peculiar ways.
>>>>> 
>>>>> Peter, is there any change in the situation regarding the race 
>>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>> 
>>>>> For now, to workaround the issue, I think you can create another mutex 
>>>>> and make the entire sm501_2d_engine_write() and sm501_update_display() 
>>>>> critical sections.
>>>> 
>>>> Interesting thread but not sure it's the same problem so this workaround 
>>>> may not be enough to fix my issue. Here's a video posted by one of the 
>>>> people who reported it showing the problem on M1 Mac:
>>>> 
>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>> 
>>>> and here's how it looks like on other machines:
>>>> 
>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>> 
>>>> There are also videos showing it running on RPi 4 and G5 Mac without this 
>>>> issue so it seems to only happen on Apple Silicon M1 Macs. What's strange 
>>>> is that graphics elements are not just delayed which I think should 
>>>> happen with missing thread synchronisation where the update callback 
>>>> would miss some pixels rendered during its running but subsequent update 
>>>> callbacks would eventually draw those, wouldn't they? Also setting 
>>>> full_update to 1 in sm501_update_display() callback to disable dirty 
>>>> tracking does not fix the problem. So it looks like as if 
>>>> sm501_2d_operation() running on one CPU core only writes data to the 
>>>> local cache of that core which sm501_update_display() running on other 
>>>> core can't see, so maybe some cache synchronisation is needed in 
>>>> memory_region_set_dirty() or if that's already there maybe I should call 
>>>> it for all changes not only those in the visible display area? I'm still 
>>>> not sure I understand the problem and don't know what could be a fix for 
>>>> it so anything to test to identify the issue better might also bring us 
>>>> closer to a solution.
>>>> 
>>>> Regards,
>>>> BALATON Zoltan
>>> 
>>> If you set full_update to 1, you may also comment out 
>>> memory_region_snapshot_and_clear_dirty() and 
>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
>>> unlocked. The iothread mutex should ensure cache coherency as well.
>>> 
>>> But as you say, it's weird that the rendered result is not just delayed 
>>> but missed. That may imply other possibilities (e.g., the results are 
>>> overwritten by someone else). If the problem persists after commenting out 
>>> memory_region_snapshot_and_clear_dirty() and 
>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>> inter-thread coherency between sm501_2d_operation() and 
>>> sm501_update_display() is not causing the problem.
>> 
>> I've asked people who reported and can reproduce it to test this but it did 
>> not change anything so confirmed it's not that race condition but looks 
>> more like some cache inconsistency maybe. Any other ideas?
>> 
>> Regards,
>> BALATON Zoltan
>
> I can come up with two important differences between x86 and Arm which can 
> affect the execution of QEMU:
> 1. Memory model. Arm uses a memory model more relaxed than x86 so it is more 
> sensitive for synchronization failures among threads.
> 2. Different instructions. TCG uses JIT so differences in instructions 
> matter.
>
> We should be able to exclude 1) as a potential cause of the problem. iothread 
> mutex should take care of race condition and even cache coherency problem; 
> mutex includes memory barrier functionality.

Where is this barrier in QEMU code? Does this also ensure cache coherency 
between different cores or only memory sync in one core? From the testing 
I suspect it's probably not because of the weak ordering of ARM but 
something to do with different threads writing and reading the memory 
area. Is there a way to disable the separate vcpu thread and run 
everything in a single thread to verify this theory? (We only have one 
vcpu so it's not an MTTCG issue but something between the vcpu and main 
thread maybe.)

> For difference 2), you may try to use TCI. You can find details of TCI in 
> tcg/tci/README.

This was tested, and TCI gave the same results, just much slower.

> The common sense tells, however, the memory model is usually the cause of the 
> problem when you see behavioral differences between x86 and Arm, and TCG 
> should work fine with both of x86 and Arm as they should have been tested 
> well.

It's not only between x86 and ARM but also between different ARM CPUs it 
seems as there are videos of this test case running on Raspberry Pi 4 but 
all QEMU versions failed on Apple M1 so maybe it's something specific to 
that CPU.

Regards,
BALATON Zoltan



* Re: Display update issue on M1 Macs
  2023-01-30 23:58             ` BALATON Zoltan
@ 2023-01-31  7:37               ` Akihiko Odaki
  2023-01-31 14:15                 ` BALATON Zoltan
  0 siblings, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2023-01-31  7:37 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On 2023/01/31 8:58, BALATON Zoltan wrote:
> On Sat, 28 Jan 2023, Akihiko Odaki wrote:
>> On 2023/01/23 8:28, BALATON Zoltan wrote:
>>> On Thu, 19 Jan 2023, Akihiko Odaki wrote:
>>>> On 2023/01/15 3:11, BALATON Zoltan wrote:
>>>>> On Sat, 14 Jan 2023, Akihiko Odaki wrote:
>>>>>> On 2023/01/13 22:43, BALATON Zoltan wrote:
>>>>>>> On Thu, 5 Jan 2023, BALATON Zoltan wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I got reports from several users trying to run AmigaOS4 on 
>>>>>>>> sam460ex on Apple silicon Macs that they get missing graphics 
>>>>>>>> that I can't reproduce on x86_64. With help from the users who 
>>>>>>>> get the problem we've narrowed it down to the following:
>>>>>>>>
>>>>>>>> It looks like that data written to the sm501's ram in 
>>>>>>>> qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen 
>>>>>>>> from sm501_update_display() in the same file. The 
>>>>>>>> sm501_2d_operation() function is called when the guest accesses 
>>>>>>>> the emulated card so it may run in a different thread than 
>>>>>>>> sm501_update_display() which is called by the ui backend but I'm 
>>>>>>>> not sure how QEMU calls these. Is device code running in 
>>>>>>>> iothread and display update in main thread? The problem is also 
>>>>>>>> independent of the display backend and was reproduced with both 
>>>>>>>> -display cocoa and -display sdl.
>>>>>>>>
>>>>>>>> We have confirmed it's not the pixman routines that 
>>>>>>>> sm501_2d_operation() uses as the same issue is seen also with 
>>>>>>>> QEMU 4.x where pixman wasn't used and with all versions up to 
>>>>>>>> 7.2 so it's also not some bisectable change in QEMU. It also 
>>>>>>>> happens with --enable-debug so it doesn't seem to be related to 
>>>>>>>> optimisation either and I don't get it on x86_64 but even x86_64 
>>>>>>>> QEMU builds run on Apple M1 with Rosetta 2 show the problem. It 
>>>>>>>> also only seems to affect graphics written from 
>>>>>>>> sm501_2d_operation() which AmigaOS4 uses extensively but other 
>>>>>>>> OSes don't and just render graphics with the vcpu which work 
>>>>>>>> without problem also on the M1 Macs that show this problem with 
>>>>>>>> AmigaOS4. Theoretically this could be some missing 
>>>>>>>> synchronisation which is something ARM and PPC may need while x86 
>>>>>>>> doesn't but I don't know if this is really the reason and if so 
>>>>>>>> where and how to fix it). Any idea what may cause this and what 
>>>>>>>> could be a fix to try?
>>>>>>>
>>>>>>> Any idea anyone? At least some explanation if the above is 
>>>>>>> plausible or if there's an option to disable the iothread and run 
>>>>>>> everything in a single thread to verify the theory could help. 
>>>>>>> I've got reports from at least 3 people getting this problem but 
>>>>>>> I can't do much to fix it without some help.
>>>>>>>
>>>>>>>> (Info on how to run it is here:
>>>>>>>> http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
>>>>>>>> but AmigaOS4 is not freely distributable so it's a bit hard to 
>>>>>>>> reproduce. Some Linux X servers that support sm501/sm502 may 
>>>>>>>> also use the card's 2d engine but I don't know about any live 
>>>>>>>> CDs that readily run on sam460ex.)
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> BALATON Zoltan
>>>>>>
>>>>>> Sorry, I missed the email.
>>>>>>
>>>>>> Indeed the ui backend should call sm501_update_display() in the 
>>>>>> main thread, which should be different from the thread calling 
>>>>>> sm501_2d_operation(). However, if I understand it correctly, both 
>>>>>> of the functions should be called with iothread lock held so there 
>>>>>> should be no race condition in theory.
>>>>>>
>>>>>> But there is an exception: 
>>>>>> memory_region_snapshot_and_clear_dirty() releases iothread lock, 
>>>>>> and that broke raspi3b display device:
>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>
>>>>>> It is unexpected that gfx_update() callback releases iothread lock 
>>>>>> so it may break things in peculiar ways.
>>>>>>
>>>>>> Peter, is there any change in the situation regarding the race 
>>>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>
>>>>>> For now, to workaround the issue, I think you can create another 
>>>>>> mutex and make the entire sm501_2d_engine_write() and 
>>>>>> sm501_update_display() critical sections.
>>>>>
>>>>> Interesting thread but not sure it's the same problem so this 
>>>>> workaround may not be enough to fix my issue. Here's a video posted 
>>>>> by one of the people who reported it showing the problem on M1 Mac:
>>>>>
>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>
>>>>> and here's how it looks like on other machines:
>>>>>
>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>
>>>>> There are also videos showing it running on RPi 4 and G5 Mac 
>>>>> without this issue so it seems to only happen on Apple Silicon M1 
>>>>> Macs. What's strange is that graphics elements are not just delayed 
>>>>> which I think should happen with missing thread synchronisation 
>>>>> where the update callback would miss some pixels rendered during 
>>>>> its running but subsequent update callbacks would eventually draw 
>>>>> those, wouldn't they? Also setting full_update to 1 in 
>>>>> sm501_update_display() callback to disable dirty tracking does not 
>>>>> fix the problem. So it looks like as if sm501_2d_operation() 
>>>>> running on one CPU core only writes data to the local cache of that 
>>>>> core which sm501_update_display() running on other core can't see, 
>>>>> so maybe some cache synchronisation is needed in 
>>>>> memory_region_set_dirty() or if that's already there maybe I should 
>>>>> call it for all changes not only those in the visible display area? 
>>>>> I'm still not sure I understand the problem and don't know what 
>>>>> could be a fix for it so anything to test to identify the issue 
>>>>> better might also bring us closer to a solution.
>>>>>
>>>>> Regards,
>>>>> BALATON Zoltan
>>>>
>>>> If you set full_update to 1, you may also comment out 
>>>> memory_region_snapshot_and_clear_dirty() and 
>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
>>>> unlocked. The iothread mutex should ensure cache coherency as well.
>>>>
>>>> But as you say, it's weird that the rendered result is not just 
>>>> delayed but missed. That may imply other possibilities (e.g., the 
>>>> results are overwritten by someone else). If the problem persists 
>>>> after commenting out memory_region_snapshot_and_clear_dirty() and 
>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>> inter-thread coherency between sm501_2d_operation() and 
>>>> sm501_update_display() is not causing the problem.
>>>
>>> I've asked people who reported and can reproduce it to test this but 
>>> it did not change anything so confirmed it's not that race condition 
>>> but looks more like some cache inconsistency maybe. Any other ideas?
>>>
>>> Regards,
>>> BALATON Zoltan
>>
>> I can come up with two important differences between x86 and Arm which 
>> can affect the execution of QEMU:
>> 1. Memory model. Arm uses a memory model more relaxed than x86 so it 
>> is more sensitive for synchronization failures among threads.
>> 2. Different instructions. TCG uses JIT so differences in instructions 
>> matter.
>>
>> We should be able to exclude 1) as a potential cause of the problem. 
>> iothread mutex should take care of race condition and even cache 
>> coherency problem; mutex includes memory barrier functionality.
> 
> Where is this barrier in QEMU code? Does this also ensure cache 
> coherency between different cores or only memory sync in one core? From 
> the testing I suspect it's probably not because of the weak ordering of 
> ARM but something to do with different threads writing and reading the 
> memory area. Is there a way to disable separate vcpu thread and run 
> everything in a single thread to verify this theory? (We only have one 
> vcpu so it's not an MTTCG issue but something between the vcpu and main 
> thread maybe.)

QEMU uses pthread_mutex for macOS, and pthread_mutex (or any sane mutex 
implementation for SMP systems) should also ensure memory 
synchronization across different cores.
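
As a minimal sketch of that guarantee (again not QEMU code, and not the
iothread lock itself), the following uses a made-up framebuffer fb
protected by a made-up lock fb_lock. One thread fills the buffer while
holding the lock; the caller polls under the same lock and, once it sees
the fill counter change, counts the pixels it can observe. All names
here are assumptions for illustration only.

```c
#include <pthread.h>
#include <stdint.h>

#define FB_PIXELS 1024

/* Hypothetical shared state, guarded by fb_lock. */
static pthread_mutex_t fb_lock = PTHREAD_MUTEX_INITIALIZER;
static uint32_t fb[FB_PIXELS];
static int fills_done;

/* "2D engine" side: fill the whole buffer inside the critical section. */
static void *fill_thread(void *arg)
{
    uint32_t colour = *(const uint32_t *)arg;

    pthread_mutex_lock(&fb_lock);
    for (int i = 0; i < FB_PIXELS; i++) {
        fb[i] = colour;
    }
    fills_done++;
    pthread_mutex_unlock(&fb_lock);
    return NULL;
}

/*
 * "Display update" side: poll under the lock until the fill happened,
 * then return how many pixels have the expected colour (-1 on error).
 */
int run_locked_fill(uint32_t colour)
{
    pthread_t t;
    int matching = -1;

    fills_done = 0;                 /* no other thread exists yet */
    if (pthread_create(&t, NULL, fill_thread, &colour) != 0) {
        return -1;
    }
    while (matching < 0) {
        pthread_mutex_lock(&fb_lock);
        if (fills_done) {
            matching = 0;
            for (int i = 0; i < FB_PIXELS; i++) {
                matching += (fb[i] == colour);
            }
        }
        pthread_mutex_unlock(&fb_lock);
    }
    pthread_join(t, NULL);
    return matching;
}
```

If the lock is taken on both sides as above, the reader should always
see either none or all of the fill, regardless of which core each thread
runs on. A result below FB_PIXELS would point at a path that touches the
buffer without holding the lock.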

That said, it is still possible that we miss something that prevents 
memory synchronization. Ideally the theory should be confirmed by 
experiments, but it is not easy with Mac.

The easiest option is to run QEMU/sam460ex on Linux on QEMU/hvf. Running 
the entire Linux system without the -smp option may be too slow so you 
may use the taskset command on Linux to pin the QEMU/sam460ex process to 
a particular vCPU. This is somewhat incomplete as virtualization 
interferes with caches and may hide problems or trigger other bugs. The 
difference between the operating systems is also a concern.
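
For reference, the pinning that taskset performs can be sketched in C
with the Linux-only sched_setaffinity() call. This is only an
illustration of the mechanism; the helper names pin_to_cpu and
allowed_cpu_count are made up. Note that with pid 0 the call applies to
the calling thread, and threads created afterwards inherit the mask,
which is why taskset sets it before starting the program.

```c
#define _GNU_SOURCE             /* for sched_setaffinity() and CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to one CPU; 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs the calling thread may run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return -1;
    }
    return CPU_COUNT(&set);
}
```

After a successful pin_to_cpu(), allowed_cpu_count() should report 1, so
the writer and the reader threads can no longer run on different cores
at the same time.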

Another option is to use the taskset command on Asahi Linux. Installing 
Asahi Linux is easy, but uninstalling it is a bit complicated.

The m1n1 hypervisor from the Asahi Linux project allows restricting 
which CPUs are used, and I think it also allows changing the memory 
model to x86 TSO. Unlike QEMU/hvf on macOS, it is very minimalistic so 
its interference with e.g. caches is limited. It is very useful for 
debugging XNU or Linux, but hard to set up and requires another computer 
to control it.

Finally, you can patch the XNU kernel, but this is obviously not easy.

> 
>> For difference 2), you may try to use TCI. You can find details of TCI 
>> in tcg/tci/README.
> 
> This was tested and also with TCI got the same results just much slower.
> 
>> The common sense tells, however, the memory model is usually the cause 
>> of the problem when you see behavioral differences between x86 and 
>> Arm, and TCG should work fine with both of x86 and Arm as they should 
>> have been tested well.
> 
> It's not only between x86 and ARM but also between different ARM CPUs it 
> seems as there are videos of this test case running on Raspberry Pi 4 
> but all QEMU versions failed on Apple M1 so maybe it's something 
> specific to that CPU.

It is likely that the combination of Apple's microarchitecture and the 
Arm instruction set causes the problem. For example, even though the 
memory model of Arm is weaker than that of x86, such a difference may 
not surface, depending on the design of the load/store unit or the size 
of the load/store buffers.

Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, 
which makes it possible to compare x86 and Arm without worrying about 
differences in the microarchitecture.

Regards,
Akihiko Odaki

> 
> Regards,
> BALATON Zoltan




* Re: Display update issue on M1 Macs
  2023-01-31  7:37               ` Akihiko Odaki
@ 2023-01-31 14:15                 ` BALATON Zoltan
  2023-02-02 10:51                   ` BALATON Zoltan
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-01-31 14:15 UTC (permalink / raw)
  To: Akihiko Odaki; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On Tue, 31 Jan 2023, Akihiko Odaki wrote:
> On 2023/01/31 8:58, BALATON Zoltan wrote:
>> On Sat, 28 Jan 2023, Akihiko Odaki wrote:
>>> On 2023/01/23 8:28, BALATON Zoltan wrote:
>>>> On Thu, 19 Jan 2023, Akihiko Odaki wrote:
>>>>> On 2023/01/15 3:11, BALATON Zoltan wrote:
>>>>>> On Sat, 14 Jan 2023, Akihiko Odaki wrote:
>>>>>>> On 2023/01/13 22:43, BALATON Zoltan wrote:
>>>>>>>> On Thu, 5 Jan 2023, BALATON Zoltan wrote:
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> I got reports from several users trying to run AmigaOS4 on sam460ex 
>>>>>>>>> on Apple silicon Macs that they get missing graphics that I can't 
>>>>>>>>> reproduce on x86_64. With help from the users who get the problem 
>>>>>>>>> we've narrowed it down to the following:
>>>>>>>>> 
>>>>>>>>> It looks like that data written to the sm501's ram in 
>>>>>>>>> qemu/hw/display/sm501.c::sm501_2d_operation() is then not seen from 
>>>>>>>>> sm501_update_display() in the same file. The sm501_2d_operation() 
>>>>>>>>> function is called when the guest accesses the emulated card so it 
>>>>>>>>> may run in a different thread than sm501_update_display() which is 
>>>>>>>>> called by the ui backend but I'm not sure how QEMU calls these. Is 
>>>>>>>>> device code running in iothread and display update in main thread? 
>>>>>>>>> The problem is also independent of the display backend and was 
>>>>>>>>> reproduced with both -display cocoa and -display sdl.
>>>>>>>>> 
>>>>>>>>> We have confirmed it's not the pixman routines that 
>>>>>>>>> sm501_2d_operation() uses as the same issue is seen also with QEMU 
>>>>>>>>> 4.x where pixman wasn't used and with all versions up to 7.2 so it's 
>>>>>>>>> also not some bisectable change in QEMU. It also happens with 
>>>>>>>>> --enable-debug so it doesn't seem to be related to optimisation 
>>>>>>>>> either and I don't get it on x86_64 but even x86_64 QEMU builds run 
>>>>>>>>> on Apple M1 with Rosetta 2 show the problem. It also only seems to 
>>>>>>>>> affect graphics written from sm501_2d_operation() which AmigaOS4 
>>>>>>>>> uses extensively but other OSes don't and just render graphics with 
>>>>>>>>> the vcpu which work without problem also on the M1 Macs that show 
>>>>>>>>> this problem with AmigaOS4. Theoretically this could be some missing 
>>>>>>>>> synchronisation which is something ARM and PPC may need while x86 
>>>>>>>>> doesn't but I don't know if this is really the reason and if so 
>>>>>>>>> where and how to fix it). Any idea what may cause this and what 
>>>>>>>>> could be a fix to try?
>>>>>>>> 
>>>>>>>> Any idea anyone? At least some explanation if the above is plausible 
>>>>>>>> or if there's an option to disable the iothread and run everything in 
>>>>>>>> a single thread to verify the theory could help. I've got reports 
>>>>>>>> from at least 3 people getting this problem but I can't do much to 
>>>>>>>> fix it without some help.
>>>>>>>> 
>>>>>>>>> (Info on how to run it is here:
>>>>>>>>> http://zero.eik.bme.hu/~balaton/qemu/amiga/#amigaos
>>>>>>>>> but AmigaOS4 is not freely distributable so it's a bit hard to 
>>>>>>>>> reproduce. Some Linux X servers that support sm501/sm502 may also 
>>>>>>>>> use the card's 2d engine but I don't know about any live CDs that 
>>>>>>>>> readily run on sam460ex.)
>>>>>>>>> 
>>>>>>>>> Thank you,
>>>>>>>>> BALATON Zoltan
>>>>>>> 
>>>>>>> Sorry, I missed the email.
>>>>>>> 
>>>>>>> Indeed the ui backend should call sm501_update_display() in the main 
>>>>>>> thread, which should be different from the thread calling 
>>>>>>> sm501_2d_operation(). However, if I understand it correctly, both of 
>>>>>>> the functions should be called with iothread lock held so there should 
>>>>>>> be no race condition in theory.
>>>>>>> 
>>>>>>> But there is an exception: memory_region_snapshot_and_clear_dirty() 
>>>>>>> releases iothread lock, and that broke raspi3b display device:
>>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>> 
>>>>>>> It is unexpected that gfx_update() callback releases iothread lock so 
>>>>>>> it may break things in peculiar ways.
>>>>>>> 
>>>>>>> Peter, is there any change in the situation regarding the race 
>>>>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>> 
>>>>>>> For now, to workaround the issue, I think you can create another mutex 
>>>>>>> and make the entire sm501_2d_engine_write() and sm501_update_display() 
>>>>>>> critical sections.
>>>>>> 
>>>>>> Interesting thread but not sure it's the same problem so this 
>>>>>> workaround may not be enough to fix my issue. Here's a video posted by 
>>>>>> one of the people who reported it showing the problem on M1 Mac:
>>>>>> 
>>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>> 
>>>>>> and here's how it looks like on other machines:
>>>>>> 
>>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>> 
>>>>>> There are also videos showing it running on RPi 4 and G5 Mac without 
>>>>>> this issue so it seems to only happen on Apple Silicon M1 Macs. What's 
>>>>>> strange is that graphics elements are not just delayed which I think 
>>>>>> should happen with missing thread synchronisation where the update 
>>>>>> callback would miss some pixels rendered during its running but 
>>>>>> subsequent update callbacks would eventually draw those, wouldn't they? 
>>>>>> Also setting full_update to 1 in sm501_update_display() callback to 
>>>>>> disable dirty tracking does not fix the problem. So it looks like as if 
>>>>>> sm501_2d_operation() running on one CPU core only writes data to the 
>>>>>> local cache of that core which sm501_update_display() running on other 
>>>>>> core can't see, so maybe some cache synchronisation is needed in 
>>>>>> memory_region_set_dirty() or if that's already there maybe I should 
>>>>>> call it for all changes not only those in the visible display area? I'm 
>>>>>> still not sure I understand the problem and don't know what could be a 
>>>>>> fix for it so anything to test to identify the issue better might also 
>>>>>> bring us closer to a solution.
>>>>>> 
>>>>>> Regards,
>>>>>> BALATON Zoltan
>>>>> 
>>>>> If you set full_update to 1, you may also comment out 
>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
>>>>> unlocked. The iothread mutex should ensure cache coherency as well.
>>>>> 
>>>>> But as you say, it's weird that the rendered result is not just delayed 
>>>>> but missed. That may imply other possibilities (e.g., the results are 
>>>>> overwritten by someone else). If the problem persists after commenting 
>>>>> out memory_region_snapshot_and_clear_dirty() and 
>>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>>> inter-thread coherency between sm501_2d_operation() and 
>>>>> sm501_update_display() is not causing the problem.
>>>> 
>>>> I've asked people who reported and can reproduce it to test this but it 
>>>> did not change anything so confirmed it's not that race condition but 
>>>> looks more like some cache inconsistency maybe. Any other ideas?
>>>> 
>>>> Regards,
>>>> BALATON Zoltan
>>> 
>>> I can come up with two important differences between x86 and Arm which can 
>>> affect the execution of QEMU:
>>> 1. Memory model. Arm uses a memory model more relaxed than x86 so it is 
>>> more sensitive for synchronization failures among threads.
>>> 2. Different instructions. TCG uses JIT so differences in instructions 
>>> matter.
>>> 
>>> We should be able to exclude 1) as a potential cause of the problem. 
>>> iothread mutex should take care of race condition and even cache coherency 
>>> problem; mutex includes memory barrier functionality.
>> 
>> Where is this barrier in QEMU code? Does this also ensure cache coherency 
>> between different cores or only memory sync in one core? From the testing I 
>> suspect it's probably not because of the weak ordering of ARM but something 
>> to do with different threads writing and reading the memory area. Is there 
>> a way to disable separate vcpu thread and run everything in a single thread 
>> to verify this theory? (We only have one vcpu so it's not an MTTCG issue 
>> but something between the vcpu and main thread maybe.)
>
> QEMU uses pthread_mutex for macOS, and pthread_mutex (or any sane mutex 
> implementation for SMP systems) should also ensure memory synchronization 
> across different cores.
>
> That said, it is still possible that we miss something that prevents memory 
> synchronization. Ideally the theory should be confirmed by experiments, but 
> it is not easy with Mac.
>
> The easiest option is to run QEMU/sam460ex on Linux on QEMU/hvf. Running the 
> entire Linux system without -smp option may be too slow so you may use 
> taskset command on Linux to pin QEMU/sam460ex process to a particular vCPU. 
> This is somewhat incomplete as virtualization interferes with caches and may 
> hide problems or trigger other bugs. The difference of the operating systems is 
> also concerning.
>
> Another option is to use taskset command on Asahi Linux. Installing Asahi 
> Linux is easy, but uninstalling it is a bit complicated.
>
> The m1n1 hypervisor from the Asahi Linux project allows restricting which CPUs 
> are used, and I think it also allows changing the memory model to x86 TSO. 
> Unlike QEMU/hvf on macOS, it is very minimalistic so its interference with e.g. 
> caches is limited. It is very useful for debugging XNU or Linux, but hard to 
> set up and requires another computer to control it.
>
> Finally, you can patch XNU kernel, but this is obviously not easy.

Yes, that is getting too difficult. I don't have an M1 Mac myself, so I rely 
on the users who reported the problem to test, which limits us to something 
simple like trying a QEMU option to disable threads. This was possible in 
the past, but I don't know how to do it now. Does anybody else reading this 
thread know, or should I ask separately so that somebody who knows the 
answer notices?

>>> For difference 2), you may try to use TCI. You can find details of TCI in 
>>> tcg/tci/README.
>> 
>> This was tested and also with TCI got the same results just much slower.
>> 
>>> The common sense tells, however, the memory model is usually the cause of 
>>> the problem when you see behavioral differences between x86 and Arm, and 
>>> TCG should work fine with both of x86 and Arm as they should have been 
>>> tested well.
>> 
>> It's not only between x86 and ARM but also between different ARM CPUs it 
>> seems as there are videos of this test case running on Raspberry Pi 4 but 
>> all QEMU versions failed on Apple M1 so maybe it's something specific to 
>> that CPU.
>
> It is likely that the combination of Apple's microarchitecture and the Arm 
> instruction set causes the problem. For example, even though the memory model 
> in Arm is weaker than in x86, such a difference may not surface depending on 
> the design of the load/store unit or the size of the load/store buffers.
>
> Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, which 
> makes it possible to compare x86 and Arm without concerning the difference of 
> the microarchitecture.

We've tried that before, and even running x86 QEMU on M1 with Rosetta 2 it 
was the same, so it's probably not something about the code itself but how 
it's being run by that CPU. I just don't see how this can fail while it 
works elsewhere. The ati-vga has similar code and other guest OSes can use 
that, so I've asked for more testing with that; maybe it will reveal some 
more details. One difference might be that the sm501 driver in AmigaOS 
uses a 16 bit display, so it may also be something with converting bit 
depths, but I'm not sure.
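
To illustrate what I mean by converting bit depths, here is a minimal 
sketch, assuming the guest framebuffer is RGB565 and the host surface is 
32 bit xRGB. The function name is made up for this example and this is not 
the actual code in sm501.c.

```c
#include <stdint.h>

/*
 * Expand one RGB565 pixel to 32 bit xRGB (0x00RRGGBB).
 * Hypothetical helper for illustration only.
 */
static uint32_t rgb565_to_xrgb8888(uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;   /* 5 bits of red   */
    uint32_t g = (p >> 5) & 0x3f;    /* 6 bits of green */
    uint32_t b = p & 0x1f;           /* 5 bits of blue  */

    /* Replicate the top bits into the low bits so that the maximum
     * channel value maps to 0xff and not to 0xf8 or 0xfc. */
    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);

    return (r << 16) | (g << 8) | b;
}
```

With this, white (0xffff) comes out as 0x00ffffff and pure red (0xf800) as 
0x00ff0000.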

Regards,
BALATON Zoltan



* Re: Display update issue on M1 Macs
  2023-01-31 14:15                 ` BALATON Zoltan
@ 2023-02-02 10:51                   ` BALATON Zoltan
  2023-02-03 10:16                     ` Akihiko Odaki
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-02-02 10:51 UTC (permalink / raw)
  To: Akihiko Odaki; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On Tue, 31 Jan 2023, BALATON Zoltan wrote:
> On Tue, 31 Jan 2023, Akihiko Odaki wrote:
[...]
To summarise previous discussion:

- There's a problem on Apple M1 Macs with sm501 and ati-vga 2d accel 
functions drawing from device model into the video memory of the emulated 
card which is not shown on screen when the display update callback is 
called from another thread. This works on x86_64 host so I suspect it may 
be related to missing memory synchronisation that ARM may need.

- This can be reproduced running AmigaOS4 on sam460ex or MorphOS (demo iso 
downloadable from their web site) on sam460ex, pegasos2 or mac99,via=pmu 
with -device ati-vga,romfile="" as described here: 
http://zero.eik.bme.hu/~balaton/qemu/amiga/

- I can't test it myself lacking hardware so I have to rely on reports 
from people who have this hardware so there may be some uncertainty in 
the info I get.

- We have confirmed it's not related to a known race condition as 
disabling dirty tracking and always doing full updates of whole screen 
did not fix it:

>>>>>>>> But there is an exception: memory_region_snapshot_and_clear_dirty() 
>>>>>>>> releases iothread lock, and that broke raspi3b display device:
>>>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>>> 
>>>>>>>> It is unexpected that gfx_update() callback releases iothread lock so 
>>>>>>>> it may break things in peculiar ways.
>>>>>>>> 
>>>>>>>> Peter, is there any change in the situation regarding the race 
>>>>>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>>> 
>>>>>>>> For now, to workaround the issue, I think you can create another 
>>>>>>>> mutex and make the entire sm501_2d_engine_write() and 
>>>>>>>> sm501_update_display() critical sections.
>>>>>>> 
>>>>>>> Interesting thread but not sure it's the same problem so this 
>>>>>>> workaround may not be enough to fix my issue. Here's a video posted by 
>>>>>>> one of the people who reported it showing the problem on M1 Mac:
>>>>>>> 
>>>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>>> 
>>>>>>> and here's how it looks like on other machines:
>>>>>>> 
>>>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>>> 
>>>>>>> There are also videos showing it running on RPi 4 and G5 Mac without 
>>>>>>> this issue so it seems to only happen on Apple Silicon M1 Macs. What's 
>>>>>>> strange is that graphics elements are not just delayed which I think 
>>>>>>> should happen with missing thread synchronisation where the update 
>>>>>>> callback would miss some pixels rendered while it's running but 
>>>>>>> subsequent update callbacks would eventually draw those, wouldn't they? 
>>>>>>> Also setting full_update to 1 in sm501_update_display() callback to 
>>>>>>> disable dirty tracking does not fix the problem. So it looks like as 
>>>>>>> if sm501_2d_operation() running on one CPU core only writes data to 
>>>>>>> the local cache of that core which sm501_update_display() running on 
>>>>>>> other core can't see, so maybe some cache synchronisation is needed in 
>>>>>>> memory_region_set_dirty() or if that's already there maybe I should 
>>>>>>> call it for all changes not only those in the visible display area? 
>>>>>>> I'm still not sure I understand the problem and don't know what could 
>>>>>>> be a fix for it so anything to test to identify the issue better might 
>>>>>>> also bring us closer to a solution.
>>>>>> 
>>>>>> If you set full_update to 1, you may also comment out 
>>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
>>>>>> unlocked. The iothread mutex should ensure cache coherency as well.
>>>>>> 
>>>>>> But as you say, it's weird that the rendered result is not just delayed 
>>>>>> but missed. That may imply other possibilities (e.g., the results are 
>>>>>> overwritten by someone else). If the problem persists after commenting 
>>>>>> out memory_region_snapshot_and_clear_dirty() and 
>>>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>>>> inter-thread coherency between sm501_2d_operation() and 
>>>>>> sm501_update_display() is not causing the problem.
>>>>> 
>>>>> I've asked people who reported and can reproduce it to test this but it 
>>>>> did not change anything so confirmed it's not that race condition but 
>>>>> looks more like some cache inconsistency maybe. Any other ideas?
>>>> 
>>>> I can come up with two important differences between x86 and Arm which 
>>>> can affect the execution of QEMU:
>>>> 1. Memory model. Arm uses a memory model more relaxed than x86 so it is 
>>>> more sensitive for synchronization failures among threads.
>>>> 2. Different instructions. TCG uses JIT so differences in instructions 
>>>> matter.
>>>> 
>>>> We should be able to exclude 1) as a potential cause of the problem. 
>>>> iothread mutex should take care of race condition and even cache 
>>>> coherency problem; mutex includes memory barrier functionality.
[...]
>>>> For difference 2), you may try to use TCI. You can find details of TCI in 
>>>> tcg/tci/README.
>>> 
>>> This was tested and also with TCI got the same results just much slower.
>>> 
>>>> The common sense tells, however, the memory model is usually the cause of 
>>>> the problem when you see behavioral differences between x86 and Arm, and 
>>>> TCG should work fine with both of x86 and Arm as they should have been 
>>>> tested well.
[...]
>> Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, which 
>> makes it possible to compare x86 and Arm without concerning the difference 
>> of the microarchitecture.
>
> We've tried that before and even running x86 QEMU on M1 with Rosetta 2 it was 
> the same so it's probably not something about the code itself but how it's

As this was odd I've asked to re-test this and now I'm told at least the QEMU 
5.1 x86_64 build from emaculation.com is working with Rosetta on M1 Mac, so 
this suggests it may be a problem with memory sync, but I still don't know 
where and what to try. We're now trying newer x86_64 builds to see if it 
broke somewhere along the way.

Does anybody else with an M1 Mac want to help testing? Can you reproduce the 
same with UTM with MorphOS and ati-vga? Here's what I've got showing the 
problem: https://www.youtube.com/watch?v=j5Ag5_Yq-Mk

Regards,
BALATON Zoltan



* Re: Display update issue on M1 Macs
  2023-02-02 10:51                   ` BALATON Zoltan
@ 2023-02-03 10:16                     ` Akihiko Odaki
  2023-02-03 13:45                       ` BALATON Zoltan
  0 siblings, 1 reply; 14+ messages in thread
From: Akihiko Odaki @ 2023-02-03 10:16 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On 2023/02/02 19:51, BALATON Zoltan wrote:
> On Tue, 31 Jan 2023, BALATON Zoltan wrote:
>> On Tue, 31 Jan 2023, Akihiko Odaki wrote:
> [...]
> To summarise previous discussion:
> 
> - There's a problem on Apple M1 Macs with sm501 and ati-vga 2d accel 
> functions drawing from device model into the video memory of the 
> emulated card which is not shown on screen when the display update 
> callback is called from another thread. This works on x86_64 host so I 
> suspect it may be related to missing memory synchronisation that ARM may 
> need.
> 
> - This can be reproduced running AmigaOS4 on sam460ex or MorphOS (demo 
> iso downloadable from their web site) on sam460ex, pegasos2 or 
> mac99,via=pmu with -device ati-vga,romfile="" as described here: 
> http://zero.eik.bme.hu/~balaton/qemu/amiga/
> 
> - I can't test it myself lacking hardware so I have to rely on reports 
> from people who have this hardware so there may be some uncertainty in 
> the info I get.
> 
> - We have confirmed it's not related to a known race condition as 
> disabling dirty tracking and always doing full updates of whole screen 
> did not fix it:
> 
>>>>>>>>> But there is an exception: 
>>>>>>>>> memory_region_snapshot_and_clear_dirty() releases iothread 
>>>>>>>>> lock, and that broke raspi3b display device:
>>>>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>>>>
>>>>>>>>> It is unexpected that gfx_update() callback releases iothread 
>>>>>>>>> lock so it may break things in peculiar ways.
>>>>>>>>>
>>>>>>>>> Peter, is there any change in the situation regarding the race 
>>>>>>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>>>>
>>>>>>>>> For now, to workaround the issue, I think you can create 
>>>>>>>>> another mutex and make the entire sm501_2d_engine_write() and 
>>>>>>>>> sm501_update_display() critical sections.
>>>>>>>>
>>>>>>>> Interesting thread but not sure it's the same problem so this 
>>>>>>>> workaround may not be enough to fix my issue. Here's a video 
>>>>>>>> posted by one of the people who reported it showing the problem 
>>>>>>>> on M1 Mac:
>>>>>>>>
>>>>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>>>>
>>>>>>>> and here's how it looks like on other machines:
>>>>>>>>
>>>>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>>>>
>>>>>>>> There are also videos showing it running on RPi 4 and G5 Mac 
>>>>>>>> without this issue so it seems to only happen on Apple Silicon 
>>>>>>>> M1 Macs. What's strange is that graphics elements are not just 
>>>>>>>> delayed which I think should happen with missing thread 
>>>>>>>> synchronisation where the update callback would miss some pixels 
>>>>>>>> rendered while it's running but subsequent update callbacks 
>>>>>>>> would eventually draw those, wouldn't they? Also setting 
>>>>>>>> full_update to 1 in sm501_update_display() callback to disable 
>>>>>>>> dirty tracking does not fix the problem. So it looks like as if 
>>>>>>>> sm501_2d_operation() running on one CPU core only writes data to 
>>>>>>>> the local cache of that core which sm501_update_display() 
>>>>>>>> running on other core can't see, so maybe some cache 
>>>>>>>> synchronisation is needed in memory_region_set_dirty() or if 
>>>>>>>> that's already there maybe I should call it for all changes not 
>>>>>>>> only those in the visible display area? I'm still not sure I 
>>>>>>>> understand the problem and don't know what could be a fix for it 
>>>>>>>> so anything to test to identify the issue better might also 
>>>>>>>> bring us closer to a solution.
>>>>>>>
>>>>>>> If you set full_update to 1, you may also comment out 
>>>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex 
>>>>>>> being unlocked. The iothread mutex should ensure cache coherency 
>>>>>>> as well.
>>>>>>>
>>>>>>> But as you say, it's weird that the rendered result is not just 
>>>>>>> delayed but missed. That may imply other possibilities (e.g., the 
>>>>>>> results are overwritten by someone else). If the problem persists 
>>>>>>> after commenting out memory_region_snapshot_and_clear_dirty() and 
>>>>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>>>>> inter-thread coherency between sm501_2d_operation() and 
>>>>>>> sm501_update_display() is not causing the problem.
>>>>>>
>>>>>> I've asked people who reported and can reproduce it to test this 
>>>>>> but it did not change anything so confirmed it's not that race 
>>>>>> condition but looks more like some cache inconsistency maybe. Any 
>>>>>> other ideas?
>>>>>
>>>>> I can come up with two important differences between x86 and Arm 
>>>>> which can affect the execution of QEMU:
>>>>> 1. Memory model. Arm uses a memory model more relaxed than x86 so 
>>>>> it is more sensitive for synchronization failures among threads.
>>>>> 2. Different instructions. TCG uses JIT so differences in 
>>>>> instructions matter.
>>>>>
>>>>> We should be able to exclude 1) as a potential cause of the 
>>>>> problem. iothread mutex should take care of race condition and even 
>>>>> cache coherency problem; mutex includes memory barrier functionality.
> [...]
>>>>> For difference 2), you may try to use TCI. You can find details of 
>>>>> TCI in tcg/tci/README.
>>>>
>>>> This was tested and also with TCI got the same results just much 
>>>> slower.
>>>>
>>>>> The common sense tells, however, the memory model is usually the 
>>>>> cause of the problem when you see behavioral differences between 
>>>>> x86 and Arm, and TCG should work fine with both of x86 and Arm as 
>>>>> they should have been tested well.
> [...]
>>> Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, 
>>> which makes it possible to compare x86 and Arm without concerning the 
>>> difference of the microarchitecture.
>>
>> We've tried that before and even running x86 QEMU on M1 with Rosetta 2 
>> it was the same so it's probably not something about the code itself 
>> but how it's
> 
> As this was odd I've asked to re-test this and now I'm told at least 
> QEMU 5.1 x86_64 build from emaculation.com is working with Rosetta on M1 
> Mac so this suggests it may be a problem with memory sync but still 
> don't know where and what to try. We're now trying newer x86_64 builds to 
> see if it broke somewhere along the way.
> 
> Anybody else with an M1 Mac wants to help testing? Can you reproduce the 
> same with UTM with MorphOS and ati-vga? Here's what I've got showing the 
> problem: https://www.youtube.com/watch?v=j5Ag5_Yq-Mk
> 
> Regards,
> BALATON Zoltan

Hi,

I finally reproduced the issue with MorphOS and ati-vga and figured out 
its cause.

The problem is that pixman_blt() is disabled because its backend is 
written in GNU assembly, and GNU assembler is not available on macOS. 
There is no fallback written in C, unfortunately. The issue is tracked 
by the upstream at:
https://gitlab.freedesktop.org/pixman/pixman/-/issues/59

I hit the same problem on Asahi Linux, which is based on Arch Linux ARM. 
It is because Arch Linux copied PKGBUILD from x86 Arch Linux, which 
disables Arm backends. It is easy to enable the backend for the platform 
so I proposed a change at:
https://github.com/archlinuxarm/PKGBUILDs/pull/1985

Regards,
Akihiko Odaki



* Re: Display update issue on M1 Macs
  2023-02-03 10:16                     ` Akihiko Odaki
@ 2023-02-03 13:45                       ` BALATON Zoltan
  2023-02-04  5:19                         ` Akihiko Odaki
  0 siblings, 1 reply; 14+ messages in thread
From: BALATON Zoltan @ 2023-02-03 13:45 UTC (permalink / raw)
  To: Akihiko Odaki; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On Fri, 3 Feb 2023, Akihiko Odaki wrote:
> On 2023/02/02 19:51, BALATON Zoltan wrote:
>> On Tue, 31 Jan 2023, BALATON Zoltan wrote:
>>> On Tue, 31 Jan 2023, Akihiko Odaki wrote:
>> [...]
>> To summarise previous discussion:
>> 
>> - There's a problem on Apple M1 Macs with sm501 and ati-vga 2d accel 
>> functions drawing from device model into the video memory of the emulated 
>> card which is not shown on screen when the display update callback is 
>> called from another thread. This works on x86_64 host so I suspect it may 
>> be related to missing memory synchronisation that ARM may need.
>> 
>> - This can be reproduced running AmigaOS4 on sam460ex or MorphOS (demo iso 
>> downloadable from their web site) on sam460ex, pegasos2 or mac99,via=pmu 
>> with -device ati-vga,romfile="" as described here: 
>> http://zero.eik.bme.hu/~balaton/qemu/amiga/
>> 
>> - I can't test it myself lacking hardware so I have to rely on reports from 
>> people who have this hardware so there may be some uncertainty in the info 
>> I get.
>> 
>> - We have confirmed it's not related to a known race condition as disabling 
>> dirty tracking and always doing full updates of whole screen did not fix 
>> it:
>> 
>>>>>>>>>> But there is an exception: memory_region_snapshot_and_clear_dirty() 
>>>>>>>>>> releases iothread lock, and that broke raspi3b display device:
>>>>>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>>>>> 
>>>>>>>>>> It is unexpected that gfx_update() callback releases iothread lock 
>>>>>>>>>> so it may break things in peculiar ways.
>>>>>>>>>> 
>>>>>>>>>> Peter, is there any change in the situation regarding the race 
>>>>>>>>>> introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>>>>> 
>>>>>>>>>> For now, to workaround the issue, I think you can create another 
>>>>>>>>>> mutex and make the entire sm501_2d_engine_write() and 
>>>>>>>>>> sm501_update_display() critical sections.
>>>>>>>>> 
>>>>>>>>> Interesting thread but not sure it's the same problem so this 
>>>>>>>>> workaround may not be enough to fix my issue. Here's a video posted 
>>>>>>>>> by one of the people who reported it showing the problem on M1 Mac:
>>>>>>>>> 
>>>>>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>>>>> 
>>>>>>>>> and here's how it looks like on other machines:
>>>>>>>>> 
>>>>>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>>>>> 
>>>>>>>>> There are also videos showing it running on RPi 4 and G5 Mac without 
>>>>>>>>> this issue so it seems to only happen on Apple Silicon M1 Macs. 
>>>>>>>>> What's strange is that graphics elements are not just delayed which 
>>>>>>>>> I think should happen with missing thread synchronisation where the 
>>>>>>>>> update callback would miss some pixels rendered while it's running 
>>>>>>>>> but subsequent update callbacks would eventually draw those, wouldn't 
>>>>>>>>> they? Also setting full_update to 1 in sm501_update_display() 
>>>>>>>>> callback to disable dirty tracking does not fix the problem. So it 
>>>>>>>>> looks like as if sm501_2d_operation() running on one CPU core only 
>>>>>>>>> writes data to the local cache of that core which 
>>>>>>>>> sm501_update_display() running on other core can't see, so maybe 
>>>>>>>>> some cache synchronisation is needed in memory_region_set_dirty() or 
>>>>>>>>> if that's already there maybe I should call it for all changes not 
>>>>>>>>> only those in the visible display area? I'm still not sure I 
>>>>>>>>> understand the problem and don't know what could be a fix for it so 
>>>>>>>>> anything to test to identify the issue better might also bring us 
>>>>>>>>> closer to a solution.
>>>>>>>> 
>>>>>>>> If you set full_update to 1, you may also comment out 
>>>>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex being 
>>>>>>>> unlocked. The iothread mutex should ensure cache coherency as well.
>>>>>>>> 
>>>>>>>> But as you say, it's weird that the rendered result is not just 
>>>>>>>> delayed but missed. That may imply other possibilities (e.g., the 
>>>>>>>> results are overwritten by someone else). If the problem persists 
>>>>>>>> after commenting out memory_region_snapshot_and_clear_dirty() and 
>>>>>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>>>>>> inter-thread coherency between sm501_2d_operation() and 
>>>>>>>> sm501_update_display() is not causing the problem.
>>>>>>> 
>>>>>>> I've asked people who reported and can reproduce it to test this but 
>>>>>>> it did not change anything so confirmed it's not that race condition 
>>>>>>> but looks more like some cache inconsistency maybe. Any other ideas?
>>>>>> 
>>>>>> I can come up with two important differences between x86 and Arm which 
>>>>>> can affect the execution of QEMU:
>>>>>> 1. Memory model. Arm uses a memory model more relaxed than x86 so it is 
>>>>>> more sensitive for synchronization failures among threads.
>>>>>> 2. Different instructions. TCG uses JIT so differences in instructions 
>>>>>> matter.
>>>>>> 
>>>>>> We should be able to exclude 1) as a potential cause of the problem. 
>>>>>> iothread mutex should take care of race condition and even cache 
>>>>>> coherency problem; mutex includes memory barrier functionality.
>> [...]
>>>>>> For difference 2), you may try to use TCI. You can find details of TCI 
>>>>>> in tcg/tci/README.
>>>>> 
>>>>> This was tested and also with TCI got the same results just much slower.
>>>>> 
>>>>>> The common sense tells, however, the memory model is usually the cause 
>>>>>> of the problem when you see behavioral differences between x86 and Arm, 
>>>>>> and TCG should work fine with both of x86 and Arm as they should have 
>>>>>> been tested well.
>> [...]
>>>> Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, which 
>>>> makes it possible to compare x86 and Arm without concerning the 
>>>> difference of the microarchitecture.
>>> 
>>> We've tried that before and even running x86 QEMU on M1 with Rosetta 2 it 
>>> was the same so it's probably not something about the code itself but how 
>>> it's
>> 
>> As this was odd I've asked to re-test this and now I'm told at least QEMU 
>> 5.1 x86_64 build from emaculation.com is working with Rosetta on M1 Mac so 
>> this suggests it may be a problem with memory sync but still don't know 
>> where and what to try. We're now trying newer x86_64 builds to see if it broke 
>> somewhere along the way.
>> 
>> Anybody else with an M1 Mac wants to help testing? Can you reproduce the 
>> same with UTM with MorphOS and ati-vga? Here's what I've got showing the 
>> problem: https://www.youtube.com/watch?v=j5Ag5_Yq-Mk
>> 
>> Regards,
>> BALATON Zoltan
>
> Hi,
>
> I finally reproduced the issue with MorphOS and ati-vga and figured out its 
> cause.

Great, thanks a lot. After establishing that it works with the x86 version, 
we were about to test with aarch64 QEMU 5.0, where sm501 did not yet use 
pixman but ati-vga did, so we could check whether it's related to pixman, 
as the previous test results with old versions were all wrong, it seems. 
But you were faster.

> The problem is that pixman_blt() is disabled because its backend is written 
> in GNU assembly, and GNU assembler is not available on macOS. There is no 
> fallback written in C, unfortunately. The issue is tracked by the upstream 
> at:
> https://gitlab.freedesktop.org/pixman/pixman/-/issues/59

Hm, OK, but that ticket is just about a compile error and suggests 
disabling it, and does not say that it won't work then. Are they aware 
this is a problem? Maybe we should write to their mailing list after we're 
sure what's happening.

> I hit the same problem on Asahi Linux, which is based on Arch Linux ARM. It 
> is because Arch Linux copied PKGBUILD from x86 Arch Linux, which disables Arm 
> backends. It is easy to enable the backend for the platform so I proposed a 
> change at:
> https://github.com/archlinuxarm/PKGBUILDs/pull/1985

On macOS, one source of pixman most people use is Homebrew (brew.sh), where 
this seems to be disabled:

https://github.com/Homebrew/homebrew-core/blob/master/Formula/pixman.rb

Another source is MacPorts, which has an older version and no such options:

https://github.com/macports/macports-ports/blob/master/graphics/libpixman-devel/Portfile

I wonder whether it compiles from MacPorts on aarch64 then.

I'll wait to see if I can get some more test results and try to check 
pixman, but its source is not too clear to me and there are no docs either, 
so maybe the best way is to ask on their list. If this is a pixman issue I 
hope it can be fixed there and we don't need to implement a fallback in QEMU.
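
In case a fallback in QEMU turns out to be needed after all, it could look 
roughly like the following. This is only a sketch: try_accel_blt() is a 
hypothetical stand-in for pixman_blt() that always fails, on the assumption 
that pixman_blt() returns false when no backend handles the blit, and the 
struct and the other function names are made up as well. It also assumes 
that the source and destination rectangles don't overlap across rows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parameters of one rectangle copy (hypothetical, for illustration). */
struct blt_args {
    const uint8_t *src;
    uint8_t *dst;
    int src_stride, dst_stride;     /* bytes per row */
    int bytes_per_pixel;
    int src_x, src_y, dst_x, dst_y; /* in pixels */
    int width, height;              /* in pixels */
};

/*
 * Hypothetical stand-in for pixman_blt(). It always reports failure, which
 * is what this sketch assumes happens when no accelerated backend is built.
 */
static bool try_accel_blt(const struct blt_args *a)
{
    (void)a;
    return false;
}

/* Plain C copy, one row at a time. */
static void fallback_blt(const struct blt_args *a)
{
    assert(a->bytes_per_pixel > 0 && a->width >= 0 && a->height >= 0);

    size_t row_bytes = (size_t)a->width * a->bytes_per_pixel;

    for (int y = 0; y < a->height; y++) {
        const uint8_t *s = a->src
                           + (size_t)(a->src_y + y) * a->src_stride
                           + (size_t)a->src_x * a->bytes_per_pixel;
        uint8_t *d = a->dst
                     + (size_t)(a->dst_y + y) * a->dst_stride
                     + (size_t)a->dst_x * a->bytes_per_pixel;

        /* memmove() tolerates overlap within a single row only. */
        memmove(d, s, row_bytes);
    }
}

/* Try the accelerated path first and copy in C if it is refused. */
static void blit_rect(const struct blt_args *a)
{
    if (!try_accel_blt(a)) {
        fallback_blt(a);
    }
}
```

The point is only that the return value gets checked, so a refused blit is 
not silently dropped.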

Regards,
BALATON Zoltan



* Re: Display update issue on M1 Macs
  2023-02-03 13:45                       ` BALATON Zoltan
@ 2023-02-04  5:19                         ` Akihiko Odaki
  0 siblings, 0 replies; 14+ messages in thread
From: Akihiko Odaki @ 2023-02-04  5:19 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Peter Maydell, qemu-devel, Gerd Hoffmann, Joelle van Dyne

On 2023/02/03 22:45, BALATON Zoltan wrote:
> On Fri, 3 Feb 2023, Akihiko Odaki wrote:
>> On 2023/02/02 19:51, BALATON Zoltan wrote:
>>> On Tue, 31 Jan 2023, BALATON Zoltan wrote:
>>>> On Tue, 31 Jan 2023, Akihiko Odaki wrote:
>>> [...]
>>> To summarise previous discussion:
>>>
>>> - There's a problem on Apple M1 Macs with sm501 and ati-vga 2d accel 
>>> functions drawing from device model into the video memory of the 
>>> emulated card which is not shown on screen when the display update 
>>> callback is called from another thread. This works on x86_64 host so 
>>> I suspect it may be related to missing memory synchronisation that 
>>> ARM may need.
>>>
>>> - This can be reproduced running AmigaOS4 on sam460ex or MorphOS 
>>> (demo iso downloadable from their web site) on sam460ex, pegasos2 or 
>>> mac99,via=pmu with -device ati-vga,romfile="" as described here: 
>>> http://zero.eik.bme.hu/~balaton/qemu/amiga/
>>>
>>> - I can't test it myself lacking hardware so I have to rely on 
>>> reports from people who have this hardware so there may be some 
>>> uncertainty in the info I get.
>>>
>>> - We have confirmed it's not related to a known race condition as 
>>> disabling dirty tracking and always doing full updates of whole 
>>> screen did not fix it:
>>>
>>>>>>>>>>> But there is an exception: 
>>>>>>>>>>> memory_region_snapshot_and_clear_dirty() releases iothread 
>>>>>>>>>>> lock, and that broke raspi3b display device:
>>>>>>>>>>> https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=Wn+k8dQneB_ynQ@mail.gmail.com/T/
>>>>>>>>>>>
>>>>>>>>>>> It is unexpected that gfx_update() callback releases iothread 
>>>>>>>>>>> lock so it may break things in peculiar ways.
>>>>>>>>>>>
>>>>>>>>>>> Peter, is there any change in the situation regarding the 
>>>>>>>>>>> race introduced by memory_region_snapshot_and_clear_dirty()?
>>>>>>>>>>>
>>>>>>>>>>> For now, to workaround the issue, I think you can create 
>>>>>>>>>>> another mutex and make the entire sm501_2d_engine_write() and 
>>>>>>>>>>> sm501_update_display() critical sections.
>>>>>>>>>>
>>>>>>>>>> Interesting thread but not sure it's the same problem so this 
>>>>>>>>>> workaround may not be enough to fix my issue. Here's a video 
>>>>>>>>>> posted by one of the people who reported it showing the 
>>>>>>>>>> problem on M1 Mac:
>>>>>>>>>>
>>>>>>>>>> https://www.youtube.com/watch?v=FDqoNbp6PQs
>>>>>>>>>>
>>>>>>>>>> and here's how it looks like on other machines:
>>>>>>>>>>
>>>>>>>>>> https://www.youtube.com/watch?v=ML7-F4HNFKQ
>>>>>>>>>>
>>>>>>>>>> There are also videos showing it running on RPi 4 and G5 Mac 
>>>>>>>>>> without this issue so it seems to only happen on Apple Silicon 
>>>>>>>>>> M1 Macs. What's strange is that graphics elements are not just 
>>>>>>>>>> delayed which I think should happen with missing thread 
>>>>>>>>>> synchronisation where the update callback would miss some 
>>>>>>>>>> pixels rendered while it's running but subsequent update 
>>>>>>>>>> callbacks would eventually draw those, wouldn't they? Also 
>>>>>>>>>> setting full_update to 1 in sm501_update_display() callback to 
>>>>>>>>>> disable dirty tracking does not fix the problem. So it looks 
>>>>>>>>>> like as if sm501_2d_operation() running on one CPU core only 
>>>>>>>>>> writes data to the local cache of that core which 
>>>>>>>>>> sm501_update_display() running on other core can't see, so 
>>>>>>>>>> maybe some cache synchronisation is needed in 
>>>>>>>>>> memory_region_set_dirty() or if that's already there maybe I 
>>>>>>>>>> should call it for all changes not only those in the visible 
>>>>>>>>>> display area? I'm still not sure I understand the problem and 
>>>>>>>>>> don't know what could be a fix for it so anything to test to 
>>>>>>>>>> identify the issue better might also bring us closer to a 
>>>>>>>>>> solution.
>>>>>>>>>
>>>>>>>>> If you set full_update to 1, you may also comment out 
>>>>>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>>>>>> memory_region_snapshot_get_dirty() to avoid the iothread mutex 
>>>>>>>>> being unlocked. The iothread mutex should ensure cache 
>>>>>>>>> coherency as well.
>>>>>>>>>
>>>>>>>>> But as you say, it's weird that the rendered result is not just 
>>>>>>>>> delayed but missed. That may imply other possibilities (e.g., 
>>>>>>>>> the results are overwritten by someone else). If the problem 
>>>>>>>>> persists after commenting out 
>>>>>>>>> memory_region_snapshot_and_clear_dirty() and 
>>>>>>>>> memory_region_snapshot_get_dirty(), I think you can assume the 
>>>>>>>>> inter-thread coherency between sm501_2d_operation() and 
>>>>>>>>> sm501_update_display() is not causing the problem.
>>>>>>>>
>>>>>>>> I've asked people who reported it and can reproduce it to test 
>>>>>>>> this, but it did not change anything. So we've confirmed it's not 
>>>>>>>> that race condition; it looks more like some cache inconsistency, 
>>>>>>>> maybe. Any other ideas?
>>>>>>>
>>>>>>> I can come up with two important differences between x86 and Arm 
>>>>>>> which can affect the execution of QEMU:
>>>>>>> 1. Memory model. Arm uses a more relaxed memory model than x86, so 
>>>>>>> it is more sensitive to synchronization failures among threads.
>>>>>>> 2. Different instructions. TCG uses JIT so differences in 
>>>>>>> instructions matter.
>>>>>>>
>>>>>>> We should be able to exclude 1) as a potential cause of the 
>>>>>>> problem. The iothread mutex should take care of race conditions 
>>>>>>> and even cache coherency problems; a mutex includes memory barrier 
>>>>>>> functionality.
>>> [...]
>>>>>>> For difference 2), you may try to use TCI. You can find details 
>>>>>>> of TCI in tcg/tci/README.
>>>>>>
>>>>>> This was tested, and with TCI we got the same results, just much 
>>>>>> slower.
>>>>>>
>>>>>>> Common sense says, however, that the memory model is usually the 
>>>>>>> cause of the problem when you see behavioral differences between 
>>>>>>> x86 and Arm, and TCG should work fine on both x86 and Arm as 
>>>>>>> they should have been tested well.
>>> [...]
>>>>> Fortunately macOS provides Rosetta 2 for x86 emulation on Apple M1, 
>>>>> which makes it possible to compare x86 and Arm without worrying 
>>>>> about differences in the microarchitecture.
>>>>
>>>> We've tried that before and even running x86 QEMU on M1 with Rosetta 
>>>> 2 it was the same so it's probably not something about the code 
>>>> itself but how it's
>>>
>>> As this was odd, I've asked to re-test this, and now I'm told that at 
>>> least the QEMU 5.1 x86_64 build from emaculation.com is working with 
>>> Rosetta on M1 Mac. This suggests it may be a problem with memory sync, 
>>> but I still don't know where, or what to try. We're now trying newer 
>>> x86_64 builds to see if it broke somewhere along the way.
>>>
>>> Does anybody else with an M1 Mac want to help with testing? Can you 
>>> reproduce the same with UTM with MorphOS and ati-vga? Here's what I've 
>>> got showing the problem: https://www.youtube.com/watch?v=j5Ag5_Yq-Mk
>>>
>>> Regards,
>>> BALATON Zoltan
>>
>> Hi,
>>
>> I finally reproduced the issue with MorphOS and ati-vga and figured 
>> out its cause.
> 
> Great, thanks a lot. After establishing that it works with the x86 
> version, we were about to test with aarch64 QEMU 5.0, where sm501 did 
> not yet use pixman but ati-vga did, so we could check whether it's 
> related to pixman, as previous test results with old versions were all 
> wrong, it seems. But you were faster.
> 
>> The problem is that pixman_blt() is disabled because its backend is 
>> written in GNU assembly, and GNU assembler is not available on macOS. 
>> There is no fallback written in C, unfortunately. The issue is tracked 
>> by the upstream at:
>> https://gitlab.freedesktop.org/pixman/pixman/-/issues/59
> 
> Hm, OK, but that ticket is just about a compile error and suggests 
> disabling it; it does not say it won't work then. Are they aware this 
> is a problem? Maybe we should write to their mailing list after we're 
> sure what's happening.

That's a good idea. They may prioritize the issue if they realize that 
the workaround disables pixman_blt().

> 
>> I hit the same problem on Asahi Linux, which is based on Arch Linux 
>> ARM. It is because Arch Linux copied PKGBUILD from x86 Arch Linux, 
>> which disables Arm backends. It is easy to enable the backend for the 
>> platform so I proposed a change at:
>> https://github.com/archlinuxarm/PKGBUILDs/pull/1985
> 
> On macOS one source of pixman most people use is brew.sh where this 
> seems to be disabled:
> 
> https://github.com/Homebrew/homebrew-core/blob/master/Formula/pixman.rb
> 
> another source is macports which has an older version and no such options:
> 
> https://github.com/macports/macports-ports/blob/master/graphics/libpixman-devel/Portfile
> 
> I wonder if it compiles from macports on aarch64 then.

It's more likely that it is just outdated. It does not carry a patch to 
fix the issue.

> 
> I'll wait to see if I can get some more test results and try to check 
> pixman, but its source is not too clear to me and there are no docs 
> either, so maybe the best way is to ask on their list. If this is a 
> pixman issue I hope it can be fixed there and we don't need to 
> implement a fallback in QEMU.

This is certainly a pixman issue.

If you read the source, you can see pixman_blt() calls 
_pixman_implementation_blt(), which in turn calls the blt member of 
pixman_implementation_t. Grepping for "blt =" shows it is only assigned 
in:
pixman/pixman-arm-neon.c
pixman/pixman-arm-simd.c
pixman/pixman-mips-dspr2.c
pixman/pixman-mmx.c
pixman/pixman-sse2.c

For AArch64, only pixman/pixman-arm-neon.c is relevant, and it needs to 
be disabled to build the library on macOS.
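
To illustrate the failure mode, here is a minimal sketch of the dispatch 
described above. It is not pixman's actual code: impl_t, blt_args, 
impl_blt(), c_blt() and blt_with_fallback() are hypothetical names made 
up for this example, and it assumes 32 bpp surfaces that do not overlap. 
The point is that when no backend in the chain assigns blt, the 
dispatcher can only report failure, and the caller has to notice that 
and copy the rectangle itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are NOT pixman's real types. */
typedef struct {
    uint32_t *src, *dst;
    int src_stride, dst_stride;   /* row pitch in uint32_t units */
    int src_x, src_y, dst_x, dst_y;
    int width, height;            /* rectangle size in 32 bpp pixels */
} blt_args;

typedef struct impl impl_t;
struct impl {
    bool (*blt)(const blt_args *a);  /* NULL if this backend has no blit */
    impl_t *fallback;                /* next, more generic implementation */
};

/* Walk the chain and report failure if no backend can do the blit. */
static bool impl_blt(const impl_t *imp, const blt_args *a)
{
    for (; imp != NULL; imp = imp->fallback) {
        if (imp->blt != NULL && imp->blt(a)) {
            return true;
        }
    }
    return false;
}

/* Plain C copy, one row at a time; assumes src and dst do not overlap. */
static void c_blt(const blt_args *a)
{
    assert(a->width >= 0 && a->height >= 0);
    for (int y = 0; y < a->height; y++) {
        const uint32_t *s = a->src
            + (size_t)(a->src_y + y) * a->src_stride + a->src_x;
        uint32_t *d = a->dst
            + (size_t)(a->dst_y + y) * a->dst_stride + a->dst_x;
        memcpy(d, s, (size_t)a->width * sizeof(uint32_t));
    }
}

/* Returns true if a backend did the blit, false if the C loop was used. */
bool blt_with_fallback(const impl_t *imp, const blt_args *a)
{
    if (impl_blt(imp, a)) {
        return true;
    }
    c_blt(a);
    return false;
}
```

The real pixman_blt() also returns a pixman_bool_t, so a caller that 
ignores the result will silently lose the blit when every backend that 
provides blt is compiled out. Whether such a fallback belongs in pixman 
or in the caller is a separate question.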

Regards,
Akihiko Odaki

> 
> Regards,
> BALATON Zoltan



Thread overview: 14+ messages
2023-01-04 23:24 Display update issue on M1 Macs BALATON Zoltan
2023-01-13 13:43 ` BALATON Zoltan
2023-01-14  2:41   ` Akihiko Odaki
2023-01-14 18:11     ` BALATON Zoltan
2023-01-19 13:10       ` Akihiko Odaki
2023-01-22 23:28         ` BALATON Zoltan
2023-01-28  4:01           ` Akihiko Odaki
2023-01-30 23:58             ` BALATON Zoltan
2023-01-31  7:37               ` Akihiko Odaki
2023-01-31 14:15                 ` BALATON Zoltan
2023-02-02 10:51                   ` BALATON Zoltan
2023-02-03 10:16                     ` Akihiko Odaki
2023-02-03 13:45                       ` BALATON Zoltan
2023-02-04  5:19                         ` Akihiko Odaki
