* notdirty_write thrashing in simple for loop
@ 2021-05-18  9:54 Mark Watson
  2021-05-23 13:41 ` BALATON Zoltan
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Watson @ 2021-05-18  9:54 UTC (permalink / raw)
  To: qemu-devel

Hi,

I'm trying to implement my own machine for Amiga emulation using a software
CPU and FPGA hardware. For this I have built my own machine, which consists
of a large malloc'ed RAM block and some FPGA hardware mmapped elsewhere into
the memory space.

I'm using QEMU to emulate a 68040 on an ARM Cortex-A9 host in system mode.

It is working, though I'm investigating a strange performance issue.

I'm looking for advice on where to look next in debugging this from the
specialist(s) of accel/tcg/cputlb.c please.

To investigate the performance issue I tried to break it down to the
simplest possible case. I can reproduce it with a simple for loop (compiled
without optimisation).
        for (int i = 0; i != 0xffffff; ++i)
        {
            if ((i & 0xffff) == 0)
            {
            }
        }
Running it in user mode on the same host it takes ~0.6 seconds. In the
built-in 'virtual' m68k machine running linux it takes 1.3 seconds.
However in my machine under AmigaOS I'm seeing it typically take five and a
half minutes! Occasionally it seems to run at the correct speed of <2
seconds, though I have yet to identify why. These are the logs of the
captured code before it goes into the main chain loop.
qemu_slow_stuck_fragment.log
<http://www.64kib.com/qemu_slow_stuck_fragment.log>
I have verified that this performance change is not due to slow fpga memory
area access, i.e. there are no accesses to that memory region during this.

I took a look in gdb while running this loop to see what is going on.
Initially I was surprised that I didn't find the code in 'OUT:', however I
guess it makes sense that it has to call into the framework for memory
access. I noticed that a lot of calls to glib are made and see
g_tree_lookup called a lot. This is caused by notdirty_write being called
thousands of times, each time going into page_collection_lock and
tb_invalidate_phys_page_fast. I presume this is happening each time that
"i" is incremented on the stack, which clearly has a huge overhead.

Even being able to get a proper stack trace from gdb would be very helpful
to understand this. I tried to configure qemu with '--enable-debug' but
still do not get a proper stack if I attach to it. I'm not sure if this is
the case due to it running dynamically compiled code before calling into
this.

Thanks,
Mark


* Re: notdirty_write thrashing in simple for loop
  2021-05-18  9:54 notdirty_write thrashing in simple for loop Mark Watson
@ 2021-05-23 13:41 ` BALATON Zoltan
  2021-05-23 16:07   ` Mark Watson
  0 siblings, 1 reply; 4+ messages in thread
From: BALATON Zoltan @ 2021-05-23 13:41 UTC (permalink / raw)
  To: Mark Watson; +Cc: qemu-devel

Hello,

On Tue, 18 May 2021, Mark Watson wrote:
> I'm trying to implement my own machine for Amiga emulation using a software
> CPU and FPGA hardware. For this I have built my own machine, which consists
> of a large malloc'ed RAM block and some FPGA hardware mmapped elsewhere into
> the memory space.
>
> I'm using QEMU to emulate a 68040 on an ARM Cortex-A9 host in system mode.
>
> It is working, though I'm investigating a strange performance issue.
>
> I'm looking for advice on where to look next in debugging this from the
> specialist(s) of accel/tcg/cputlb.c please.

I think you need to be more specific about details or even better provide 
a way to reproduce it without your hardware if possible otherwise people 
will not get what you're seeing. From the above it's not clear to me if 
you're emulating an fpga hardware in QEMU or actually run with the fpga 
(supposedly implementing the Amiga chipset) in the virtual machine's 
memory so accesses to some addresses will do something in hardware (in 
which case it may be difficult to reproduce without it and also could be 
the source of problems so hard to tell what might be causing your issue.)

(Is this related to pistorm or something based on that for full Amiga 
emulation without Amiga hardware? Just interested, unrelated to this 
thread.)

> To investigate the performance issue I tried to break it down to the
> simplest possible case. I can reproduce it with a simple for loop (compiled
> without optimisation).
>        for (int i = 0; i != 0xffffff; ++i)
>        {
>            if ((i & 0xffff) == 0)
>            {
>            }
>        }

So you do nothing in the loop, just test the loop variable, and this 
sometimes runs slow?

> Running it in user mode on the same host it takes ~0.6 seconds. In the
> built-in 'virtual' m68k machine running linux it takes 1.3 seconds.
> However in my machine under AmigaOS I'm seeing it typically take five and a
> half minutes! Occasionally it seems to run at the correct speed of <2
> seconds, though I have yet to identify why. These are the logs of the
> captured code before it goes into the main chain loop.
> qemu_slow_stuck_fragment.log
> <http://www.64kib.com/qemu_slow_stuck_fragment.log>

The log does not make much sense to me but I'm also not an expert on TCG 
and ARM. Why do you have faults while running a simple empty loop and what 
are those? Is something flushing the TLB for some reason or is this just 
because of the debug logging? I think there are some -d options for mmu 
debugging that may give more info on TLB usage.

> I have verified that this performance change is not due to slow fpga memory
> area access, i.e. there are no accesses to that memory region during this.

OK so then it should be possible to reproduce without that hardware? If so 
that would help people to understand the issue and give advice but I 
see that reproducing may need understanding the issue first.

> I took a look in gdb while running this loop to see what is going on.
> Initially I was surprised that I didn't find the code in 'OUT:', however I
> guess it makes sense that it has to call into the framework for memory
> access. I noticed that a lot of calls to glib are made and see

I rarely use gdb with QEMU so I'm not sure, but normally with TCG in_asm and 
out_asm debug output you'll only see these when the TB is first translated, 
not when you run it later, because then the translated code is run from the 
TB cache. I think you can kind of disable this with -singlestep, which makes 
TBs just a single instruction and may change caching. At least with that I 
see all instructions all the time in -d in_asm so this may help debugging 
although it makes things much slower.

> g_tree_lookup called a lot. This is caused by notdirty_write being called
> thousands of times, each time going into page_collection_lock and
> tb_invalidate_phys_page_fast. I presume this is happening each time that
> "i" is incremented on the stack, which clearly has a huge overhead.

There are only a few places notdirty_write is called from so you should be 
able to identify which of those is firing (if all else fails you could add 
debug logs but there may be trace points to enable too). Once we get which 
place it's coming from then maybe people could tell why that could happen. 
Don't know if you already know the QEMU debug options; I have some tips 
collected that I've used while implementing machine emulation here:

https://osdn.net/projects/qmiga/wiki/DeveloperTips

> Even being able to get a proper stack trace from gdb would be very helpful
> to understand this. I tried to configure qemu with '--enable-debug' but
> still do not get a proper stack if I attach to it. I'm not sure if this is
> the case due to it running dynamically compiled code before calling into
> this.

The --enable-debug adds debug symbols to QEMU but if it's called from 
generated code then you'll probably see that as the source of the calls so 
hard to tell what has put that there. Still, it may help if you could show 
some back traces you got, in case that makes sense to somebody who knows 
about TCG. Also verify that these excessive calls to notdirty_write only 
happen when it's running slow, so that it's really the source of the 
problems and not something normal otherwise.

Sorry I can't give any more useful advice, but maybe the above gives you 
some ideas on how to debug this further.

Regards,
BALATON Zoltan



* Re: notdirty_write thrashing in simple for loop
  2021-05-23 13:41 ` BALATON Zoltan
@ 2021-05-23 16:07   ` Mark Watson
  2021-05-23 17:30     ` BALATON Zoltan
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Watson @ 2021-05-23 16:07 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: qemu-devel

Hi

On Sun, 23 May 2021 at 15:41, BALATON Zoltan <balaton@eik.bme.hu> wrote:

> I think you need to be more specific about details or even better provide
> a way to reproduce it without your hardware if possible otherwise people
> will not get what you're seeing. From the above it's not clear to me if
> you're emulating an fpga hardware in QEMU or actually run with the fpga
> (supposedly implementing the Amiga chipset) in the virtual machine's
> memory so accesses to some addresses will do something in hardware (in
> which case it may be difficult to reproduce without it and also could be
> the source of problems so hard to tell what might be causing your issue.)
>

I managed to reproduce it locally now, without the FPGA, on my x86 system.

The issue seems to be the layout of where the Amiga puts code and the
stack. It does not use virtual memory, and each program seems to get its
stack just below the code. So whenever the code increments i, it writes to
the same page that holds the code, and QEMU does a lookup in a map to
potentially invalidate the translated code.


> (Is this related to pistorm or something based on that for full Amiga
> emulation without Amiga hardware? Just interested, unrelated to this
> thread.)
>
The minimig is a recreation of the Amiga hardware in an FPGA. In addition
to its own dedicated board, it has been ported to many boards: Turbo
Chameleon, MiST, MiSTer (DE10-Nano with expansion). The MiSTer uses an SoC
FPGA chip, which has dual ARM cores and an FPGA on the same silicon, with
high-performance bridges between them.
Pistorm and Buffee are fairly similar, in that they replace the 68k CPU
with an emulated CPU, but with interfaces to the real hardware. As I
understand it, the former uses Musashi and the latter is writing its own
JIT.


> So you do nothing in the loop, just test the loop variable, and this
> sometimes runs slow?
>
Yes, in fact even without the test in the loop: just a loop incrementing i,
where i is on the stack. As I have now found out, it seems to be an issue
when the code and the variable i are in the same page.

Now I could try to modify the software on the Amiga to split the stack and
the code. I do wonder if some kind of caching layer could be added to QEMU
so that repeated invalidates do not take so much CPU time.


> Also verify that these excessive calls to notdirty_write only
> happen when it's running slow, so that it's really the source of the
> problems and not something normal otherwise.
>
I have now confirmed this; I enabled the trace event on notdirty_write to
check.

Many thanks for the QEMU and gdb debugging tips, much appreciated. I will
read them.

Mark Watson


* Re: notdirty_write thrashing in simple for loop
  2021-05-23 16:07   ` Mark Watson
@ 2021-05-23 17:30     ` BALATON Zoltan
  0 siblings, 0 replies; 4+ messages in thread
From: BALATON Zoltan @ 2021-05-23 17:30 UTC (permalink / raw)
  To: Mark Watson; +Cc: Paolo Bonzini, Richard Henderson, qemu-devel

On Sun, 23 May 2021, Mark Watson wrote:
> Hi
>
> On Sun, 23 May 2021 at 15:41, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> I think you need to be more specific about details or even better provide
>> a way to reproduce it without your hardware if possible otherwise people
>> will not get what you're seeing. From the above it's not clear to me if
>> you're emulating an fpga hardware in QEMU or actually run with the fpga
>> (supposedly implementing the Amiga chipset) in the virtual machine's
>> memory so accesses to some addresses will do something in hardware (in
>> which case it may be difficult to reproduce without it and also could be
>> the source of problems so hard to tell what might be causing your issue.)
>
> I managed to reproduce it locally now, without the FPGA, on my x86 system.
>
> The issue seems to be the layout of where the Amiga puts code and the
> stack. It does not use virtual memory, and each program seems to get its
> stack just below the code. So whenever the code increments i, it writes to
> the same page that holds the code, and QEMU does a lookup in a map to
> potentially invalidate the translated code.

That's probably enough for people who can give advice to understand the 
problem. I think I get it but can't help more as I don't know TCG or QEMU 
internals very well.

>> (Is this related to pistorm or something based on that for full Amiga
> emulation without Amiga hardware? Just interested, unrelated to this
>> thread.)
>>
> The minimig is a recreation of the Amiga hardware in an FPGA. In addition
> to its own dedicated board, it has been ported to many boards: Turbo
> Chameleon, MiST, MiSTer (DE10-Nano with expansion). The MiSTer uses an SoC
> FPGA chip, which has dual ARM cores and an FPGA on the same silicon, with
> high-performance bridges between them.
> Pistorm and Buffee are fairly similar, in that they replace the 68k CPU
> with an emulated CPU, but with interfaces to the real hardware. As I
> understand it, the former uses Musashi and the latter is writing its own
> JIT.

I see. Interesting idea: if you already have such an FPGA SoC, you can 
make good use of the ARM cores that way.

> So you do nothing in the loop, just test the loop variable, and this
> sometimes runs slow?
>>
> Yes, in fact even without the test in the loop: just a loop incrementing i,
> where i is on the stack. As I have now found out, it seems to be an issue
> when the code and the variable i are in the same page.
>
> Now I could try to modify the software on the Amiga to split the stack and
> the code. I do wonder if some kind of caching layer could be added to QEMU
> so that repeated invalidates do not take so much CPU time.

I don't know, but I've added the maintainers of accel/tcg/cputlb.c to Cc to 
get their attention. You can get this info from the MAINTAINERS file, or 
more easily with:

scripts/get_maintainer.pl -f accel/tcg/cputlb.c

For reference and more background info, here's a link to Mark's original 
message:

https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg05581.html

Regards,
BALATON Zoltan

> Also verify that these excessive calls to notdirty_write only
> happen when it's running slow, so that it's really the source of the
> problems and not something normal otherwise.
>>
> I have now confirmed this; I enabled the trace event on notdirty_write to
> check.
>
> Many thanks for the QEMU and gdb debugging tips, much appreciated. I will
> read them.
>
> Mark Watson
>

