* notdirty_write thrashing in simple for loop
@ 2021-05-18  9:54 Mark Watson
  2021-05-23 13:41 ` BALATON Zoltan
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Watson @ 2021-05-18  9:54 UTC (permalink / raw)
  To: qemu-devel

Hi,

I'm trying to implement my own machine for Amiga emulation using a software
CPU and FPGA hardware. For this I have built my own machine, which consists
of a large malloc'ed RAM block and some FPGA hardware mmapped elsewhere into
the memory space.

I'm using QEMU to emulate a 68040 on an ARM Cortex-A9 host in system mode.

It is working, though I'm investigating a strange performance issue.

I'm looking for advice on where to look next in debugging this from the
specialist(s) of accel/tcg/cputlb.c please.

To investigate the performance issue I tried to break it down to the
simplest possible case. I can reproduce it with a simple for loop (compiled
without optimisation).
        for (int i = 0; i != 0xffffff; ++i)
        {
            if ((i & 0xffff) == 0)
            {
            }
        }
Running it in user mode on the same host it takes ~0.6 seconds. In the
built-in 'virtual' m68k machine running linux it takes 1.3 seconds.
However in my machine under AmigaOS I'm seeing it typically take five and a
half minutes! Occasionally it seems to run at the correct speed of <2
seconds, though I have yet to identify why. These are the logs of the
captured code before it goes into the main chain loop.
qemu_slow_stuck_fragment.log
<http://www.64kib.com/qemu_slow_stuck_fragment.log>
I have verified that this performance change is not due to slow fpga memory
area access, i.e. there are no accesses to that memory region during this.

I took a look in gdb while running this loop to see what is going on.
Initially I was surprised that I didn't find the code in 'OUT:', however I
guess it makes sense that it has to call into the framework for memory
access. I noticed that a lot of calls to glib are made and see
g_tree_lookup called a lot. This is caused by notdirty_write being called
thousands of times, each time going into page_collection_lock and
tb_invalidate_phys_page_fast. I presume this is happening each time that
"i" is incremented on the stack, which clearly has a huge overhead.

Even being able to get a proper stack trace from gdb would be very helpful
to understand this. I tried to configure qemu with '--enable-debug' but
still do not get a proper stack if I attach to it. I'm not sure if this is
the case due to it running dynamically compiled code before calling into
this.

Thanks,
Mark


* Re: notdirty_write thrashing in simple for loop
  2021-05-18  9:54 notdirty_write thrashing in simple for loop Mark Watson
@ 2021-05-23 13:41 ` BALATON Zoltan
  2021-05-23 16:07   ` Mark Watson
  0 siblings, 1 reply; 4+ messages in thread
From: BALATON Zoltan @ 2021-05-23 13:41 UTC (permalink / raw)
  To: Mark Watson; +Cc: qemu-devel

Hello,

On Tue, 18 May 2021, Mark Watson wrote:
> I'm trying to implement my own machine for Amiga emulation using a software
> CPU and FPGA hardware. For this I have built my own machine, which consists
> of a large malloc'ed RAM block and some FPGA hardware mmapped elsewhere into
> the memory space.
>
> I'm using QEMU to emulate a 68040 on an ARM Cortex-A9 host in system mode.
>
> It is working, though I'm investigating a strange performance issue.
>
> I'm looking for advice on where to look next in debugging this from the
> specialist(s) of accel/tcg/cputlb.c please.

I think you need to be more specific about details or even better provide 
a way to reproduce it without your hardware if possible otherwise people 
will not get what you're seeing. From the above it's not clear to me if 
you're emulating an fpga hardware in QEMU or actually run with the fpga 
(supposedly implementing the Amiga chipset) in the virtual machine's 
memory so accesses to some addresses will do something in hardware (in 
which case it may be difficult to reproduce without it and also could be 
the source of problems so hard to tell what might be causing your issue.)

(Is this related to pistorm or something based on that for full Amiga 
emulation without Amiga hardware? Just interested, unrelated to this 
thread.)

> To investigate the performance issue I tried to break it down to the
> simplest possible case. I can reproduce it with a simple for loop (compiled
> without optimisation).
>        for (int i = 0; i != 0xffffff; ++i)
>        {
>            if ((i & 0xffff) == 0)
>            {
>            }
>        }

So you do nothing in the loop, just test the loop variable, and this 
sometimes runs slow?

> Running it in user mode on the same host it takes ~0.6 seconds. In the
> built-in 'virtual' m68k machine running linux it takes 1.3 seconds.
> However in my machine under AmigaOS I'm seeing it typically take five and a
> half minutes! Occasionally it seems to run at the correct speed of <2
> seconds, though I have yet to identify why. These are the logs of the
> captured code before it goes into the main chain loop.
> qemu_slow_stuck_fragment.log
> <http://www.64kib.com/qemu_slow_stuck_fragment.log>

The log does not make much sense to me but I'm also not an expert on TCG 
and ARM. Why do you have faults while running a simple empty loop and what 
are those? Is something flushing the TLB for some reason or is this just 
because of the debug logging? I think there are some -d options for mmu 
debugging that may give more info on TLB usage.

> I have verified that this performance change is not due to slow fpga memory
> area access, i.e. there are no accesses to that memory region during this.

OK so then it should be possible to reproduce without that hardware? If so 
that would help people to understand the issue and give advice but I 
see that reproducing may need understanding the issue first.

> I took a look in gdb while running this loop to see what is going on.
> Initially I was surprised that I didn't find the code in 'OUT:', however I
> guess it makes sense that it has to call into the framework for memory
> access. I noticed that a lot of calls to glib are made and see

I rarely use gdb with QEMU so I'm not sure, but normally with TCG in_asm and 
out_asm debug output you'll only see these when the TB is first translated, 
not when you run it later, because then the translated code is run from the 
TB cache. I think you can kind of disable this with -singlestep, which makes 
TBs just a single instruction and may change caching. At least with that I 
see all instructions all the time in -d in_asm so this may help debugging 
although it makes things much slower.

> g_tree_lookup called a lot. This is caused by notdirty_write being called
> thousands of times, each time going into page_collection_lock and
> tb_invalidate_phys_page_fast. I presume this is happening each time that
> "i" is incremented on the stack, which clearly has a huge overhead.

There are only a few places notdirty_write is called from so you should be 
able to identify which of those is firing (if all else fails you could add 
debug logs but there may be trace points to enable too). Once we get which 
place it's coming from then maybe people could tell why that could happen. 
Don't know if you already know the QEMU debug options; I have some tips 
collected that I've used while implementing machine emulation here:

https://osdn.net/projects/qmiga/wiki/DeveloperTips

> Even being able to get a proper stack trace from gdb would be very helpful
> to understand this. I tried to configure qemu with '--enable-debug' but
> still do not get a proper stack if I attach to it. I'm not sure if this is
> the case due to it running dynamically compiled code before calling into
> this.

The --enable-debug adds debug symbols to QEMU but if it's called from 
generated code then you'll probably see that as the source of the calls so 
hard to tell what has put that there. Still, it may help if you could show 
some back traces you got, in case that makes sense to somebody who knows 
about TCG. Also verify that these excessive calls to notdirty_write only 
happen when it's running slow, so that it's really the source of the 
problems and not something normal otherwise.

Sorry I can't give any more useful advice, but maybe the above gives you 
some ideas on how to debug this further.

Regards,
BALATON Zoltan



* Re: notdirty_write thrashing in simple for loop
  2021-05-23 13:41 ` BALATON Zoltan
@ 2021-05-23 16:07   ` Mark Watson
  2021-05-23 17:30     ` BALATON Zoltan
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Watson @ 2021-05-23 16:07 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: qemu-devel

Hi

On Sun, 23 May 2021 at 15:41, BALATON Zoltan <balaton@eik.bme.hu> wrote:

> I think you need to be more specific about details or even better provide
> a way to reproduce it without your hardware if possible otherwise people
> will not get what you're seeing. From the above it's not clear to me if
> you're emulating an fpga hardware in QEMU or actually run with the fpga
> (supposedly implementing the Amiga chipset) in the virtual machine's
> memory so accesses to some addresses will do something in hardware (in
> which case it may be difficult to reproduce without it and also could be
> the source of problems so hard to tell what might be causing your issue.)
>

I managed to reproduce it locally now, without the FPGA, on my x86 system.

The issue seems to be the layout of where the Amiga puts code and the
stack. It does not use virtual memory, and each program seems to get its
stack just below the code. So whenever the code increments i, it writes to
the same page that holds the code, and QEMU does a lookup in a map to
potentially invalidate the translated code.


> (Is this related to pistorm or something based on that for full Amiga
> emulation without Amiga hardware? Just interested, unrelated to this
> thread.)
>
The minimig is a recreation of the Amiga hardware in an FPGA. In addition
to its own dedicated board, it has been ported to many boards: Turbo
Chameleon, MiST, MiSTer (DE10-Nano with expansion). The MiSTer uses an SoC
FPGA chip, which has dual ARM cores and an FPGA on the same silicon, with
high-performance bridges between them.
Pistorm and Buffee are fairly similar, in that they replace the 68k CPU
with an emulated CPU, but with interfaces to the real hardware. As I
understand it, the former uses Musashi and the latter is writing its own
JIT.


> So you do nothing in the loop, just test the loop variable, and this
> sometimes runs slow?
>
Yes, in fact even without the test in the loop: just a loop incrementing i,
where i is on the stack. As I have now found out, it seems to be an issue
when the code and the variable i are in the same page.

Now I could try to modify the software on the Amiga to split the stack and
the code. I do wonder if some kind of caching layer could be added to QEMU
so that repeated invalidates do not take so much CPU time.


> Also verify that these excessive calls to notdirty_write only
> happen when it's running slow, so that it's really the source of the
> problems and not something normal otherwise.
>
I have now confirmed this; I enabled the trace event on notdirty_write to
check.

Many thanks for the QEMU and gdb debugging tips, much appreciated. I will
read them.

Mark Watson


* Re: notdirty_write thrashing in simple for loop
  2021-05-23 16:07   ` Mark Watson
@ 2021-05-23 17:30     ` BALATON Zoltan
  0 siblings, 0 replies; 4+ messages in thread
From: BALATON Zoltan @ 2021-05-23 17:30 UTC (permalink / raw)
  To: Mark Watson; +Cc: Paolo Bonzini, Richard Henderson, qemu-devel

On Sun, 23 May 2021, Mark Watson wrote:
> Hi
>
> On Sun, 23 May 2021 at 15:41, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> I think you need to be more specific about details or even better provide
>> a way to reproduce it without your hardware if possible otherwise people
>> will not get what you're seeing. From the above it's not clear to me if
>> you're emulating an fpga hardware in QEMU or actually run with the fpga
>> (supposedly implementing the Amiga chipset) in the virtual machine's
>> memory so accesses to some addresses will do something in hardware (in
>> which case it may be difficult to reproduce without it and also could be
>> the source of problems so hard to tell what might be causing your issue.)
>
> I managed to reproduce it locally now, without the FPGA, on my x86 system.
>
> The issue seems to be the layout of where the Amiga puts code and the
> stack. It does not use virtual memory, and each program seems to get its
> stack just below the code. So whenever the code increments i, it writes to
> the same page that holds the code, and QEMU does a lookup in a map to
> potentially invalidate the translated code.

That's probably enough for people who can give advice to understand the 
problem. I think I get it but can't help more as I don't know TCG or QEMU 
internals very well.

>> (Is this related to pistorm or something based on that for full Amiga
> emulation without Amiga hardware? Just interested, unrelated to this
>> thread.)
>>
> The minimig is a recreation of the Amiga hardware in an FPGA. In addition
> to its own dedicated board, it has been ported to many boards: Turbo
> Chameleon, MiST, MiSTer (DE10-Nano with expansion). The MiSTer uses an SoC
> FPGA chip, which has dual ARM cores and an FPGA on the same silicon, with
> high-performance bridges between them.
> Pistorm and Buffee are fairly similar, in that they replace the 68k CPU
> with an emulated CPU, but with interfaces to the real hardware. As I
> understand it, the former uses Musashi and the latter is writing its own
> JIT.

I see. Interesting idea: if you already have such an FPGA SoC, you can 
make good use of the ARM cores that way.

> So you do nothing in the loop, just test the loop variable, and this
> sometimes runs slow?
>>
> Yes, in fact even without the test in the loop: just a loop incrementing i,
> where i is on the stack. As I have now found out, it seems to be an issue
> when the code and the variable i are in the same page.
>
> Now I could try to modify the software on the Amiga to split the stack and
> the code. I do wonder if some kind of caching layer could be added to QEMU
> so that repeated invalidates do not take so much CPU time.

I don't know, but I've added the maintainers of accel/tcg/cputlb.c to Cc to 
get their attention. You can get this info from the MAINTAINERS file, or 
more easily with:

scripts/get_maintainer.pl -f accel/tcg/cputlb.c

For reference and more background info, here's a link to Mark's original 
message:

https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg05581.html

Regards,
BALATON Zoltan

> Also verify that these excessive calls to notdirty_write only
> happen when it's running slow, so that it's really the source of the
> problems and not something normal otherwise.
>>
> I have now confirmed this; I enabled the trace event on notdirty_write to
> check.
>
> Many thanks for the QEMU and gdb debugging tips, much appreciated. I will
> read them.
>
> Mark Watson
>

