From: "Alex Bennée" <alex.bennee@linaro.org>
To: "Matheus K. Ferst" <matheus.ferst@eldorado.org.br>
Cc: Frederic Barrat <fbarrat@linux.ibm.com>,
	qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: Slowness with multi-thread TCG?
Date: Wed, 29 Jun 2022 18:13:50 +0100	[thread overview]
Message-ID: <87ilojgzfs.fsf@linaro.org> (raw)
In-Reply-To: <9c97ae8f-f733-21fc-97d1-99af971e38fd@eldorado.org.br>


"Matheus K. Ferst" <matheus.ferst@eldorado.org.br> writes:

> On 29/06/2022 12:36, Frederic Barrat wrote:
>> On 29/06/2022 00:17, Alex Bennée wrote:
>>> If you run the sync-profiler (via the HMP "sync-profile on") you can
>>> then get a breakdown of which mutexes are being held and for how long
>>> ("info sync-profile").
>> Alex, a huge thank you!
>> For the record, the "info sync-profile" showed:
>> Type       Object          Call site                     Wait Time (s)     Count  Average (us)
>> --------------------------------------------------------------------------------------------
>> BQL mutex  0x55eb89425540  accel/tcg/cpu-exec.c:744           96.31578  73589937         1.31
>> BQL mutex  0x55eb89425540  target/ppc/helper_regs.c:207        0.00150      1178         1.27
>> And it points to a lock in the interrupt delivery path, in
>> cpu_handle_interrupt().
>> I now understand the root cause. The interrupt signal for the
>> decrementer interrupt remains set because the interrupt is not being
>> delivered, per the config. I'm not quite sure what the proper fix is yet
>> (there seem to be several implementations of the decrementer on ppc),
>> but at least I understand why we are so slow.
>> 
>
> To summarize what we discussed elsewhere:
> 1 - The threads that are not decompressing the kernel have a pending
> PPC_INTERRUPT_DECR, and cs->interrupt_request is CPU_INTERRUPT_HARD;

I think ppc_set_irq should be doing some gating before it sets
cs->interrupt_request.
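Roughly what I have in mind, as a compilable sketch with stubbed-out types
rather than QEMU's actual code — the ppc_irq_deliverable() helper, the
single MSR[EE] gate, and the bit values are all made up for illustration
(the real masking logic is per-interrupt):

```c
#include <stdbool.h>

#define PPC_INTERRUPT_DECR  (1 << 9)   /* illustrative bit position */
#define CPU_INTERRUPT_HARD  0x0002

/* Minimal stand-ins for CPUPPCState / CPUState */
typedef struct {
    unsigned pending_interrupts;
    bool msr_ee;                       /* MSR[EE]: interrupts enabled */
} CPUPPCState;

typedef struct {
    int interrupt_request;
} CPUState;

/* Hypothetical deliverability check: could this interrupt actually be
 * taken with the current MSR state?  (Real logic is per-interrupt.) */
static bool ppc_irq_deliverable(CPUPPCState *env, unsigned irq)
{
    return env->msr_ee;
}

static void ppc_set_irq(CPUPPCState *env, CPUState *cs,
                        unsigned irq, int level)
{
    if (level) {
        env->pending_interrupts |= irq;
        /* Gate here: don't raise CPU_INTERRUPT_HARD (and force exits
         * from the execution loop) for an interrupt that is masked. */
        if (ppc_irq_deliverable(env, irq)) {
            cs->interrupt_request |= CPU_INTERRUPT_HARD;
        }
    } else {
        env->pending_interrupts &= ~irq;
    }
}
```

With gating like that, a masked decrementer interrupt leaves
cs->interrupt_request clear and the vCPU never enters the slow path at
all.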

> 2 - cpu_handle_interrupt calls ppc_cpu_exec_interrupt, that calls
> ppc_hw_interrupt to handle the interrupt;
> 3 - ppc_cpu_exec_interrupt decides that the interrupt cannot be
> delivered immediately, so the corresponding bit in
> env->pending_interrupts is not reset;

Is the logic controlled by ppc_hw_interrupt()? The stuff around
async_deliver?

I think maybe some of that logic needs to be factored out and checked
before setting the flag. Also, anywhere env->msr is updated would need
to check whether we've just unmasked a load of pending interrupts and
then call ppc_set_irq.
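A hedged sketch of what that re-check on MSR writes could look like —
the helper name ppc_recompute_irq and the single MSR[EE] gate are my
invention, and the real deliverability logic is spread across
ppc_hw_interrupt():

```c
#include <stdbool.h>

#define CPU_INTERRUPT_HARD  0x0002

/* Minimal stand-ins for CPUPPCState / CPUState */
typedef struct {
    unsigned pending_interrupts;
    bool msr_ee;                  /* MSR[EE]: interrupts enabled */
} CPUPPCState;

typedef struct {
    int interrupt_request;
} CPUState;

/* Hypothetical helper: recompute cs->interrupt_request from the
 * current pending set and MSR, so the flag is only up when an
 * interrupt could actually be delivered. */
static void ppc_recompute_irq(CPUPPCState *env, CPUState *cs)
{
    if (env->pending_interrupts && env->msr_ee) {
        cs->interrupt_request |= CPU_INTERRUPT_HARD;
    } else {
        cs->interrupt_request &= ~CPU_INTERRUPT_HARD;
    }
}

/* Sketch of an MSR write path that re-checks pending interrupts, so
 * unmasking (or masking) takes effect immediately. */
static void ppc_store_msr_ee(CPUPPCState *env, CPUState *cs, bool ee)
{
    env->msr_ee = ee;
    ppc_recompute_irq(env, cs);
}
```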

However, I'm not super familiar with the PPC code, so I'll defer to the
maintainers here ;-)

> 4 - ppc_cpu_exec_interrupt does not change cs->interrupt_request
> because pending_interrupts != 0, so cpu_handle_interrupt will be
> called again.
>
> This loop will acquire and release qemu_mutex_lock_iothread, slowing
> down other threads that need this lock.
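For anyone following along, the shape of that loop is roughly this
(heavily condensed from accel/tcg/cpu-exec.c; the stubs stand in for
QEMU's real functions, and the counter is only there to make the lock
traffic visible):

```c
#include <stdbool.h>

#define CPU_INTERRUPT_HARD  0x0002

typedef struct {
    int interrupt_request;
} CPUState;

static int bql_acquisitions;   /* counts lock round-trips, for illustration */

static void qemu_mutex_lock_iothread(void)   { bql_acquisitions++; }
static void qemu_mutex_unlock_iothread(void) { }

/* Stand-in for ppc_cpu_exec_interrupt: the interrupt is masked, so it
 * returns false and leaves cs->interrupt_request untouched. */
static bool cpu_exec_interrupt(CPUState *cs)
{
    (void)cs;
    return false;
}

/* Condensed shape of cpu_handle_interrupt: as long as the flag stays
 * set, every pass through the execution loop takes the BQL again. */
static void cpu_handle_interrupt(CPUState *cs)
{
    if (cs->interrupt_request & CPU_INTERRUPT_HARD) {
        qemu_mutex_lock_iothread();
        if (cpu_exec_interrupt(cs)) {
            cs->interrupt_request &= ~CPU_INTERRUPT_HARD;
        }
        qemu_mutex_unlock_iothread();
    }
}
```

Since cpu_exec_interrupt never clears the flag, every iteration is
another acquire/release of the BQL — which is exactly the contention the
sync-profile showed at cpu-exec.c:744.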
>
>> With a quick hack, I could verify that by moving that signal out of the
>> way, the decompression time of the kernel is now peanuts, no matter the
>> number of cpus. Even with one cpu, the 15 seconds measured before was
>> already a huge waste, so it was not really a multiple-cpus problem.
>> Multiple cpus were just highlighting it.
>> Thanks again!
>>    Fred


-- 
Alex Bennée



Thread overview: 13+ messages
2022-06-27 18:25 Slowness with multi-thread TCG? Frederic Barrat
2022-06-27 21:10 ` Alex Bennée
2022-06-28 11:25 ` Matheus K. Ferst
2022-06-28 13:08   ` Frederic Barrat
2022-06-28 15:12     ` Alex Bennée
2022-06-28 16:16       ` Frederic Barrat
2022-06-28 22:17         ` Alex Bennée
2022-06-29 15:36           ` Frederic Barrat
2022-06-29 16:01             ` Alex Bennée
2022-06-29 16:25             ` Matheus K. Ferst
2022-06-29 17:13               ` Alex Bennée [this message]
2022-06-29 20:55                 ` Cédric Le Goater
  -- strict thread matches above, loose matches on Subject: below --
2022-06-27 16:25 Frederic Barrat
