From: "Alex Bennée" <alex.bennee@linaro.org>
To: Frederic Barrat <fbarrat@linux.ibm.com>
Cc: "Matheus K. Ferst" <matheus.ferst@eldorado.org.br>,
	qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: Slowness with multi-thread TCG?
Date: Tue, 28 Jun 2022 23:17:37 +0100
Message-ID: <8735fojukm.fsf@linaro.org>
In-Reply-To: <348a8b91-6d91-7429-615b-c0e3d23e4fce@linux.ibm.com>


Frederic Barrat <fbarrat@linux.ibm.com> writes:

> On 28/06/2022 17:12, Alex Bennée wrote:
>> Frederic Barrat <fbarrat@linux.ibm.com> writes:
>> 
>>> On 28/06/2022 13:25, Matheus K. Ferst wrote:
>>>> On 27/06/2022 15:25, Frederic Barrat wrote:
>>>>> [ Resending as it was meant for the qemu-ppc list ]
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've been looking at why our qemu powernv model is so slow when booting
>>>>> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
>>>>> With only one vcpu, the decompression time of the kernel is what it is,
>>>>> but when using multiple vcpus, the decompression is actually slower. And
>>>>> worse: it degrades very fast with the number of vcpus!
>>>>>
>>>>> Rough measurement of the decompression time on an x86 laptop with
>>>>> multi-thread tcg and using the qemu powernv10 machine:
>>>>> 1 vcpu => 15 seconds
>>>>> 2 vcpus => 45 seconds
>>>>> 4 vcpus => 1 min 30 seconds
>>>>>
>>>>> Looking at it in more detail: when the firmware (skiboot) hands over
>>>>> execution to the linux kernel, there's one main thread entering some
>>>>> bootstrap code and running the kernel decompression algorithm. All the
>>>>> other secondary threads are left spinning in skiboot (1 thread per vcpu).
>>>>> So on paper, with multi-thread tcg and assuming the system has enough
>>>>> available physical cpus, I would expect the decompression to hog one
>>>>> physical cpu and the time needed to be constant, no matter the number of
>>>>> vcpus.
>> <snip>
>>>>>
>>>>> Ironically, the behavior seen with single thread tcg is what I would
>>>>> expect: 1 thread decompressing in 15 seconds, all the other threads
>>>>> spinning for that same amount of time, all sharing the same physical
>>>>> cpu, so it all adds up nicely: I see 60 seconds decompression time with
>>>>> 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
>>>>> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
>>>>> cpus for the slower multi-thread tcg.
>>>>>
>>>>> Does anybody have an idea of what might be happening, or suggestions
>>>>> on how to keep investigating?
>>>>> Thanks for your help!
>>>>>
>>>>>     Fred
>>>>>
>>>>>
>>>> Hi Frederic,
>>>> I did some boot time tests recently and didn't notice this behavior.
>>>> Could you share your QEMU command line with us? Did you build QEMU
>>>> with any debug option or sanitizer enabled?
>>>
>>>
>>> You should be able to see it with:
>>>
>>> qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic \
>>>   -bios <path to skiboot.lid> -kernel <path to compressed kernel> \
>>>   -initrd <path to initrd> -serial mon:stdio
>>>
>>>
>>> -smp is what matters.
>>>
>>> When simplifying the command line above, I noticed something
>>> interesting: the problem doesn't show up with the skiboot.lid shipped
>>> with qemu! I'm using something closer to the current upstream head,
>>> and the idle code (the for loop in my initial mail) has been reworked
>>> in between. So, clearly, the way the guest code is written matters.
>>> But that doesn't explain it.
>>>
>>> I'm using a kernel in debug mode, so it's pretty big and that's why I
>>> was using a compressed image. The compressed image is about 8 MB.

You can use split debug info to avoid keeping the symbols in the final
kernel image. Or are there other debugging options turned on?
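
(Untested sketch, and assuming it really is the DWARF data that makes the
image big: something like the following in the kernel .config keeps most of
the debug info out of the final image:

  CONFIG_DEBUG_INFO=y
  CONFIG_DEBUG_INFO_SPLIT=y        # DWARF goes into separate .dwo files
  # or, if the toolchain can't do split DWARF:
  # CONFIG_DEBUG_INFO_REDUCED=y

It shouldn't change the TCG behaviour either way, it just keeps the image
smaller.)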

>> If the debug mode on PPC enables live patching of kernel functions for
>> instrumentation, that can certainly slow things down. You would see it
>> as tcg_optimize appearing in the perf log and "info jit" showing
>> constantly growing translation buffers.
>
>
> The part where I'm seeing the huge slowdown is not quite in the kernel
> yet. Only one thread is in bootstrap code decompressing the real
> kernel. All the other threads are still spinning in firmware.
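
(If you want to rule out TB churn anyway - untested, but since your command
line has -serial mon:stdio you can flip between the guest console and the
HMP monitor with C-a c and watch the translation cache while the
decompression runs:

  (qemu) info jit

If the TB count and code cache usage keep climbing during that phase,
something is forcing constant retranslation.)
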
>
> Anyway, I've run perf. I couldn't figure out how to trigger the
> recording only around the decompression part with the slowdown. So I
> booted with 4 cpus to make it really slow, expecting that the initial
> steps of the boot, which happen quickly enough, would be dwarfed by the
> time spent while one thread is decompressing the kernel (the part where
> I see the huge slowdown). I'd say the recording was taken with ~80% of
> the time in the interesting part. Here is what I got:
>
>
>   12,62%  qemu-system-ppc  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
>    6,93%  qemu-system-ppc  [kernel.kallsyms]  [k] syscall_return_via_sysret
>    5,64%  qemu-system-ppc  [kernel.kallsyms]  [k] __entry_text_start
>    3,93%  qemu-system-ppc  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
>    3,21%  qemu-system-ppc  libc.so.6          [.] __GI___pthread_mutex_unlock_usercnt
>    3,12%  qemu-system-ppc  libc.so.6          [.] __GI___lll_lock_wait
>    2,60%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_handle_interrupt
>    2,55%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_wake
>    2,43%  qemu-system-ppc  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
>    1,97%  qemu-system-ppc  [kernel.kallsyms]  [k] _raw_spin_lock
>    1,89%  qemu-system-ppc  qemu-system-ppc64  [.] qemu_mutex_lock_impl
>    1,83%  qemu-system-ppc  qemu-system-ppc64  [.] tb_lookup
>    1,71%  qemu-system-ppc  [kernel.kallsyms]  [k] __get_user_nocheck_4
>    1,55%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>    1,46%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_q_lock
>    1,39%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_q_unlock
>    1,23%  qemu-system-ppc  [kernel.kallsyms]  [k] audit_reset_context.part.0.constprop.0
>    1,14%  qemu-system-ppc  qemu-system-ppc64  [.] object_class_dynamic_cast_assert
>    1,09%  qemu-system-ppc  qemu-system-ppc64  [.] qemu_mutex_unlock_impl
>    1,02%  qemu-system-ppc  qemu-system-ppc64  [.] object_dynamic_cast_assert
>    1,00%  qemu-system-ppc  [kernel.kallsyms]  [k] __x64_sys_futex
>
>
> Any known pattern here? There seems to be some contention on the
> mutex/futex calls, but it's not obvious to me where it comes from.
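
For scoping the recording: untested, and assuming there is only one QEMU
instance running, you can attach perf to the already-running process just
as the decompression starts and record for a fixed window:

  perf record -g -p $(pidof qemu-system-ppc64) -- sleep 30

then "perf report" as usual. That keeps the earlier boot steps out of the
profile.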

If you run the sync-profiler (via the HMP "sync-profile on") you can
then get a breakdown of which mutexes are being held and for how long
("info sync-profile").

> I was also pointed to enabling gprof in qemu. I'll look into it.

gprof will likely change the behaviour due to overhead.

>
> Thanks!
>
>   Fred
>
>
>>>
>>> The initrd shouldn't matter, the issue is seen during kernel
>>> decompression, before the init ram is used.
>>>
>>> I can share my binaries if you'd like. Especially a recent version of
>>> skiboot showing the problem.
>>>
>>>    Fred
>> 


-- 
Alex Bennée


