* Slowness with multi-thread TCG?
@ 2022-06-27 18:25 Frederic Barrat
  2022-06-27 21:10 ` Alex Bennée
  2022-06-28 11:25 ` Matheus K. Ferst
  0 siblings, 2 replies; 13+ messages in thread
From: Frederic Barrat @ 2022-06-27 18:25 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc

[ Resending as it was meant for the qemu-ppc list ]

Hello,

I've been looking at why our qemu powernv model is so slow when booting 
a compressed linux kernel, using multiple vcpus and multi-thread tcg. 
With only one vcpu, the decompression time of the kernel is what it is, 
but when using multiple vcpus, the decompression is actually slower. And 
worse: it degrades very fast with the number of vcpus!

Rough measurement of the decompression time on an x86 laptop with 
multi-thread tcg and using the qemu powernv10 machine:
1 vcpu => 15 seconds
2 vcpus => 45 seconds
4 vcpus => 1 min 30 seconds

Looking in detail, when the firmware (skiboot) hands over execution to 
the linux kernel, there's one main thread entering some bootstrap code 
and running the kernel decompression algorithm. All the other secondary 
threads are left spinning in skiboot (1 thread per vcpu). So on paper, 
with multi-thread tcg and assuming the system has enough available 
physical cpus, I would expect the decompression to hog one physical cpu 
and the time needed to be constant, no matter the number of vcpus.

All the secondary threads are left spinning in code like this:

	for (;;) {
		if (cpu_check_jobs(cpu))  // reading cpu-local data
			break;
		if (reconfigure_idle)     // global variable
			break;
		barrier();
	}

The barrier forces the memory to be re-read on each iteration. It's 
defined as:

   asm volatile("" : : : "memory");
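
For illustration, here is a minimal stand-alone version of that pattern 
(the names are made up, this is not the actual skiboot code). Without the 
barrier, the compiler would be free to read the flag once and then spin 
forever on a stale register copy:

	static int wake_up_flag;          /* set by another thread */

	#define barrier() asm volatile("" : : : "memory")

	static void wait_for_wakeup(void)
	{
		while (!wake_up_flag)
			barrier();        /* clobber forces the flag to be re-read */
	}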


Some time later, the main thread in the linux kernel will get the 
secondary threads out of that loop by posting a job.

My first thought was that the translation of that code through tcg was 
somehow causing some abnormally slow behavior, maybe due to some 
non-obvious contention between the threads. However, if I send the 
threads spinning forever with simply:

     for (;;) ;

supposedly removing any contention, then the decompression time is the same.

Ironically, the behavior seen with single thread tcg is what I would 
expect: 1 thread decompressing in 15 seconds, all the other threads 
spinning for that same amount of time, all sharing the same physical 
cpu, so it all adds up nicely: I see 60 seconds decompression time with 
4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit. 
And single thread tcg hogs one physical cpu of the laptop vs. 4 physical 
cpus for the slower multi-thread tcg.

Does anybody have an idea of what might be happening, or suggestions on 
how to keep investigating?
Thanks for your help!

   Fred




* Re: Slowness with multi-thread TCG?
  2022-06-27 18:25 Slowness with multi-thread TCG? Frederic Barrat
@ 2022-06-27 21:10 ` Alex Bennée
  2022-06-28 11:25 ` Matheus K. Ferst
  1 sibling, 0 replies; 13+ messages in thread
From: Alex Bennée @ 2022-06-27 21:10 UTC (permalink / raw)
  To: Frederic Barrat; +Cc: qemu-ppc, qemu-devel


Frederic Barrat <fbarrat@linux.ibm.com> writes:

> [ Resending as it was meant for the qemu-ppc list ]
>
> Hello,
>
> I've been looking at why our qemu powernv model is so slow when
> booting a compressed linux kernel, using multiple vcpus and
> multi-thread tcg. With only one vcpu, the decompression time of the
> kernel is what it is, but when using multiple vcpus, the decompression
> is actually slower. And worse: it degrades very fast with the number
> of vcpus!
>
> Rough measurement of the decompression time on a x86 laptop with
> multi-thread tcg and using the qemu powernv10 machine:
> 1 vcpu => 15 seconds
> 2 vcpus => 45 seconds
> 4 vcpus => 1 min 30 seconds
>
> Looking in details, when the firmware (skiboot) hands over execution
> to the linux kernel, there's one main thread entering some bootstrap
> code and running the kernel decompression algorithm. All the other
> secondary threads are left spinning in skiboot (1 thread per vpcu). So
> on paper, with multi-thread tcg and assuming the system has enough
> available physical cpus, I would expect the decompression to hog one
> physical cpu and the time needed to be constant, no matter the number
> of vpcus.
>
> All the secondary threads are left spinning in code like this:
>
> 	for (;;) {
> 		if (cpu_check_jobs(cpu))  // reading cpu-local data
> 			break;
> 		if (reconfigure_idle)     // global variable
> 			break;
> 		barrier();
> 	}
>
> The barrier is to force reading the memory with each iteration. It's
> defined as:
>
>   asm volatile("" : : : "memory");
>
>
> Some time later, the main thread in the linux kernel will get the
> secondary threads out of that loop by posting a job.
>
> My first thought was that the translation of that code through tcg was
> somehow causing some abnormally slow behavior, maybe due to some
> non-obvious contention between the threads. However, if I send the
> threads spinning forever with simply:
>
>     for (;;) ;
>
> supposedly removing any contention, then the decompression time is the same.
>
> Ironically, the behavior seen with single thread tcg is what I would
> expect: 1 thread decompressing in 15 seconds, all the other threads
> spinning for that same amount of time, all sharing the same physical
> cpu, so it all adds up nicely: I see 60 seconds decompression time
> with 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a
> bit. And single thread tcg hogs one physical cpu of the laptop vs. 4
> physical cpus for the slower multi-thread tcg.
>
> Does anybody have an idea of what might happen or have suggestion to
> keep investigating?

Usually it becomes clear when running under "perf record" and then you
can post the top 20 functions from perf report. It's usually some event
that triggers synchronisation between all the threads, which is a costly
thing to do.
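
As a rough illustration (the options and paths here are just an example,
not taken from this thread), the host-side profile could be captured with
something like:

	# attach to the running QEMU while the guest is in the slow phase
	perf record -g -p $(pidof qemu-system-ppc64) -- sleep 30
	perf report --sort symbol | head -20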

> Thanks for your help!
>
>   Fred


-- 
Alex Bennée



* Re: Slowness with multi-thread TCG?
  2022-06-27 18:25 Slowness with multi-thread TCG? Frederic Barrat
  2022-06-27 21:10 ` Alex Bennée
@ 2022-06-28 11:25 ` Matheus K. Ferst
  2022-06-28 13:08   ` Frederic Barrat
  1 sibling, 1 reply; 13+ messages in thread
From: Matheus K. Ferst @ 2022-06-28 11:25 UTC (permalink / raw)
  To: Frederic Barrat, qemu-devel, qemu-ppc

On 27/06/2022 15:25, Frederic Barrat wrote:
> [ Resending as it was meant for the qemu-ppc list ]
> 
> Hello,
> 
> I've been looking at why our qemu powernv model is so slow when booting
> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
> With only one vcpu, the decompression time of the kernel is what it is,
> but when using multiple vcpus, the decompression is actually slower. And
> worse: it degrades very fast with the number of vcpus!
> 
> Rough measurement of the decompression time on a x86 laptop with
> multi-thread tcg and using the qemu powernv10 machine:
> 1 vcpu => 15 seconds
> 2 vcpus => 45 seconds
> 4 vcpus => 1 min 30 seconds
> 
> Looking in details, when the firmware (skiboot) hands over execution to
> the linux kernel, there's one main thread entering some bootstrap code
> and running the kernel decompression algorithm. All the other secondary
> threads are left spinning in skiboot (1 thread per vpcu). So on paper,
> with multi-thread tcg and assuming the system has enough available
> physical cpus, I would expect the decompression to hog one physical cpu
> and the time needed to be constant, no matter the number of vpcus.
> 
> All the secondary threads are left spinning in code like this:
> 
>         for (;;) {
>                 if (cpu_check_jobs(cpu))  // reading cpu-local data
>                         break;
>                 if (reconfigure_idle)     // global variable
>                         break;
>                 barrier();
>         }
> 
> The barrier is to force reading the memory with each iteration. It's
> defined as:
> 
>    asm volatile("" : : : "memory");
> 
> 
> Some time later, the main thread in the linux kernel will get the
> secondary threads out of that loop by posting a job.
> 
> My first thought was that the translation of that code through tcg was
> somehow causing some abnormally slow behavior, maybe due to some
> non-obvious contention between the threads. However, if I send the
> threads spinning forever with simply:
> 
>      for (;;) ;
> 
> supposedly removing any contention, then the decompression time is the 
> same.
> 
> Ironically, the behavior seen with single thread tcg is what I would
> expect: 1 thread decompressing in 15 seconds, all the other threads
> spinning for that same amount of time, all sharing the same physical
> cpu, so it all adds up nicely: I see 60 seconds decompression time with
> 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
> cpus for the slower multi-thread tcg.
> 
> Does anybody have an idea of what might happen or have suggestion to
> keep investigating?
> Thanks for your help!
> 
>    Fred
> 
> 

Hi Frederic,

I did some boot time tests recently and didn't notice this behavior. 
Could you share your QEMU command line with us? Did you build QEMU with 
any debug option or sanitizer enabled?

-- 
Matheus K. Ferst
Instituto de Pesquisas ELDORADO <http://www.eldorado.org.br/>
Analista de Software
Aviso Legal - Disclaimer <https://www.eldorado.org.br/disclaimer.html>


* Re: Slowness with multi-thread TCG?
  2022-06-28 11:25 ` Matheus K. Ferst
@ 2022-06-28 13:08   ` Frederic Barrat
  2022-06-28 15:12     ` Alex Bennée
  0 siblings, 1 reply; 13+ messages in thread
From: Frederic Barrat @ 2022-06-28 13:08 UTC (permalink / raw)
  To: Matheus K. Ferst, qemu-devel, qemu-ppc



On 28/06/2022 13:25, Matheus K. Ferst wrote:
> On 27/06/2022 15:25, Frederic Barrat wrote:
>> [ Resending as it was meant for the qemu-ppc list ]
>>
>> Hello,
>>
>> I've been looking at why our qemu powernv model is so slow when booting
>> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
>> With only one vcpu, the decompression time of the kernel is what it is,
>> but when using multiple vcpus, the decompression is actually slower. And
>> worse: it degrades very fast with the number of vcpus!
>>
>> Rough measurement of the decompression time on a x86 laptop with
>> multi-thread tcg and using the qemu powernv10 machine:
>> 1 vcpu => 15 seconds
>> 2 vcpus => 45 seconds
>> 4 vcpus => 1 min 30 seconds
>>
>> Looking in details, when the firmware (skiboot) hands over execution to
>> the linux kernel, there's one main thread entering some bootstrap code
>> and running the kernel decompression algorithm. All the other secondary
>> threads are left spinning in skiboot (1 thread per vpcu). So on paper,
>> with multi-thread tcg and assuming the system has enough available
>> physical cpus, I would expect the decompression to hog one physical cpu
>> and the time needed to be constant, no matter the number of vpcus.
>>
>> All the secondary threads are left spinning in code like this:
>>
>>         for (;;) {
>>                 if (cpu_check_jobs(cpu))  // reading cpu-local data
>>                         break;
>>                 if (reconfigure_idle)     // global variable
>>                         break;
>>                 barrier();
>>         }
>>
>> The barrier is to force reading the memory with each iteration. It's
>> defined as:
>>
>>    asm volatile("" : : : "memory");
>>
>>
>> Some time later, the main thread in the linux kernel will get the
>> secondary threads out of that loop by posting a job.
>>
>> My first thought was that the translation of that code through tcg was
>> somehow causing some abnormally slow behavior, maybe due to some
>> non-obvious contention between the threads. However, if I send the
>> threads spinning forever with simply:
>>
>>      for (;;) ;
>>
>> supposedly removing any contention, then the decompression time is the 
>> same.
>>
>> Ironically, the behavior seen with single thread tcg is what I would
>> expect: 1 thread decompressing in 15 seconds, all the other threads
>> spinning for that same amount of time, all sharing the same physical
>> cpu, so it all adds up nicely: I see 60 seconds decompression time with
>> 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
>> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
>> cpus for the slower multi-thread tcg.
>>
>> Does anybody have an idea of what might happen or have suggestion to
>> keep investigating?
>> Thanks for your help!
>>
>>    Fred
>>
>>
> 
> Hi Frederic,
> 
> I did some boot time tests recently and didn't notice this behavior. 
> Could you share your QEMU command line with us? Did you build QEMU with 
> any debug option or sanitizer enabled?


You should be able to see it with:

qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic -bios <path 
to skiboot.lid> -kernel <path to compressed kernel>   -initrd <path to 
initrd>  -serial mon:stdio


-smp is what matters.

When simplifying the command line above, I noticed something 
interesting: the problem doesn't show up with the skiboot.lid shipped with 
qemu! I'm using something closer to the current upstream head, and the 
idle code (the for loop in my initial mail) has been reworked in 
between. So, clearly, the way the guest code is written matters. But 
that doesn't explain it.

I'm using a kernel in debug mode, so it's pretty big and that's why I 
was using a compressed image. The compressed image is about 8 MB.

The initrd shouldn't matter, the issue is seen during kernel 
decompression, before the init ram is used.

I can share my binaries if you'd like. Especially a recent version of 
skiboot showing the problem.

   Fred







* Re: Slowness with multi-thread TCG?
  2022-06-28 13:08   ` Frederic Barrat
@ 2022-06-28 15:12     ` Alex Bennée
  2022-06-28 16:16       ` Frederic Barrat
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Bennée @ 2022-06-28 15:12 UTC (permalink / raw)
  To: Frederic Barrat; +Cc: Matheus K. Ferst, qemu-ppc, qemu-devel


Frederic Barrat <fbarrat@linux.ibm.com> writes:

> On 28/06/2022 13:25, Matheus K. Ferst wrote:
>> On 27/06/2022 15:25, Frederic Barrat wrote:
>>> [ Resending as it was meant for the qemu-ppc list ]
>>>
>>> Hello,
>>>
>>> I've been looking at why our qemu powernv model is so slow when booting
>>> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
>>> With only one vcpu, the decompression time of the kernel is what it is,
>>> but when using multiple vcpus, the decompression is actually slower. And
>>> worse: it degrades very fast with the number of vcpus!
>>>
>>> Rough measurement of the decompression time on a x86 laptop with
>>> multi-thread tcg and using the qemu powernv10 machine:
>>> 1 vcpu => 15 seconds
>>> 2 vcpus => 45 seconds
>>> 4 vcpus => 1 min 30 seconds
>>>
>>> Looking in details, when the firmware (skiboot) hands over execution to
>>> the linux kernel, there's one main thread entering some bootstrap code
>>> and running the kernel decompression algorithm. All the other secondary
>>> threads are left spinning in skiboot (1 thread per vpcu). So on paper,
>>> with multi-thread tcg and assuming the system has enough available
>>> physical cpus, I would expect the decompression to hog one physical cpu
>>> and the time needed to be constant, no matter the number of vpcus.
<snip>
>>>
>>> Ironically, the behavior seen with single thread tcg is what I would
>>> expect: 1 thread decompressing in 15 seconds, all the other threads
>>> spinning for that same amount of time, all sharing the same physical
>>> cpu, so it all adds up nicely: I see 60 seconds decompression time with
>>> 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
>>> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
>>> cpus for the slower multi-thread tcg.
>>>
>>> Does anybody have an idea of what might happen or have suggestion to
>>> keep investigating?
>>> Thanks for your help!
>>>
>>>    Fred
>>>
>>>
>> Hi Frederic,
>> I did some boot time tests recently and didn't notice this behavior.
>> Could you share your QEMU command line with us? Did you build QEMU
>> with any debug option or sanitizer enabled?
>
>
> You should be able to see it with:
>
> qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic -bios
> <path to skiboot.lid> -kernel <path to compresses kernel>   -initrd
> <path to initd>  -serial mon:stdio
>
>
> -smp is what matters.
>
> When simplifying the command line above, I noticed something
> interesting: the problem doesn't show using the skiboot.lid shipped
> with qemu! I'm using something closer to the current upstream head and
> the idle code (the for loop in my initial mail) had been reworked in
> between. So, clearly, the way the guest code is written matters. But
> that doesn't explain it.
>
> I'm using a kernel in debug mode, so it's pretty big and that's why I
> was using a compressed image. The compressed image is about 8 MB.

If the debug mode on PPC enables live patching of kernel functions for
instrumentation, that can certainly slow things down. You would see that
as tcg_optimize appearing in the perf log and "info jit" showing
constantly growing translation buffers.

>
> The initrd shouldn't matter, the issue is seen during kernel
> decompression, before the init ram is used.
>
> I can share my binaries if you'd like. Especially a recent version of
> skiboot showing the problem.
>
>   Fred


-- 
Alex Bennée



* Re: Slowness with multi-thread TCG?
  2022-06-28 15:12     ` Alex Bennée
@ 2022-06-28 16:16       ` Frederic Barrat
  2022-06-28 22:17         ` Alex Bennée
  0 siblings, 1 reply; 13+ messages in thread
From: Frederic Barrat @ 2022-06-28 16:16 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Matheus K. Ferst, qemu-ppc, qemu-devel



On 28/06/2022 17:12, Alex Bennée wrote:
> 
> Frederic Barrat <fbarrat@linux.ibm.com> writes:
> 
>> On 28/06/2022 13:25, Matheus K. Ferst wrote:
>>> On 27/06/2022 15:25, Frederic Barrat wrote:
>>>> [ Resending as it was meant for the qemu-ppc list ]
>>>>
>>>> Hello,
>>>>
>>>> I've been looking at why our qemu powernv model is so slow when booting
>>>> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
>>>> With only one vcpu, the decompression time of the kernel is what it is,
>>>> but when using multiple vcpus, the decompression is actually slower. And
>>>> worse: it degrades very fast with the number of vcpus!
>>>>
>>>> Rough measurement of the decompression time on a x86 laptop with
>>>> multi-thread tcg and using the qemu powernv10 machine:
>>>> 1 vcpu => 15 seconds
>>>> 2 vcpus => 45 seconds
>>>> 4 vcpus => 1 min 30 seconds
>>>>
>>>> Looking in details, when the firmware (skiboot) hands over execution to
>>>> the linux kernel, there's one main thread entering some bootstrap code
>>>> and running the kernel decompression algorithm. All the other secondary
>>>> threads are left spinning in skiboot (1 thread per vpcu). So on paper,
>>>> with multi-thread tcg and assuming the system has enough available
>>>> physical cpus, I would expect the decompression to hog one physical cpu
>>>> and the time needed to be constant, no matter the number of vpcus.
> <snip>
>>>>
>>>> Ironically, the behavior seen with single thread tcg is what I would
>>>> expect: 1 thread decompressing in 15 seconds, all the other threads
>>>> spinning for that same amount of time, all sharing the same physical
>>>> cpu, so it all adds up nicely: I see 60 seconds decompression time with
>>>> 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
>>>> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
>>>> cpus for the slower multi-thread tcg.
>>>>
>>>> Does anybody have an idea of what might happen or have suggestion to
>>>> keep investigating?
>>>> Thanks for your help!
>>>>
>>>>     Fred
>>>>
>>>>
>>> Hi Frederic,
>>> I did some boot time tests recently and didn't notice this behavior.
>>> Could you share your QEMU command line with us? Did you build QEMU
>>> with any debug option or sanitizer enabled?
>>
>>
>> You should be able to see it with:
>>
>> qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic -bios
>> <path to skiboot.lid> -kernel <path to compresses kernel>   -initrd
>> <path to initd>  -serial mon:stdio
>>
>>
>> -smp is what matters.
>>
>> When simplifying the command line above, I noticed something
>> interesting: the problem doesn't show using the skiboot.lid shipped
>> with qemu! I'm using something closer to the current upstream head and
>> the idle code (the for loop in my initial mail) had been reworked in
>> between. So, clearly, the way the guest code is written matters. But
>> that doesn't explain it.
>>
>> I'm using a kernel in debug mode, so it's pretty big and that's why I
>> was using a compressed image. The compressed image is about 8 MB.
> 
> If the debug mode on PPC enables live patching of kernel functions for
> instrumentation that can certainly slow things down. You would see that
> in tcg_optimize appearing in the perf log and "info jit" showing
> constantly growing translation buffers.


The part where I'm seeing the huge slowdown is not quite in the kernel 
yet. Only one thread is in bootstrap code decompressing the real kernel. 
All the other threads are still spinning in firmware.

Anyway, I've run perf. I couldn't figure out how to trigger the 
recording only around the decompression part with the slowdown. So I 
booted with 4 cpus to make it really slow, expecting that the initial 
steps of the boot, which happen quickly enough, would be dwarfed by the 
time spent while one thread is decompressing the kernel (the part where 
I see the huge slowdown). I'd say the recording was taken with ~80% of 
the time in the interesting part. Here is what I got:


   12,62%  qemu-system-ppc  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
    6,93%  qemu-system-ppc  [kernel.kallsyms]  [k] syscall_return_via_sysret
    5,64%  qemu-system-ppc  [kernel.kallsyms]  [k] __entry_text_start
    3,93%  qemu-system-ppc  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
    3,21%  qemu-system-ppc  libc.so.6          [.] __GI___pthread_mutex_unlock_usercnt
    3,12%  qemu-system-ppc  libc.so.6          [.] __GI___lll_lock_wait
    2,60%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_handle_interrupt
    2,55%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_wake
    2,43%  qemu-system-ppc  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
    1,97%  qemu-system-ppc  [kernel.kallsyms]  [k] _raw_spin_lock
    1,89%  qemu-system-ppc  qemu-system-ppc64  [.] qemu_mutex_lock_impl
    1,83%  qemu-system-ppc  qemu-system-ppc64  [.] tb_lookup
    1,71%  qemu-system-ppc  [kernel.kallsyms]  [k] __get_user_nocheck_4
    1,55%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
    1,46%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_q_lock
    1,39%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_q_unlock
    1,23%  qemu-system-ppc  [kernel.kallsyms]  [k] audit_reset_context.part.0.constprop.0
    1,14%  qemu-system-ppc  qemu-system-ppc64  [.] object_class_dynamic_cast_assert
    1,09%  qemu-system-ppc  qemu-system-ppc64  [.] qemu_mutex_unlock_impl
    1,02%  qemu-system-ppc  qemu-system-ppc64  [.] object_dynamic_cast_assert
    1,00%  qemu-system-ppc  [kernel.kallsyms]  [k] __x64_sys_futex


Any known pattern here? There seems to be some contention with the 
mutex/futex call, but it's not obvious to me what it is.

I was also pointed to enabling gprof in qemu. I'll look into it.

Thanks!

   Fred


>>
>> The initrd shouldn't matter, the issue is seen during kernel
>> decompression, before the init ram is used.
>>
>> I can share my binaries if you'd like. Especially a recent version of
>> skiboot showing the problem.
>>
>>    Fred
> 
> 



* Re: Slowness with multi-thread TCG?
  2022-06-28 16:16       ` Frederic Barrat
@ 2022-06-28 22:17         ` Alex Bennée
  2022-06-29 15:36           ` Frederic Barrat
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Bennée @ 2022-06-28 22:17 UTC (permalink / raw)
  To: Frederic Barrat; +Cc: Matheus K. Ferst, qemu-ppc, qemu-devel


Frederic Barrat <fbarrat@linux.ibm.com> writes:

> On 28/06/2022 17:12, Alex Bennée wrote:
>> Frederic Barrat <fbarrat@linux.ibm.com> writes:
>> 
>>> On 28/06/2022 13:25, Matheus K. Ferst wrote:
>>>> On 27/06/2022 15:25, Frederic Barrat wrote:
>>>>> [ Resending as it was meant for the qemu-ppc list ]
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've been looking at why our qemu powernv model is so slow when booting
>>>>> a compressed linux kernel, using multiple vcpus and multi-thread tcg.
>>>>> With only one vcpu, the decompression time of the kernel is what it is,
>>>>> but when using multiple vcpus, the decompression is actually slower. And
>>>>> worse: it degrades very fast with the number of vcpus!
>>>>>
>>>>> Rough measurement of the decompression time on a x86 laptop with
>>>>> multi-thread tcg and using the qemu powernv10 machine:
>>>>> 1 vcpu => 15 seconds
>>>>> 2 vcpus => 45 seconds
>>>>> 4 vcpus => 1 min 30 seconds
>>>>>
>>>>> Looking in details, when the firmware (skiboot) hands over execution to
>>>>> the linux kernel, there's one main thread entering some bootstrap code
>>>>> and running the kernel decompression algorithm. All the other secondary
>>>>> threads are left spinning in skiboot (1 thread per vpcu). So on paper,
>>>>> with multi-thread tcg and assuming the system has enough available
>>>>> physical cpus, I would expect the decompression to hog one physical cpu
>>>>> and the time needed to be constant, no matter the number of vpcus.
>> <snip>
>>>>>
>>>>> Ironically, the behavior seen with single thread tcg is what I would
>>>>> expect: 1 thread decompressing in 15 seconds, all the other threads
>>>>> spinning for that same amount of time, all sharing the same physical
>>>>> cpu, so it all adds up nicely: I see 60 seconds decompression time with
>>>>> 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
>>>>> And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
>>>>> cpus for the slower multi-thread tcg.
>>>>>
>>>>> Does anybody have an idea of what might happen or have suggestion to
>>>>> keep investigating?
>>>>> Thanks for your help!
>>>>>
>>>>>     Fred
>>>>>
>>>>>
>>>> Hi Frederic,
>>>> I did some boot time tests recently and didn't notice this behavior.
>>>> Could you share your QEMU command line with us? Did you build QEMU
>>>> with any debug option or sanitizer enabled?
>>>
>>>
>>> You should be able to see it with:
>>>
>>> qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic -bios
>>> <path to skiboot.lid> -kernel <path to compresses kernel>   -initrd
>>> <path to initd>  -serial mon:stdio
>>>
>>>
>>> -smp is what matters.
>>>
>>> When simplifying the command line above, I noticed something
>>> interesting: the problem doesn't show using the skiboot.lid shipped
>>> with qemu! I'm using something closer to the current upstream head and
>>> the idle code (the for loop in my initial mail) had been reworked in
>>> between. So, clearly, the way the guest code is written matters. But
>>> that doesn't explain it.
>>>
>>> I'm using a kernel in debug mode, so it's pretty big and that's why I
>>> was using a compressed image. The compressed image is about 8 MB.

You can use split debug info to avoid keeping the symbols in the final
kernel image. Or are there other debugging options turned on?

>> If the debug mode on PPC enables live patching of kernel functions
>> for
>> instrumentation that can certainly slow things down. You would see that
>> in tcg_optimize appearing in the perf log and "info jit" showing
>> constantly growing translation buffers.
>
>
> The part where I'm seeing the huge slowdown is not quite in kernel
> yet. Only one thread is in bootstrap code decompressing the real
> kernel. All the other threads are still spinning in firmware.
>
> Anyway, I've run perf. I couldn't figure out how to trigger the
> recording only around the decompression part with the slowdown. So I
> booted with 4 cpus to make it really slow, expecting the initial steps
> of the boot, which happen quickly enough, would be dwarfed by the time
> spent while one thread is decompressing the kernel (the part where I
> see the huge slowdown). I'd say the recording was taken with ~80% of
> the time in the interesting part. Here is what I got:
>
>
>   12,62%  qemu-system-ppc  [kernel.kallsyms]          [k]
>   syscall_exit_to_user_mode
>    6,93%  qemu-system-ppc  [kernel.kallsyms]          [k]
>    syscall_return_via_sysret
>    5,64%  qemu-system-ppc  [kernel.kallsyms]          [k]
>    __entry_text_start
>    3,93%  qemu-system-ppc  libc.so.6                  [.]
>    pthread_mutex_lock@@GLIBC_2.2.5
>    3,21%  qemu-system-ppc  libc.so.6                  [.]
>    __GI___pthread_mutex_unlock_usercnt
>    3,12%  qemu-system-ppc  libc.so.6                  [.]
>    __GI___lll_lock_wait
>    2,60%  qemu-system-ppc  qemu-system-ppc64          [.]
>    cpu_handle_interrupt
>    2,55%  qemu-system-ppc  [kernel.kallsyms]          [k] futex_wake
>    2,43%  qemu-system-ppc  [kernel.kallsyms]          [k]
>    native_queued_spin_lock_slowpath
>    1,97%  qemu-system-ppc  [kernel.kallsyms]          [k] _raw_spin_lock
>    1,89%  qemu-system-ppc  qemu-system-ppc64          [.]
>    qemu_mutex_lock_impl
>    1,83%  qemu-system-ppc  qemu-system-ppc64          [.] tb_lookup
>    1,71%  qemu-system-ppc  [kernel.kallsyms]          [k]
>    __get_user_nocheck_4
>    1,55%  qemu-system-ppc  qemu-system-ppc64          [.]
>    hreg_compute_hflags_value
>    1,46%  qemu-system-ppc  [kernel.kallsyms]          [k] futex_q_lock
>    1,39%  qemu-system-ppc  [kernel.kallsyms]          [k] futex_q_unlock
>    1,23%  qemu-system-ppc  [kernel.kallsyms]          [k]
>    audit_reset_context.part.0.constprop.0
>    1,14%  qemu-system-ppc  qemu-system-ppc64          [.]
>    object_class_dynamic_cast_assert
>    1,09%  qemu-system-ppc  qemu-system-ppc64          [.]
>    qemu_mutex_unlock_impl
>    1,02%  qemu-system-ppc  qemu-system-ppc64          [.]
>    object_dynamic_cast_assert
>    1,00%  qemu-system-ppc  [kernel.kallsyms]          [k] __x64_sys_futex
>
>
> Any known pattern here? There seems to be some contention with the
> mutex/futex call, but it's not obvious to me what it is.

If you run the sync-profiler (via the HMP "sync-profile on") you can
then get a breakdown of which mutexes are being held and for how long
("info sync-profile").

> I was also pointed to enabling gprof in qemu. I'll look into it.

gprof will likely change the behaviour due to overhead.

>
> Thanks!
>
>   Fred
>
>
>>>
>>> The initrd shouldn't matter, the issue is seen during kernel
>>> decompression, before the init ram is used.
>>>
>>> I can share my binaries if you'd like. Especially a recent version of
>>> skiboot showing the problem.
>>>
>>>    Fred
>> 


-- 
Alex Bennée



* Re: Slowness with multi-thread TCG?
  2022-06-28 22:17         ` Alex Bennée
@ 2022-06-29 15:36           ` Frederic Barrat
  2022-06-29 16:01             ` Alex Bennée
  2022-06-29 16:25             ` Matheus K. Ferst
  0 siblings, 2 replies; 13+ messages in thread
From: Frederic Barrat @ 2022-06-29 15:36 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Matheus K. Ferst, qemu-ppc, qemu-devel



On 29/06/2022 00:17, Alex Bennée wrote:
> If you run the sync-profiler (via the HMP "sync-profile on") you can
> then get a breakdown of which mutex's are being held and for how long
> ("info sync-profile").


Alex, a huge thank you!

For the record, the "info sync-profile" showed:
Type       Object          Call site                      Wait Time (s)     Count  Average (us)
----------------------------------------------------------------------------------------------
BQL mutex  0x55eb89425540  accel/tcg/cpu-exec.c:744            96.31578  73589937          1.31
BQL mutex  0x55eb89425540  target/ppc/helper_regs.c:207         0.00150      1178          1.27


And it points to a lock in the interrupt delivery path, in 
cpu_handle_interrupt().

I now understand the root cause. The interrupt signal for the 
decrementer interrupt remains set because the interrupt is not being 
delivered, per the config. I'm not quite sure what the proper fix is yet 
(there seem to be several implementations of the decrementer on ppc), 
but at least I understand why we are so slow.

With a quick hack, I could verify that by moving that signal out of the 
way, the decompression time of the kernel is now peanuts, no matter the 
number of cpus. Even with one cpu, the 15 seconds measured before was 
already a huge waste, so it was not really a multiple-cpus problem. 
Multiple cpus were just highlighting it.

Thanks again!

   Fred



* Re: Slowness with multi-thread TCG?
  2022-06-29 15:36           ` Frederic Barrat
@ 2022-06-29 16:01             ` Alex Bennée
  2022-06-29 16:25             ` Matheus K. Ferst
  1 sibling, 0 replies; 13+ messages in thread
From: Alex Bennée @ 2022-06-29 16:01 UTC (permalink / raw)
  To: Frederic Barrat; +Cc: Matheus K. Ferst, qemu-ppc, qemu-devel


Frederic Barrat <fbarrat@linux.ibm.com> writes:

> On 29/06/2022 00:17, Alex Bennée wrote:
>> If you run the sync-profiler (via the HMP "sync-profile on") you can
>> then get a breakdown of which mutex's are being held and for how long
>> ("info sync-profile").
>
>
> Alex, a huge thank you!
>
> For the record, the "info sync-profile" showed:
> Type               Object  Call site                     Wait Time (s)
> Count  Average (us)
> --------------------------------------------------------------------------------------------------
> BQL mutex  0x55eb89425540  accel/tcg/cpu-exec.c:744           96.31578
> 73589937          1.31
> BQL mutex  0x55eb89425540  target/ppc/helper_regs.c:207        0.00150
> 1178          1.27
>
>
> And it points to a lock in the interrupt delivery path, in
> cpu_handle_interrupt().
>
> I now understand the root cause. The interrupt signal for the
> decrementer interrupt remains set because the interrupt is not being
> delivered, per the config. I'm not quite sure what the proper fix is
> yet (there seems to be several implementations of the decrementer on
> ppc), but at least I understand why we are so slow.

That sounds like a bug in the interrupt controller emulation. It should
not even be attempting to cpu_exit() and set cpu->interrupt_request
(which are TCG internals) unless the IRQ is unmasked. Usually, when
updates are made to an emulated IRQ controller, you re-calculate the
state and decide whether an interrupt needs to be asserted to QEMU.
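
A stand-alone sketch of that "recalculate then (de)assert" idea, with
made-up names (this is not the actual PPC interrupt code):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	struct toy_intc { uint32_t pending, enabled; bool line; };

	static void toy_intc_update(struct toy_intc *s)
	{
		bool level = (s->pending & s->enabled) != 0; /* only unmasked sources */
		if (level != s->line) {
			s->line = level;     /* change the CPU line only when it moves */
			printf("CPU line %s\n", level ? "asserted" : "deasserted");
		}
	}

	int main(void)
	{
		struct toy_intc s = { 0 };
		s.pending |= 1u << 3; toy_intc_update(&s); /* masked: line stays low */
		s.enabled |= 1u << 3; toy_intc_update(&s); /* unmasked: delivered once */
		return 0;
	}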

> With a quick hack, I could verify that by moving that signal out of
> the way, the decompression time of the kernel is now peanuts, no
> matter the number of cpus. Even with one cpu, the 15 seconds measured
> before was already a huge waste, so it was not really a multiple-cpus
> problem. Multiple cpus were just highlighting it.
>
> Thanks again!
>
>   Fred


-- 
Alex Bennée



* Re: Slowness with multi-thread TCG?
  2022-06-29 15:36           ` Frederic Barrat
  2022-06-29 16:01             ` Alex Bennée
@ 2022-06-29 16:25             ` Matheus K. Ferst
  2022-06-29 17:13               ` Alex Bennée
  1 sibling, 1 reply; 13+ messages in thread
From: Matheus K. Ferst @ 2022-06-29 16:25 UTC (permalink / raw)
  To: Frederic Barrat, Alex Bennée; +Cc: qemu-ppc, qemu-devel

On 29/06/2022 12:36, Frederic Barrat wrote:
> 
> On 29/06/2022 00:17, Alex Bennée wrote:
>> If you run the sync-profiler (via the HMP "sync-profile on") you can
>> then get a breakdown of which mutex's are being held and for how long
>> ("info sync-profile").
> 
> 
> Alex, a huge thank you!
> 
> For the record, the "info sync-profile" showed:
> Type               Object  Call site                     Wait Time (s)
>         Count  Average (us)
> -------------------------------------------------------------------------------------------------- 
> 
> BQL mutex  0x55eb89425540  accel/tcg/cpu-exec.c:744           96.31578
>      73589937          1.31
> BQL mutex  0x55eb89425540  target/ppc/helper_regs.c:207        0.00150
>          1178          1.27
> 
> 
> And it points to a lock in the interrupt delivery path, in
> cpu_handle_interrupt().
> 
> I now understand the root cause. The interrupt signal for the
> decrementer interrupt remains set because the interrupt is not being
> delivered, per the config. I'm not quite sure what the proper fix is yet
> (there seems to be several implementations of the decrementer on ppc),
> but at least I understand why we are so slow.
> 

To summarize what we discussed elsewhere:
1 - The threads that are not decompressing the kernel have a pending 
PPC_INTERRUPT_DECR, and cs->interrupt_request is CPU_INTERRUPT_HARD;
2 - cpu_handle_interrupt calls ppc_cpu_exec_interrupt, which calls 
ppc_hw_interrupt to handle the interrupt;
3 - ppc_cpu_exec_interrupt decides that the interrupt cannot be 
delivered immediately, so the corresponding bit in 
env->pending_interrupts is not reset;
4 - ppc_cpu_exec_interrupt does not change cs->interrupt_request because 
pending_interrupts != 0, so cpu_handle_interrupt will be called again.

This loop will acquire and release qemu_mutex_lock_iothread, slowing 
down other threads that need this lock.
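
To make the cost concrete, here is a small stand-alone model of that
pattern (purely illustrative, not QEMU code): idle threads that keep
re-taking one big lock because a pending flag never clears will slow down
the one thread doing real work whenever it needs the same lock:

	/* build with: cc -O2 -pthread toy_bql.c */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the BQL */
	static atomic_int pending = 1;   /* interrupt flagged but never delivered */
	static atomic_int done;

	static void *idle_vcpu(void *arg)    /* the vcpus spinning in firmware */
	{
		(void)arg;
		while (!atomic_load(&done)) {
			if (atomic_load(&pending)) {      /* request still set...     */
				pthread_mutex_lock(&bql);     /* ...so take the lock...   */
				/* masked: cannot deliver, so 'pending' is NOT cleared */
				pthread_mutex_unlock(&bql);   /* ...and immediately retry */
			}
		}
		return NULL;
	}

	static void *busy_vcpu(void *arg)    /* the vcpu decompressing the kernel */
	{
		(void)arg;
		for (long i = 0; i < 10 * 1000 * 1000; i++) {
			pthread_mutex_lock(&bql);    /* occasional work needing the lock */
			pthread_mutex_unlock(&bql);
		}
		atomic_store(&done, 1);
		return NULL;
	}

	int main(void)
	{
		pthread_t idle[3], busy;
		for (int i = 0; i < 3; i++)
			pthread_create(&idle[i], NULL, idle_vcpu, NULL);
		pthread_create(&busy, NULL, busy_vcpu, NULL);
		pthread_join(busy, NULL);
		for (int i = 0; i < 3; i++)
			pthread_join(idle[i], NULL);
		puts("done");
		return 0;
	}

Timing the busy thread with and without the idle threads running gives a
feel for how much the contended lock costs.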

> With a quick hack, I could verify that by moving that signal out of the
> way, the decompression time of the kernel is now peanuts, no matter the
> number of cpus. Even with one cpu, the 15 seconds measured before was
> already a huge waste, so it was not really a multiple-cpus problem.
> Multiple cpus were just highlighting it.
> 
> Thanks again!
> 
>    Fred
-- 
Matheus K. Ferst
Instituto de Pesquisas ELDORADO <http://www.eldorado.org.br/>
Analista de Software
Aviso Legal - Disclaimer <https://www.eldorado.org.br/disclaimer.html>


* Re: Slowness with multi-thread TCG?
  2022-06-29 16:25             ` Matheus K. Ferst
@ 2022-06-29 17:13               ` Alex Bennée
  2022-06-29 20:55                 ` Cédric Le Goater
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Bennée @ 2022-06-29 17:13 UTC (permalink / raw)
  To: Matheus K. Ferst; +Cc: Frederic Barrat, qemu-ppc, qemu-devel


"Matheus K. Ferst" <matheus.ferst@eldorado.org.br> writes:

> On 29/06/2022 12:36, Frederic Barrat wrote:
>> On 29/06/2022 00:17, Alex Bennée wrote:
>>> If you run the sync-profiler (via the HMP "sync-profile on") you can
>>> then get a breakdown of which mutex's are being held and for how long
>>> ("info sync-profile").
>> Alex, a huge thank you!
>> For the record, the "info sync-profile" showed:
>> Type               Object  Call site                     Wait Time (s)
>>         Count  Average (us)
>> --------------------------------------------------------------------------------------------------
>> BQL mutex  0x55eb89425540  accel/tcg/cpu-exec.c:744          
>> 96.31578
>>      73589937          1.31
>> BQL mutex  0x55eb89425540  target/ppc/helper_regs.c:207        0.00150
>>          1178          1.27
>> And it points to a lock in the interrupt delivery path, in
>> cpu_handle_interrupt().
>> I now understand the root cause. The interrupt signal for the
>> decrementer interrupt remains set because the interrupt is not being
>> delivered, per the config. I'm not quite sure what the proper fix is yet
>> (there seems to be several implementations of the decrementer on ppc),
>> but at least I understand why we are so slow.
>> 
>
> To summarize what we talked elsewhere:
> 1 - The threads that are not decompressing the kernel have a pending
> PPC_INTERRUPT_DECR, and cs->interrupt_request is CPU_INTERRUPT_HARD;

I think ppc_set_irq should be doing some gating before setting
cs->interrupt_request.

> 2 - cpu_handle_interrupt calls ppc_cpu_exec_interrupt, that calls
> ppc_hw_interrupt to handle the interrupt;
> 3 - ppc_cpu_exec_interrupt decides that the interrupt cannot be
> delivered immediately, so the corresponding bit in
> env->pending_interrupts is not reset;

Is the logic controlled by ppc_hw_interrupt()? The stuff around
async_deliver?

I think maybe some of the logic needs to be factored out and checked
above. Also, anywhere env->msr is updated would need to check if
we've just enabled a load of pending interrupts and then call
ppc_set_irq.

However I'm not super familiar with the PPC code so I'll defer to the
maintainers here ;-)

> 4 - ppc_cpu_exec_interrupt does not change cs->interrupt_request
> because pending_interrupts != 0, so cpu_handle_interrupt will be
> called again.
>
> This loop will acquire and release qemu_mutex_lock_iothread, slowing
> down other threads that need this lock.
>
>> With a quick hack, I could verify that by moving that signal out of the
>> way, the decompression time of the kernel is now peanuts, no matter the
>> number of cpus. Even with one cpu, the 15 seconds measured before was
>> already a huge waste, so it was not really a multiple-cpus problem.
>> Multiple cpus were just highlighting it.
>> Thanks again!
>>    Fred


-- 
Alex Bennée



* Re: Slowness with multi-thread TCG?
  2022-06-29 17:13               ` Alex Bennée
@ 2022-06-29 20:55                 ` Cédric Le Goater
  0 siblings, 0 replies; 13+ messages in thread
From: Cédric Le Goater @ 2022-06-29 20:55 UTC (permalink / raw)
  To: Alex Bennée, Matheus K. Ferst; +Cc: Frederic Barrat, qemu-ppc, qemu-devel

On 6/29/22 19:13, Alex Bennée wrote:
> 
> "Matheus K. Ferst" <matheus.ferst@eldorado.org.br> writes:
> 
>> On 29/06/2022 12:36, Frederic Barrat wrote:
>>> On 29/06/2022 00:17, Alex Bennée wrote:
>>>> If you run the sync-profiler (via the HMP "sync-profile on") you can
>>>> then get a breakdown of which mutex's are being held and for how long
>>>> ("info sync-profile").
>>> Alex, a huge thank you!
>>> For the record, the "info sync-profile" showed:
>>> Type               Object  Call site                     Wait Time (s)
>>>          Count  Average (us)
>>> --------------------------------------------------------------------------------------------------
>>> BQL mutex  0x55eb89425540  accel/tcg/cpu-exec.c:744
>>> 96.31578
>>>       73589937          1.31
>>> BQL mutex  0x55eb89425540  target/ppc/helper_regs.c:207        0.00150
>>>           1178          1.27
>>> And it points to a lock in the interrupt delivery path, in
>>> cpu_handle_interrupt().
>>> I now understand the root cause. The interrupt signal for the
>>> decrementer interrupt remains set because the interrupt is not being
>>> delivered, per the config. I'm not quite sure what the proper fix is yet
>>> (there seems to be several implementations of the decrementer on ppc),
>>> but at least I understand why we are so slow.
>>>
>>
>> To summarize what we talked elsewhere:
>> 1 - The threads that are not decompressing the kernel have a pending
>> PPC_INTERRUPT_DECR, and cs->interrupt_request is CPU_INTERRUPT_HARD;
> 
> I think ppc_set_irq should be doing some gating before calling to set
> cs->interrupt_request.
> 
>> 2 - cpu_handle_interrupt calls ppc_cpu_exec_interrupt, that calls
>> ppc_hw_interrupt to handle the interrupt;
>> 3 - ppc_cpu_exec_interrupt decides that the interrupt cannot be
>> delivered immediately, so the corresponding bit in
>> env->pending_interrupts is not reset;
> 
> Is the logic controlled by ppc_hw_interrupt()? The stuff around
> async_deliver?
> 
> I think maybe some of the logic needs to be factored out and checked
> above. Also anywhere where env->msr is updated would need to check if
> we've just enabled a load of pending interrupts and then call
> ppc_set_irq.
> 
> However I'm not super familiar with the PPC code so I'll defer to the
> maintainers here ;-)


That part is a nightmare with a lot of history. It needs a rewrite.
We have a good testing environment and we should catch regressions.
Not for 7.1 though.



> 
>> 4 - ppc_cpu_exec_interrupt does not change cs->interrupt_request
>> because pending_interrupts != 0, so cpu_handle_interrupt will be
>> called again.
>>
>> This loop will acquire and release qemu_mutex_lock_iothread, slowing
>> down other threads that need this lock.
>>
>>> With a quick hack, I could verify that by moving that signal out of the
>>> way, the decompression time of the kernel is now peanuts, no matter the
>>> number of cpus. Even with one cpu, the 15 seconds measured before was
>>> already a huge waste, so it was not really a multiple-cpus problem.
>>> Multiple cpus were just highlighting it.
>>> Thanks again!
>>>     Fred
> 
> 




* Slowness with multi-thread TCG?
@ 2022-06-27 16:25 Frederic Barrat
  0 siblings, 0 replies; 13+ messages in thread
From: Frederic Barrat @ 2022-06-27 16:25 UTC (permalink / raw)
  To: qemu-devel

Hello,

I've been looking at why our qemu powernv model is so slow when booting 
a compressed linux kernel, using multiple vcpus and multi-thread tcg. 
With only one vcpu, the decompression time of the kernel is what it is, 
but when using multiple vcpus, the decompression is actually slower. And 
worse: it degrades very fast with the number of vcpus!

Rough measurement of the decompression time on an x86 laptop with 
multi-thread tcg and using the qemu powernv10 machine:
1 vcpu => 15 seconds
2 vcpus => 45 seconds
4 vcpus => 1 min 30 seconds

Looking in detail, when the firmware (skiboot) hands over execution to 
the linux kernel, there's one main thread entering some bootstrap code 
and running the kernel decompression algorithm. All the other secondary 
threads are left spinning in skiboot (1 thread per vcpu). So on paper, 
with multi-thread tcg and assuming the system has enough available 
physical cpus, I would expect the decompression to hog one physical cpu 
and the time needed to be constant, no matter the number of vcpus.

All the secondary threads are left spinning in code like this:

	for (;;) {
		if (cpu_check_jobs(cpu))  // reading cpu-local data
			break;
		if (reconfigure_idle)     // global variable
			break;
		barrier();
	}

The barrier forces the memory to be re-read on each iteration. It's 
defined as:

   asm volatile("" : : : "memory");


Some time later, the main thread in the linux kernel will get the 
secondary threads out of that loop by posting a job.

My first thought was that the translation of that code through tcg was 
somehow causing some abnormally slow behavior, maybe due to some 
non-obvious contention between the threads. However, if I send the 
threads spinning forever with simply:

     for (;;) ;

supposedly removing any contention, then the decompression time is the same.

Ironically, the behavior seen with single thread tcg is what I would 
expect: 1 thread decompressing in 15 seconds, all the other threads 
spinning for that same amount of time, all sharing the same physical 
cpu, so it all adds up nicely: I see 60 seconds decompression time with 
4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit. 
And single thread tcg hogs one physical cpu of the laptop vs. 4 physical 
cpus for the slower multi-thread tcg.

Does anybody have an idea of what might be happening, or suggestions on 
how to keep investigating?
Thanks for your help!

   Fred




end of thread, other threads:[~2022-06-29 20:56 UTC | newest]

Thread overview: 13+ messages
2022-06-27 18:25 Slowness with multi-thread TCG? Frederic Barrat
2022-06-27 21:10 ` Alex Bennée
2022-06-28 11:25 ` Matheus K. Ferst
2022-06-28 13:08   ` Frederic Barrat
2022-06-28 15:12     ` Alex Bennée
2022-06-28 16:16       ` Frederic Barrat
2022-06-28 22:17         ` Alex Bennée
2022-06-29 15:36           ` Frederic Barrat
2022-06-29 16:01             ` Alex Bennée
2022-06-29 16:25             ` Matheus K. Ferst
2022-06-29 17:13               ` Alex Bennée
2022-06-29 20:55                 ` Cédric Le Goater
  -- strict thread matches above, loose matches on Subject: below --
2022-06-27 16:25 Frederic Barrat
