* unixbench context switch performance & cpu topology
@ 2018-01-22 11:47 Wanpeng Li
  2018-01-22 12:08 ` Mike Galbraith
  2018-01-22 12:53 ` Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Wanpeng Li @ 2018-01-22 11:47 UTC (permalink / raw)
  To: linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Zijlstra, Radim Krcmar, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar

Hi all,

We can observe that unixbench context switch performance is heavily
influenced by the cpu topology which is exposed to the guest. The
scores are posted below, bigger is better. Both the guest and the
host kernel are 3.15-rc3 (we can also reproduce this against a centos
7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
halt-polling is enabled by default, and the guest is started w/ 8
logical cpus.



unixbench context switch
-smp 8, sockets=8, cores=1, threads=1    382036
-smp 8, sockets=4, cores=2, threads=1    132480
-smp 8, sockets=2, cores=4, threads=1    128032
-smp 8, sockets=2, cores=2, threads=2    131767
-smp 8, sockets=1, cores=4, threads=2    132742
-smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)    331471

I can observe a lot of reschedule IPIs sent from one vCPU to another
vCPU. The context switch workload switches between running and idle
frequently, which results in HLT instructions in the idle path. I use
idle=poll to avoid the vmexits due to HLT and to avoid the reschedule
IPIs, since the idle task then checks the TIF_NEED_RESCHED flag in a
loop; nohz=off stops the reprogramming of the lapic timer and other
nohz work. Any idea why sockets=8 gets the best performance?
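
For reference, the workload here is essentially a two-process pipe
ping-pong; a minimal sketch of that pattern (illustration only, not
the actual unixbench context1 source) is:

/* Minimal sketch of a context1-style pipe ping-pong: two processes
 * wake each other through a pair of pipes, so every iteration costs
 * two context switches (and, across cpus, two wakeups). */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int p2c[2], c2p[2];             /* parent->child, child->parent */
        unsigned long token;

        if (pipe(p2c) || pipe(c2p)) {
                perror("pipe");
                return 1;
        }

        if (fork() == 0) {              /* child: echo the token back */
                while (read(p2c[0], &token, sizeof(token)) == sizeof(token))
                        write(c2p[1], &token, sizeof(token));
                _exit(0);
        }

        for (token = 0; token < 1000000; token++) {     /* parent side */
                write(p2c[1], &token, sizeof(token));
                read(c2p[0], &token, sizeof(token));
        }

        close(p2c[1]);                  /* EOF lets the child exit */
        return 0;
}

Every iteration blocks both sides, so when the waker and wakee sit on
different cpus almost every wakeup goes through the idle entry/exit
path and a reschedule IPI.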


Regards,
Wanpeng Li


* Re: unixbench context switch performance & cpu topology
  2018-01-22 11:47 unixbench context switch performance & cpu topology Wanpeng Li
@ 2018-01-22 12:08 ` Mike Galbraith
  2018-01-22 12:27   ` Wanpeng Li
  2018-01-22 12:53 ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Mike Galbraith @ 2018-01-22 12:08 UTC (permalink / raw)
  To: Wanpeng Li, linux-kernel, kvm
  Cc: Paolo Bonzini, Peter Zijlstra, Radim Krcmar, Frederic Weisbecker,
	Thomas Gleixner, Ingo Molnar

On Mon, 2018-01-22 at 19:47 +0800, Wanpeng Li wrote:
> Hi all,
> 
> We can observe that unixbench context switch performance is heavily
> influenced by the cpu topology which is exposed to the guest. The
> scores are posted below, bigger is better. Both the guest and the
> host kernel are 3.15-rc3 (we can also reproduce this against a centos
> 7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
> halt-polling is enabled by default, and the guest is started w/ 8
> logical cpus.
> 
> 
> 
> unixbench context switch
> -smp 8, sockets=8, cores=1, threads=1    382036
> -smp 8, sockets=4, cores=2, threads=1    132480
> -smp 8, sockets=2, cores=4, threads=1    128032
> -smp 8, sockets=2, cores=2, threads=2    131767
> -smp 8, sockets=1, cores=4, threads=2    132742
> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)    331471
> 
> I can observe a lot of reschedule IPIs sent from one vCPU to another
> vCPU. The context switch workload switches between running and idle
> frequently, which results in HLT instructions in the idle path. I use
> idle=poll to avoid the vmexits due to HLT and to avoid the reschedule
> IPIs, since the idle task then checks the TIF_NEED_RESCHED flag in a
> loop; nohz=off stops the reprogramming of the lapic timer and other
> nohz work. Any idea why sockets=8 gets the best performance?

Probably because with that topology, there is no shared llc, thus no
cross-core scheduling, micro-benchmark waker/wakee are stacked.  If
your benchmark does nothing but schedule, stacking makes beautiful (but
utterly meaningless) numbers.

	-Mike


* Re: unixbench context switch performance & cpu topology
  2018-01-22 12:08 ` Mike Galbraith
@ 2018-01-22 12:27   ` Wanpeng Li
  2018-01-22 13:37     ` Mike Galbraith
  0 siblings, 1 reply; 9+ messages in thread
From: Wanpeng Li @ 2018-01-22 12:27 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Zijlstra, Radim Krcmar,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar

2018-01-22 20:08 GMT+08:00 Mike Galbraith <efault@gmx.de>:
> On Mon, 2018-01-22 at 19:47 +0800, Wanpeng Li wrote:
>> Hi all,
>>
>> We can observe that unixbench context switch performance is heavily
>> influenced by the cpu topology which is exposed to the guest. The
>> scores are posted below, bigger is better. Both the guest and the
>> host kernel are 3.15-rc3 (we can also reproduce this against a centos
>> 7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
>> halt-polling is enabled by default, and the guest is started w/ 8
>> logical cpus.
>>
>>
>>
>> unixbench context switch
>> -smp 8, sockets=8, cores=1, threads=1    382036
>> -smp 8, sockets=4, cores=2, threads=1    132480
>> -smp 8, sockets=2, cores=4, threads=1    128032
>> -smp 8, sockets=2, cores=2, threads=2    131767
>> -smp 8, sockets=1, cores=4, threads=2    132742
>> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)    331471
>>
>> I can observe a lot of reschedule IPIs sent from one vCPU to another
>> vCPU. The context switch workload switches between running and idle
>> frequently, which results in HLT instructions in the idle path. I use
>> idle=poll to avoid the vmexits due to HLT and to avoid the reschedule
>> IPIs, since the idle task then checks the TIF_NEED_RESCHED flag in a
>> loop; nohz=off stops the reprogramming of the lapic timer and other
>> nohz work. Any idea why sockets=8 gets the best performance?
>
> Probably because with that topology, there is no shared llc, thus no
> cross-core scheduling, micro-benchmark waker/wakee are stacked.  If
> your benchmark does nothing but schedule, stacking makes beautiful (but
> utterly meaningless) numbers.

The waker and wakee are only sporadically on the same logical cpu in
the guest (-smp 8, sockets=8, cores=1, threads=1) during the test. In
addition, binding the waker/wakee to one logical cpu in the guest
(-smp 8, sockets=1, cores=4, threads=2) also gets performance as good
as the 8-socket setup.

Regards,
Wanpeng Li


* Re: unixbench context switch performance & cpu topology
  2018-01-22 11:47 unixbench context switch performance & cpu topology Wanpeng Li
  2018-01-22 12:08 ` Mike Galbraith
@ 2018-01-22 12:53 ` Peter Zijlstra
  2018-01-23 10:33   ` Wanpeng Li
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2018-01-22 12:53 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krcmar,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar

On Mon, Jan 22, 2018 at 07:47:45PM +0800, Wanpeng Li wrote:
> Hi all,
> 
> We can observe that unixbench context switch performance is heavily
> influenced by the cpu topology which is exposed to the guest. The
> scores are posted below, bigger is better. Both the guest and the
> host kernel are 3.15-rc3 (we can also reproduce this against a centos
> 7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
> halt-polling is enabled by default, and the guest is started w/ 8
> logical cpus.
> 
> 
> 
> unixbench context switch
> -smp 8, sockets=8, cores=1, threads=1    382036
> -smp 8, sockets=4, cores=2, threads=1    132480
> -smp 8, sockets=2, cores=4, threads=1    128032
> -smp 8, sockets=2, cores=2, threads=2    131767
> -smp 8, sockets=1, cores=4, threads=2    132742
> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)    331471
> 
> I can observe a lot of reschedule IPIs sent from one vCPU to another
> vCPU. The context switch workload switches between running and idle
> frequently, which results in HLT instructions in the idle path. I use
> idle=poll to avoid the vmexits due to HLT and to avoid the reschedule
> IPIs, since the idle task then checks the TIF_NEED_RESCHED flag in a
> loop; nohz=off stops the reprogramming of the lapic timer and other
> nohz work. Any idea why sockets=8 gets the best performance?

I suspect because we load-balance less aggressively across nodes than
we do within a cache domain.

Fix your benchmark to pin itself to a single CPU; that's the only
sensible way to obtain this number in any case.
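
Something like the following (untested sketch; cpu affinity is
inherited across fork(), so pinning before the fork covers both the
waker and the wakee) would do:

/* Untested sketch: pin the calling task to one cpu before starting
 * the ping-pong, so the measurement no longer depends on where the
 * load balancer happens to place the two tasks. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {  /* 0 == self */
                perror("sched_setaffinity");
                return -1;
        }
        return 0;
}

Calling pin_to_cpu(0) before the fork() pins both sides to cpu 0.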


* Re: unixbench context switch performance & cpu topology
  2018-01-22 12:27   ` Wanpeng Li
@ 2018-01-22 13:37     ` Mike Galbraith
  2018-01-23 10:36       ` Wanpeng Li
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Galbraith @ 2018-01-22 13:37 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Zijlstra, Radim Krcmar,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar

On Mon, 2018-01-22 at 20:27 +0800, Wanpeng Li wrote:
> 2018-01-22 20:08 GMT+08:00 Mike Galbraith <efault@gmx.de>:
> > On Mon, 2018-01-22 at 19:47 +0800, Wanpeng Li wrote:
> >> Hi all,
> >>
> >> We can observe that unixbench context switch performance is heavily
> >> influenced by the cpu topology which is exposed to the guest. The
> >> scores are posted below, bigger is better. Both the guest and the
> >> host kernel are 3.15-rc3 (we can also reproduce this against a centos
> >> 7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
> >> halt-polling is enabled by default, and the guest is started w/ 8
> >> logical cpus.
> >>
> >>
> >>
> >> unixbench context switch
> >> -smp 8, sockets=8, cores=1, threads=1    382036
> >> -smp 8, sockets=4, cores=2, threads=1    132480
> >> -smp 8, sockets=2, cores=4, threads=1    128032
> >> -smp 8, sockets=2, cores=2, threads=2    131767
> >> -smp 8, sockets=1, cores=4, threads=2    132742
> >> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)    331471
> >>
> >> I can observe a lot of reschedule IPIs sent from one vCPU to another
> >> vCPU. The context switch workload switches between running and idle
> >> frequently, which results in HLT instructions in the idle path. I use
> >> idle=poll to avoid the vmexits due to HLT and to avoid the reschedule
> >> IPIs, since the idle task then checks the TIF_NEED_RESCHED flag in a
> >> loop; nohz=off stops the reprogramming of the lapic timer and other
> >> nohz work. Any idea why sockets=8 gets the best performance?
> >
> > Probably because with that topology, there is no shared llc, thus no
> > cross-core scheduling, micro-benchmark waker/wakee are stacked.  If
> > your benchmark does nothing but schedule, stacking makes beautiful (but
> > utterly meaningless) numbers.
> 
> The waker and wakee are only sporadically on the same logical cpu in
> the guest (-smp 8, sockets=8, cores=1, threads=1) during the test. In
> addition, binding the waker/wakee to one logical cpu in the guest
> (-smp 8, sockets=1, cores=4, threads=2) also gets performance as good
> as the 8-socket setup.

Here, with tip.today and that topology, context1 does stack up on one core.

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND                                               
 4218 root      20   0    4048    808    732 R 52.16 0.022   0:12.77 4 context1                                              
 4219 root      20   0    4048     80      0 S 47.18 0.002   0:11.96 4 context1

There's a bit of bouncing, but the two stack right back up.  But
whatever; as Peter said, the benchmark should pin itself to do this.

	-Mike


* Re: unixbench context switch performance & cpu topology
  2018-01-22 12:53 ` Peter Zijlstra
@ 2018-01-23 10:33   ` Wanpeng Li
  0 siblings, 0 replies; 9+ messages in thread
From: Wanpeng Li @ 2018-01-23 10:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, kvm, Paolo Bonzini, Radim Krcmar,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar

2018-01-22 20:53 GMT+08:00 Peter Zijlstra <peterz@infradead.org>:
> On Mon, Jan 22, 2018 at 07:47:45PM +0800, Wanpeng Li wrote:
>> Hi all,
>>
>> We can observe that unixbench context switch performance is heavily
>> influenced by the cpu topology which is exposed to the guest. The
>> scores are posted below, bigger is better. Both the guest and the
>> host kernel are 3.15-rc3 (we can also reproduce this against a centos
>> 7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
>> halt-polling is enabled by default, and the guest is started w/ 8
>> logical cpus.
>>
>>
>>
>> unixbench context switch
>> -smp 8, sockets=8, cores=1, threads=1    382036
>> -smp 8, sockets=4, cores=2, threads=1    132480
>> -smp 8, sockets=2, cores=4, threads=1    128032
>> -smp 8, sockets=2, cores=2, threads=2    131767
>> -smp 8, sockets=1, cores=4, threads=2    132742
>> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)    331471
>>
>> I can observe a lot of reschedule IPIs sent from one vCPU to another
>> vCPU. The context switch workload switches between running and idle
>> frequently, which results in HLT instructions in the idle path. I use
>> idle=poll to avoid the vmexits due to HLT and to avoid the reschedule
>> IPIs, since the idle task then checks the TIF_NEED_RESCHED flag in a
>> loop; nohz=off stops the reprogramming of the lapic timer and other
>> nohz work. Any idea why sockets=8 gets the best performance?
>
> I suspect because we load-balance less aggressively across nodes than
> we do within a cache domain.

That is true. After taking a closer look with kernelshark, the
context1 task in the guest is migrated to another logical cpu after
several milliseconds for sockets=1, cores=4, threads=2; however, it
stays on one logical cpu for around several seconds for sockets=8,
cores=1, threads=1 before migrating to another one.

>
> Fix your benchmark to pin itself to a single CPU; that's the only
> sensible way to obtain this number in any case.

Yeah, that setup gets good performance. Actually, the two context1
tasks do not stack up on one logical cpu most of the time, as observed
with kernelshark, contrary to Mike's reply. In addition, I can observe
that the total number of RESCHED IPIs in the guest for sockets=1,
cores=4, threads=2 is 4.5 times that of sockets=8, cores=1, threads=1.
Any idea how this can happen? I suspect the TTWU path selects another
idle logical cpu, which makes a RESCHED IPI unavoidable. However,
there is still no performance benefit after I clear SD_BALANCE_WAKE
for the relevant sched_domains.
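
For completeness, one way to sample the RES (rescheduling IPI)
counters inside the guest is a small helper like this (rough sketch,
summed over all cpus; run it before and after the benchmark and diff):

/* Rough sketch: sum the per-cpu RES (rescheduling interrupt) counters
 * from /proc/interrupts.  Sample before and after the run and diff. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[4096];
        FILE *f = fopen("/proc/interrupts", "r");

        if (!f) {
                perror("/proc/interrupts");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                char *p = strstr(line, "RES:");
                unsigned long long sum = 0, v;
                int n;

                if (!p)
                        continue;
                for (p += 4; sscanf(p, "%llu%n", &v, &n) == 1; p += n)
                        sum += v;       /* one column per cpu */
                printf("RES total: %llu\n", sum);
        }
        fclose(f);
        return 0;
}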

Regards,
Wanpeng Li


* Re: unixbench context switch performance & cpu topology
  2018-01-22 13:37     ` Mike Galbraith
@ 2018-01-23 10:36       ` Wanpeng Li
  2018-01-23 13:49         ` Mike Galbraith
  0 siblings, 1 reply; 9+ messages in thread
From: Wanpeng Li @ 2018-01-23 10:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Zijlstra, Radim Krcmar,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar

2018-01-22 21:37 GMT+08:00 Mike Galbraith <efault@gmx.de>:
> On Mon, 2018-01-22 at 20:27 +0800, Wanpeng Li wrote:
>> 2018-01-22 20:08 GMT+08:00 Mike Galbraith <efault@gmx.de>:
>> > On Mon, 2018-01-22 at 19:47 +0800, Wanpeng Li wrote:
>> >> Hi all,
>> >>
>> >> We can observe that unixbench context switch performance is heavily
>> >> influenced by the cpu topology which is exposed to the guest. The
>> >> scores are posted below, bigger is better. Both the guest and the
>> >> host kernel are 3.15-rc3 (we can also reproduce this against a centos
>> >> 7.4 693 guest/host), the LLC is exposed to the guest, kvm adaptive
>> >> halt-polling is enabled by default, and the guest is started w/ 8
>> >> logical cpus.
>> >>
>> >>
>> >>
>> >> unixbench context switch
>> >> -smp 8, sockets=8, cores=1, threads=1    382036
>> >> -smp 8, sockets=4, cores=2, threads=1    132480
>> >> -smp 8, sockets=2, cores=4, threads=1    128032
>> >> -smp 8, sockets=2, cores=2, threads=2    131767
>> >> -smp 8, sockets=1, cores=4, threads=2    132742
>> >> -smp 8, sockets=1, cores=4, threads=2 (guest w/ nohz=off idle=poll)    331471
>> >>
>> >> I can observe a lot of reschedule IPIs sent from one vCPU to another
>> >> vCPU. The context switch workload switches between running and idle
>> >> frequently, which results in HLT instructions in the idle path. I use
>> >> idle=poll to avoid the vmexits due to HLT and to avoid the reschedule
>> >> IPIs, since the idle task then checks the TIF_NEED_RESCHED flag in a
>> >> loop; nohz=off stops the reprogramming of the lapic timer and other
>> >> nohz work. Any idea why sockets=8 gets the best performance?
>> >
>> > Probably because with that topology, there is no shared llc, thus no
>> > cross-core scheduling, micro-benchmark waker/wakee are stacked.  If
>> > your benchmark does nothing but schedule, stacking makes beautiful (but
>> > utterly meaningless) numbers.
>>
>> The waker and wakee are only sporadically on the same logical cpu in
>> the guest (-smp 8, sockets=8, cores=1, threads=1) during the test. In
>> addition, binding the waker/wakee to one logical cpu in the guest
>> (-smp 8, sockets=1, cores=4, threads=2) also gets performance as good
>> as the 8-socket setup.
>
> Here, with tip.today and that topology, context1 does stack up on one core.
>
>  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ P COMMAND
>  4218 root      20   0    4048    808    732 R 52.16 0.022   0:12.77 4 context1
>  4219 root      20   0    4048     80      0 S 47.18 0.002   0:11.96 4 context1
>
> There's a bit of bouncing, but the two stack right back up.  But
> whatever; as Peter said, the benchmark should pin itself to do this.

Thanks for giving it a try, Mike. :) Actually, the two context1 tasks
don't stack up on one logical cpu most of the time, as observed with
kernelshark. Do you have any idea why there are 4.5 times as many
RESCHED IPIs, as mentioned in my other reply in this thread?

Regards,
Wanpeng Li


* Re: unixbench context switch performance & cpu topology
  2018-01-23 10:36       ` Wanpeng Li
@ 2018-01-23 13:49         ` Mike Galbraith
  2018-01-24  8:07           ` Wanpeng Li
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Galbraith @ 2018-01-23 13:49 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Zijlstra, Radim Krcmar,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar

On Tue, 2018-01-23 at 18:36 +0800, Wanpeng Li wrote:
> 
> Thanks for giving it a try, Mike. :) Actually, the two context1 tasks
> don't stack up on one logical cpu most of the time, as observed with
> kernelshark. Do you have any idea why there are 4.5 times as many
> RESCHED IPIs, as mentioned in my other reply in this thread?

See resched_curr().
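
Roughly (a paraphrased sketch, not the exact source -- see
kernel/sched/core.c in your tree), the relevant logic is:

/* Paraphrased sketch of resched_curr(): the IPI is only sent when the
 * target runqueue belongs to a remote cpu AND its current task is not
 * already polling on TIF_NEED_RESCHED (which is exactly what the
 * idle=poll loop does), so stacking waker/wakee or a polling idle
 * task both make the RES count drop. */
static void resched_curr_sketch(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        int cpu = cpu_of(rq);

        if (test_tsk_need_resched(curr))        /* already marked */
                return;

        if (cpu == smp_processor_id()) {        /* local: just set the flag */
                set_tsk_need_resched(curr);
                return;
        }

        if (set_nr_and_not_polling(curr))       /* remote and not polling */
                smp_send_reschedule(cpu);       /* -> RES IPI */
}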

	-Mike


* Re: unixbench context switch performance & cpu topology
  2018-01-23 13:49         ` Mike Galbraith
@ 2018-01-24  8:07           ` Wanpeng Li
  0 siblings, 0 replies; 9+ messages in thread
From: Wanpeng Li @ 2018-01-24  8:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, kvm, Paolo Bonzini, Peter Zijlstra, Radim Krcmar,
	Frederic Weisbecker, Thomas Gleixner, Ingo Molnar

2018-01-23 21:49 GMT+08:00 Mike Galbraith <efault@gmx.de>:
> On Tue, 2018-01-23 at 18:36 +0800, Wanpeng Li wrote:
>>
>> Thanks for giving it a try, Mike. :) Actually, the two context1 tasks
>> don't stack up on one logical cpu most of the time, as observed with
>> kernelshark. Do you have any idea why there are 4.5 times as many
>> RESCHED IPIs, as mentioned in my other reply in this thread?
>
> See resched_curr().

Yeah, after more digging I observe that the writer/reader pair
sometimes runs on the same core. Thanks Mike!

Regards,
Wanpeng Li

