* netperf ~50% regression with 2.6.33-rc1, bisect to 1b9508f
From: Lin Ming @ 2010-01-25 10:03 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra, Ingo Molnar; +Cc: Zhang, Yanmin, lkml

Hi, 

The netperf loopback regression has come back.
The UDP stream test shows a ~50% regression with 2.6.33-rc1 compared to 2.6.32.

Testing machine: Nehalem, 2 sockets, 4 cores, hyper-threading, 4G mem
The server and client are bound to different physical CPUs.

taskset -c 15 ./netserver
taskset -c 0 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50,3 -I 99,5
-- -P 12384,12888 -s 32768 -S 32768 -m 1024

Bisect to below commit,

commit 1b9508f6831e10d53256825de8904caa22d1ca2c
Author: Mike Galbraith <efault@gmx.de>
Date:   Wed Nov 4 17:53:50 2009 +0100

    sched: Rate-limit newidle
    
    Rate limit newidle to migration_cost. It's a win for all
    stages of sysbench oltp tests.
    
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <new-submission>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

Interestingly, this commit originally fixed a similar UDP stream test
regression on a Tulsa machine in 2.6.32-rc1; see the thread below:
http://marc.info/?t=125722014400001&r=1&w=2
But now it introduces a regression on a Nehalem machine.

This regression seems to be caused by a large number of rescheduling IPIs.

Perf top data is shown below; note "default_send_IPI_mask_sequence_phys":

             samples  pcnt function                            DSO
             _______ _____ ___________________________________ _________________

             1407.00  6.8% _spin_lock_bh                       [kernel.kallsyms]
             1330.00  6.4% copy_user_generic_string            [kernel.kallsyms]
             1017.00  4.9% _spin_lock_irq                      [kernel.kallsyms]
              968.00  4.7% default_send_IPI_mask_sequence_phys [kernel.kallsyms]
              891.00  4.3% acpi_os_read_port                   [kernel.kallsyms]
              818.00  4.0% sock_alloc_send_pskb                [kernel.kallsyms]
              776.00  3.8% _spin_lock_irqsave                  [kernel.kallsyms]
              757.00  3.7% __udp4_lib_lookup                   [kernel.kallsyms]

/proc/interrupts shows a lot of "Rescheduling interrupts" that are sent
from CPU 0 (client) to CPU 15 (server).

With the above commit, idle balancing is rate limited, so CPU 15 (the
server, waiting for data from the client) is idle most of the time.

CPU 0 (client) executes the following path:

try_to_wake_up
   check_preempt_curr_idle
      resched_task
         smp_send_reschedule

This causes a lot of rescheduling IPIs.

This commit can't be cleanly reverted due to conflicts, so I just applied
the change below to disable "Rate-limit newidle", and the performance
recovered.

diff --git a/kernel/sched.c b/kernel/sched.c
index 18cceee..588fdef 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4421,9 +4421,6 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
 
 	this_rq->idle_stamp = this_rq->clock;
 
-	if (this_rq->avg_idle < sysctl_sched_migration_cost)
-		return;
-
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 

Lin Ming



* Re: netperf ~50% regression with 2.6.33-rc1, bisect to 1b9508f
From: Mike Galbraith @ 2010-01-25 11:35 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, Ingo Molnar, Zhang, Yanmin, lkml

On Mon, 2010-01-25 at 18:03 +0800, Lin Ming wrote:

> With the above commit, idle balancing is rate limited, so CPU 15 (the
> server, waiting for data from the client) is idle most of the time.
> 
> CPU 0 (client) executes the following path:
> 
> try_to_wake_up
>    check_preempt_curr_idle
>       resched_task
>          smp_send_reschedule
> 
> This causes a lot of rescheduling IPIs.
> 
> This commit can't be cleanly reverted due to conflicts, so I just applied
> the change below to disable "Rate-limit newidle", and the performance
> recovered.
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 18cceee..588fdef 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4421,9 +4421,6 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
>  
>  	this_rq->idle_stamp = this_rq->clock;
>  
> -	if (this_rq->avg_idle < sysctl_sched_migration_cost)
> -		return;
> -
>  	for_each_domain(this_cpu, sd) {
>  		unsigned long interval;
>  

Heh, so you should see the same thing with newidle disabled, as it was
in .31 and many kernels prior.  Do you?

So.  Rummaging around doing absolutely _nothing_ useful, there being
zero movable tasks in this load, prevents us switching to the idle
thread before the scheduling task is requeued.  Oh joy.

Hm.... <imagines Peter's reaction to a busy wait> :)  OTOH, that's what
it's _doing_ with above patch.

	-Mike





* Re: netperf ~50% regression with 2.6.33-rc1, bisect to 1b9508f
From: Mike Galbraith @ 2010-01-25 11:45 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, Ingo Molnar, Zhang, Yanmin, lkml

On Mon, 2010-01-25 at 12:35 +0100, Mike Galbraith wrote:

> So.  Rummaging around doing absolutely _nothing_ useful, there being
> zero movable tasks in this load, prevents us switching to the idle
> thread before the scheduling task is requeued.  Oh joy.

Wait a minute.  The only way that can work is if we hit a contended lock,
and drop our rq lock (and where would busiest->nr_running > 1 come from
with a 2 task load?)

Puzzled.

	-Mike



* Re: netperf ~50% regression with 2.6.33-rc1, bisect to 1b9508f
From: Peter Zijlstra @ 2010-01-25 14:03 UTC (permalink / raw)
  To: Lin Ming; +Cc: Mike Galbraith, Ingo Molnar, Zhang, Yanmin, lkml

On Mon, 2010-01-25 at 18:03 +0800, Lin Ming wrote:
> Hi, 
> 
> The netperf loopback regression has come back.
> The UDP stream test shows a ~50% regression with 2.6.33-rc1 compared to 2.6.32.
> 
> Testing machine: Nehalem, 2 sockets, 4 cores, hyper-threading, 4G mem
> The server and client are bound to different physical CPUs.
> 
> taskset -c 15 ./netserver
> taskset -c 0 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50,3 -I 99,5
> -- -P 12384,12888 -s 32768 -S 32768 -m 1024

I cannot reproduce this on a dual-socket Nehalem system; I get current
-linus to be about 10% faster than .32.

(Had to drop the -i -I thingies, otherwise this took ages)



* Re: netperf ~50% regression with 2.6.33-rc1, bisect to 1b9508f
From: Mike Galbraith @ 2010-01-25 14:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Lin Ming, Ingo Molnar, Zhang, Yanmin, lkml

On Mon, 2010-01-25 at 15:03 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-25 at 18:03 +0800, Lin Ming wrote:
> > Hi, 
> > 
> > The netperf loopback regression has come back.
> > The UDP stream test shows a ~50% regression with 2.6.33-rc1 compared to 2.6.32.
> > 
> > Testing machine: Nehalem, 2 sockets, 4 cores, hyper-threading, 4G mem
> > The server and client are bound to different physical CPUs.
> > 
> > taskset -c 15 ./netserver
> > taskset -c 0 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50,3 -I 99,5
> > -- -P 12384,12888 -s 32768 -S 32768 -m 1024
> 
> I cannot reproduce this on a dual socket nehalem system, I get current
> -linus to be about 10% faster than .32.

Nogo here too, but no nice HW here.

	-Mike



* Re: netperf ~50% regression with 2.6.33-rc1, bisect to 1b9508f
From: Lin Ming @ 2010-01-26  9:03 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, Ingo Molnar, Zhang, Yanmin, lkml

On Mon, 2010-01-25 at 19:35 +0800, Mike Galbraith wrote:
> On Mon, 2010-01-25 at 18:03 +0800, Lin Ming wrote:
> 
> > With the above commit, idle balancing is rate limited, so CPU 15 (the
> > server, waiting for data from the client) is idle most of the time.
> > 
> > CPU 0 (client) executes the following path:
> > 
> > try_to_wake_up
> >    check_preempt_curr_idle
> >       resched_task
> >          smp_send_reschedule
> > 
> > This causes a lot of rescheduling IPIs.
> > 
> > This commit can't be cleanly reverted due to conflicts, so I just applied
> > the change below to disable "Rate-limit newidle", and the performance
> > recovered.
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 18cceee..588fdef 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4421,9 +4421,6 @@ static void idle_balance(int this_cpu, struct rq *this_rq)
> >  
> >  	this_rq->idle_stamp = this_rq->clock;
> >  
> > -	if (this_rq->avg_idle < sysctl_sched_migration_cost)
> > -		return;
> > -
> >  	for_each_domain(this_cpu, sd) {
> >  		unsigned long interval;
> >  
> 
> Heh, so you should see the same thing with newidle disabled, as it was
> in .31 and many kernels prior.  Do you?

Weird.
2.6.31 does not generate so many reschedule IPIs.

This Nehalem machine has 3 domain levels,
$ grep . cpu0/domain*/name
cpu0/domain0/name:SIBLING
cpu0/domain1/name:MC
cpu0/domain2/name:NODE

For 2.6.31, SD_BALANCE_NEWIDLE is only set on the SIBLING level.
For 2.6.32-rc1, SD_BALANCE_NEWIDLE is set on all 3 levels.

I can see many reschedule IPIs in 2.6.32-rc1 if SD_BALANCE_NEWIDLE is
cleared on all 3 levels.
But for 2.6.31, I didn't see so many IPIs even with SD_BALANCE_NEWIDLE
cleared on the SIBLING level.

So it seems something changed between 2.6.31 and 2.6.32-rc1.
I'll bisect ...

Lin Ming


