* UDP-U stream performance regression on 32-rc1 kernel
@ 2009-11-03  3:47 Alex Shi
  2009-11-03  4:33 ` Zhang, Yanmin
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Shi @ 2009-11-03  3:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, Zhang, Yanmin

We found that the UDP-U 1k/4k stream tests of the netperf benchmark
regress by 10% to 20% on our Tulsa and some NHM (Nehalem) machines.
Bisecting found that it is due to the following commit.

commit 840a0653100dbde599ae8ddf83fa214dfa5fd1aa
Author: Ingo Molnar <mingo@elte.hu>
Date:   Fri Sep 4 11:32:54 2009 +0200

    sched: Turn on SD_BALANCE_NEWIDLE
    
    Start the re-tuning of the balancer by turning on newidle.
    
    It improves hackbench performance and parallelism on a 4x4 box.
    The "perf stat --repeat 10" measurements give us:
    
      domain0             domain1
      .......................................
     -SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
       2041.273208  task-clock-msecs         #      9.354 CPUs    ( +-   0.363% )
    
     +SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
       2086.326925  task-clock-msecs         #     11.934 CPUs    ( +-   0.301% )
    
     +SD_BALANCE_NEWIDLE +SD_BALANCE_NEWIDLE:
       2115.289791  task-clock-msecs         #     12.158 CPUs    ( +-   0.263% )

BRGs    
Alex 



* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-03  3:47 UDP-U stream performance regression on 32-rc1 kernel Alex Shi
@ 2009-11-03  4:33 ` Zhang, Yanmin
  2009-11-03  9:09   ` Mike Galbraith
  2009-11-03 17:45   ` Ingo Molnar
  0 siblings, 2 replies; 12+ messages in thread
From: Zhang, Yanmin @ 2009-11-03  4:33 UTC (permalink / raw)
  To: alex.shi; +Cc: linux-kernel, mingo, Peter Zijlstra, Mike Galbraith

On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> We found the UDP-U 1k/4k stream of netperf benchmark have some
> performance regression from 10% to 20% on our Tulsa and some NHM
> machines. 

perf events show that find_busiest_group() consumes about 4.5% of cpu time
with the patch, while it consumes only 0.5% of cpu time without the patch.

The communication between netperf client and netserver is very fast.
When netserver receives a message and there is no new message available,
it goes to sleep and scheduler calls idle_balance => load_balance_newidle.
load_balance_newidle spends too much time and a new message arrives quickly
before load_balance_newidle ends.

As the comments in the patch say hackbench benefits from it, I tested hackbench
on Nehalem and core2 machines. hackbench does benefit from it, about 6% on
nehalem machines, but doesn't benefit on core2 machines.

Yanmin

> Bisecting found it is due to the following commit.
> 
> commit 840a0653100dbde599ae8ddf83fa214dfa5fd1aa
> Author: Ingo Molnar <mingo@elte.hu>
> Date:   Fri Sep 4 11:32:54 2009 +0200




* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-03  4:33 ` Zhang, Yanmin
@ 2009-11-03  9:09   ` Mike Galbraith
  2009-11-03 17:45   ` Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2009-11-03  9:09 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: alex.shi, linux-kernel, mingo, Peter Zijlstra

On Tue, 2009-11-03 at 12:33 +0800, Zhang, Yanmin wrote:
> On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > performance regression from 10% to 20% on our Tulsa and some NHM
> > machines. 
> 
> perf events shows function find_busiest_group consumes about 4.5% cpu time
> with the patch while it only consumes 0.5% cpu time without the patch.
> 
> The communication between netperf client and netserver is very fast.
> When netserver receives a message and there is no new message available,
> it goes to sleep and scheduler calls idle_balance => load_balance_newidle.
> load_balance_newidle spends too much time and a new message arrives quickly
> before load_balance_newidle ends.

I have a similar problem wrt ramp-up and affinity, so will certainly be
doing battle with the thing here.

It's harming mysql+oltp and pgsql+oltp ramp up, and modest load in
general by pulling at the first micro-sleep.  After twiddling
wake_affine() to spread to a shared cache, newidle comes along and
throws a wrench into my plans an eyeblink later.

> As the comments in the patch say hackbench benefits from it, I tested hackbench
> on Nehalem and core2 machines. hackbench does benefit from it, about 6% on
> nehalem machines, but doesn't benefit on core2 machines.

It depends a lot on the load.  I have a testcase which spawns threads at
a ~high rate.  There, turning it off costs ~42% on my little Q6600 box.
It's also a modest utilization win for a kbuild.

In any case though, it certainly wants a couple fangs pulled.

	-Mike



* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-03  4:33 ` Zhang, Yanmin
  2009-11-03  9:09   ` Mike Galbraith
@ 2009-11-03 17:45   ` Ingo Molnar
  2009-11-04  1:55     ` Zhang, Yanmin
  1 sibling, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2009-11-03 17:45 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: alex.shi, linux-kernel, Peter Zijlstra, Mike Galbraith


* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:

> On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > performance regression from 10% to 20% on our Tulsa and some NHM
> > machines. 
>  perf events shows function find_busiest_group consumes about 4.5% cpu 
> time with the patch while it only consumes 0.5% cpu time without the 
> patch.
> 
> The communication between netperf client and netserver is very fast. 
> When netserver receives a message and there is no new message 
> available, it goes to sleep and scheduler calls idle_balance => 
> load_balance_newidle. load_balance_newidle spends too much time and a 
> new message arrives quickly before load_balance_newidle ends.
> 
> As the comments in the patch say hackbench benefits from it, I tested 
> hackbench on Nehalem and core2 machines. hackbench does benefit from 
> it, about 6% on nehalem machines, but doesn't benefit on core2 
> machines.

Can you confirm that -tip:

  http://people.redhat.com/mingo/tip.git/README

has it fixed (or at least improved)?

	Ingo


* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-03 17:45   ` Ingo Molnar
@ 2009-11-04  1:55     ` Zhang, Yanmin
  2009-11-04 12:07       ` Mike Galbraith
  0 siblings, 1 reply; 12+ messages in thread
From: Zhang, Yanmin @ 2009-11-04  1:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: alex.shi, linux-kernel, Peter Zijlstra, Mike Galbraith

On Tue, 2009-11-03 at 18:45 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> 
> > On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > > performance regression from 10% to 20% on our Tulsa and some NHM
> > > machines. 
> >  perf events shows function find_busiest_group consumes about 4.5% cpu 
> > time with the patch while it only consumes 0.5% cpu time without the 
> > patch.
> > 
> > The communication between netperf client and netserver is very fast. 
> > When netserver receives a message and there is no new message 
> > available, it goes to sleep and scheduler calls idle_balance => 
> > load_balance_newidle. load_balance_newidle spends too much time and a 
> > new message arrives quickly before load_balance_newidle ends.
> > 
> > As the comments in the patch say hackbench benefits from it, I tested 
> > hackbench on Nehalem and core2 machines. hackbench does benefit from 
> > it, about 6% on nehalem machines, but doesn't benefit on core2 
> > machines.
> 
> Can you confirm that -tip:
> 
>   http://people.redhat.com/mingo/tip.git/README
> 
> has it fixed (or at least improved)?
The latest -tip improves the netperf loopback result, but doesn't fix it
completely. For example, on a Nehalem machine, netperf UDP-U-1k has
about a 25% regression, but with the -tip kernel, the regression drops to
less than 10%.

yanmin




* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-04  1:55     ` Zhang, Yanmin
@ 2009-11-04 12:07       ` Mike Galbraith
  2009-11-05  2:20         ` Zhang, Yanmin
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2009-11-04 12:07 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Ingo Molnar, alex.shi, linux-kernel, Peter Zijlstra

On Wed, 2009-11-04 at 09:55 +0800, Zhang, Yanmin wrote:
> On Tue, 2009-11-03 at 18:45 +0100, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > 
> > > On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > > > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > > > performance regression from 10% to 20% on our Tulsa and some NHM
> > > > machines. 
> > >  perf events shows function find_busiest_group consumes about 4.5% cpu 
> > > time with the patch while it only consumes 0.5% cpu time without the 
> > > patch.
> > > 
> > > The communication between netperf client and netserver is very fast. 
> > > When netserver receives a message and there is no new message 
> > > available, it goes to sleep and scheduler calls idle_balance => 
> > > load_balance_newidle. load_balance_newidle spends too much time and a 
> > > new message arrives quickly before load_balance_newidle ends.
> > > 
> > > As the comments in the patch say hackbench benefits from it, I tested 
> > > hackbench on Nehalem and core2 machines. hackbench does benefit from 
> > > it, about 6% on nehalem machines, but doesn't benefit on core2 
> > > machines.
> > 
> > Can you confirm that -tip:
> > 
> >   http://people.redhat.com/mingo/tip.git/README
> > 
> > has it fixed (or at least improved)?
> The latest tips improves netperf loopback result, but doesn't fix it
> thoroughly. For example, on a Nehalem machine, netperf UDP-U-1k has
> about 25% regression, but with the tips kernel, the regression becomes
> less than 10%.

Can you try the below, and send me your UDP-U-1k args so I can try it?  

The below shows promise for stopping newidle from harming cache, though
it needs to be more clever than a holdoff.  The fact that it only harms
the _very_ sensitive to idle time x264 testcase by 5% shows some
promise.

tip v2.6.32-rc6-1731-gc5bb4b1
tbench 8                                1044.66 MB/sec 8 procs
x264 8                                  366.58 frames/sec -start_debit 392.24 fps -newidle 215.34 fps

tip+ v2.6.32-rc6-1731-gc5bb4b1
tbench 8                                1040.08 MB/sec 8 procs .995
x264 8                                  350.23 frames/sec -start_debit 371.76
                                          .955                           .947

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10447.14   19734.58   36038.18   35776.85   34662.76   33682.30   32256.22   28770.99   25323.23
             10462.61   19580.14   36050.48   35942.63   35054.84   33988.40   32423.89   29259.65   25892.24
             10501.02   19231.27   36007.03   35985.32   35060.79   33945.47   32400.42   29140.84   25716.16
tip avg      10470.25   19515.33   36031.89   35901.60   34926.13   33872.05   32360.17   29057.16   25643.87

tip+         10594.32   19912.01   36320.45   35904.71   35100.37   34003.38   32453.04   28413.57   23871.22
             10667.96   20000.17   36533.72   36472.19   35371.35   34208.85   32617.80   28893.55   24499.34
             10463.25   19915.69   36657.20   36419.08   35403.15   34041.80   32612.94   28835.82   24323.52
tip+ avg     10575.17   19942.62   36503.79   36265.32   35291.62   34084.67   32561.26   28714.31   24231.36
                1.010      1.021      1.013      1.010      1.010      1.006      1.006       .988       .944


---
 kernel/sched.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -590,6 +590,7 @@ struct rq {
 
 	u64 rt_avg;
 	u64 age_stamp;
+	u64 newidle_ratelimit;
 #endif
 
 	/* calc_load related fields */
@@ -2383,6 +2384,8 @@ static int try_to_wake_up(struct task_st
 	if (rq != orig_rq)
 		update_rq_clock(rq);
 
+	rq->newidle_ratelimit = rq->clock;
+
 	WARN_ON(p->state != TASK_WAKING);
 	cpu = task_cpu(p);
 
@@ -4427,6 +4430,12 @@ static void idle_balance(int this_cpu, s
 	struct sched_domain *sd;
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
+	u64 delta = this_rq->clock - this_rq->newidle_ratelimit;
+
+	if (delta < sysctl_sched_migration_cost)
+		return;
+
+	this_rq->newidle_ratelimit = this_rq->clock;
 
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
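
The holdoff in the patch above can be modelled in user space. The following
is a minimal sketch, not kernel code: struct toy_rq, toy_wakeup() and
toy_newidle_allowed() are hypothetical names, and HOLDOFF_NS is an assumed
stand-in for sysctl_sched_migration_cost (taken here as 500000 ns). It only
mirrors the three added hunks: a wakeup stamps the runqueue, and newidle
balancing is skipped until at least one holdoff period has passed since the
last stamp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed holdoff in nanoseconds, standing in for sysctl_sched_migration_cost. */
#define HOLDOFF_NS 500000ULL

/* Hypothetical, reduced stand-in for the two struct rq fields the patch uses. */
struct toy_rq {
	uint64_t clock;             /* current runqueue time, ns */
	uint64_t newidle_ratelimit; /* time of last wakeup or last newidle balance */
};

/* Mirrors the try_to_wake_up() hunk: every wakeup restarts the holdoff. */
static void toy_wakeup(struct toy_rq *rq, uint64_t now)
{
	rq->clock = now;
	rq->newidle_ratelimit = rq->clock;
}

/*
 * Mirrors the idle_balance() hunk: return false (skip balancing) if the CPU
 * went idle less than HOLDOFF_NS after the last stamp, otherwise restamp and
 * return true (balancing would proceed).
 */
static bool toy_newidle_allowed(struct toy_rq *rq, uint64_t now)
{
	uint64_t delta;

	rq->clock = now;
	delta = rq->clock - rq->newidle_ratelimit;
	if (delta < HOLDOFF_NS)
		return false;

	rq->newidle_ratelimit = rq->clock;
	return true;
}
```

With these assumptions, a task that wakes and goes idle again 100 us later
never reaches the balancer, which is the netperf case described earlier in
the thread; a CPU that has been left alone for 500 us or more still balances.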






* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-04 12:07       ` Mike Galbraith
@ 2009-11-05  2:20         ` Zhang, Yanmin
  2009-11-05  5:20           ` Mike Galbraith
  0 siblings, 1 reply; 12+ messages in thread
From: Zhang, Yanmin @ 2009-11-05  2:20 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, alex.shi, linux-kernel, Peter Zijlstra

On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> On Wed, 2009-11-04 at 09:55 +0800, Zhang, Yanmin wrote:
> > On Tue, 2009-11-03 at 18:45 +0100, Ingo Molnar wrote:
> > > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> > > 
> > > > On Tue, 2009-11-03 at 11:47 +0800, Alex Shi wrote:
> > > > > We found the UDP-U 1k/4k stream of netperf benchmark have some
> > > > > performance regression from 10% to 20% on our Tulsa and some NHM
> > > > > machines. 
> > > >  perf events shows function find_busiest_group consumes about 4.5% cpu 
> > > > time with the patch while it only consumes 0.5% cpu time without the 
> > > > patch.
> > > > 
> > > > The communication between netperf client and netserver is very fast. 
> > > > When netserver receives a message and there is no new message 
> > > > available, it goes to sleep and scheduler calls idle_balance => 
> > > > load_balance_newidle. load_balance_newidle spends too much time and a 
> > > > new message arrives quickly before load_balance_newidle ends.
> > > > 
> > > > As the comments in the patch say hackbench benefits from it, I tested 
> > > > hackbench on Nehalem and core2 machines. hackbench does benefit from 
> > > > it, about 6% on nehalem machines, but doesn't benefit on core2 
> > > > machines.
> > > 
> > > Can you confirm that -tip:
> > > 
> > >   http://people.redhat.com/mingo/tip.git/README
> > > 
> > > has it fixed (or at least improved)?
> > The latest tips improves netperf loopback result, but doesn't fix it
> > thoroughly. For example, on a Nehalem machine, netperf UDP-U-1k has
> > about 25% regression, but with the tips kernel, the regression becomes
> > less than 10%.
> 

> Can you try the below, and send me 
I tested it on a Nehalem machine against the latest -tip kernel. The netperf
loopback result is good and the regression disappears.

tbench result has no improvement.

> your UDP-U-1k args so I can try it? 
#taskset -c 0 ./netserver
#taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096

Please check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
same physical cpu.
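
The same check can be scripted. The sketch below does not parse
/proc/cpuinfo; it swaps in the sysfs topology file
/sys/devices/system/cpu/cpuN/topology/physical_package_id, assuming that file
is present on the test machine. The function names are hypothetical, and an
unreadable file is reported as "unknown" rather than guessed.

```c
#include <stdio.h>

/* Read an integer package id from a sysfs-style file; -1 if it cannot be read. */
static int read_package_id(const char *path)
{
	FILE *f = fopen(path, "r");
	int id = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &id) != 1)
		id = -1;
	fclose(f);
	return id;
}

/*
 * Returns 1 if the two cpus sit in different physical packages, 0 if they
 * share one, and -1 if either package id is unavailable.
 */
static int cpus_on_different_packages(int cpu_a, int cpu_b)
{
	char path_a[128], path_b[128];
	int a, b;

	snprintf(path_a, sizeof(path_a),
		 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu_a);
	snprintf(path_b, sizeof(path_b),
		 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu_b);

	a = read_package_id(path_a);
	b = read_package_id(path_b);
	if (a < 0 || b < 0)
		return -1;
	return a != b;
}
```

For the binding used here, cpus_on_different_packages(0, 15) should return 1
before the netserver/netperf pair is started.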

I also ran sysbench(oltp)+mysql testing with thread numbers 14,16,18,20,32,64,128. The average
number is good. If I compare every single result against 2.6.32-rc5's, I find that the results for
thread numbers 14,16,18,20,32 are better than 2.6.32-rc5's, but the results for 64 and 128 are worse.
128's is the worst.

>  
> 
> The below shows promise for stopping newidle from harming cache, though
> it needs to be more clever than a holdoff.  The fact that it only harms
> the _very_ sensitive to idle time x264 testcase by 5% shows some
> promise.
> 
> tip v2.6.32-rc6-1731-gc5bb4b1
> tbench 8                                1044.66 MB/sec 8 procs
> x264 8                                  366.58 frames/sec -start_debit 392.24 fps -newidle 215.34 fps
> 
> tip+ v2.6.32-rc6-1731-gc5bb4b1
> tbench 8                                1040.08 MB/sec 8 procs .995
> x264 8                                  350.23 frames/sec -start_debit 371.76
>                                           .955                           .947
> 
> mysql+oltp
> clients             1          2          4          8         16         32         64        128        256
> tip          10447.14   19734.58   36038.18   35776.85   34662.76   33682.30   32256.22   28770.99   25323.23
>              10462.61   19580.14   36050.48   35942.63   35054.84   33988.40   32423.89   29259.65   25892.24
>              10501.02   19231.27   36007.03   35985.32   35060.79   33945.47   32400.42   29140.84   25716.16
> tip avg      10470.25   19515.33   36031.89   35901.60   34926.13   33872.05   32360.17   29057.16   25643.87
> 
> tip+         10594.32   19912.01   36320.45   35904.71   35100.37   34003.38   32453.04   28413.57   23871.22
>              10667.96   20000.17   36533.72   36472.19   35371.35   34208.85   32617.80   28893.55   24499.34
>              10463.25   19915.69   36657.20   36419.08   35403.15   34041.80   32612.94   28835.82   24323.52
> tip+ avg     10575.17   19942.62   36503.79   36265.32   35291.62   34084.67   32561.26   28714.31   24231.36
>                 1.010      1.021      1.013      1.010      1.010      1.006      1.006       .988       .944
> 
> 
> ---
>  kernel/sched.c |    9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -590,6 +590,7 @@ struct rq {
>  
>  	u64 rt_avg;
>  	u64 age_stamp;
> +	u64 newidle_ratelimit;
>  #endif
>  
>  	/* calc_load related fields */
> @@ -2383,6 +2384,8 @@ static int try_to_wake_up(struct task_st
>  	if (rq != orig_rq)
>  		update_rq_clock(rq);
>  
> +	rq->newidle_ratelimit = rq->clock;
> +
>  	WARN_ON(p->state != TASK_WAKING);
>  	cpu = task_cpu(p);
>  
> @@ -4427,6 +4430,12 @@ static void idle_balance(int this_cpu, s
>  	struct sched_domain *sd;
>  	int pulled_task = 0;
>  	unsigned long next_balance = jiffies + HZ;
> +	u64 delta = this_rq->clock - this_rq->newidle_ratelimit;
> +
> +	if (delta < sysctl_sched_migration_cost)
> +		return;
> +
> +	this_rq->newidle_ratelimit = this_rq->clock;
>  
>  	for_each_domain(this_cpu, sd) {
>  		unsigned long interval;
> 
> 
> 
> 



* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-05  2:20         ` Zhang, Yanmin
@ 2009-11-05  5:20           ` Mike Galbraith
  2009-11-05  7:03             ` Mike Galbraith
  2009-11-05  7:44             ` Zhang, Yanmin
  0 siblings, 2 replies; 12+ messages in thread
From: Mike Galbraith @ 2009-11-05  5:20 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Ingo Molnar, alex.shi, linux-kernel, Peter Zijlstra

On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:

> > Can you try the below, and send me 
> I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> result is good and regression disappears.

Excellent.  Ingo has picked up a version in tip (1b9508f) which has zero
negative effect on my x264 testcase, and is a win for mysql+oltp through
the whole test spectrum.  As that may (dunno, Ingo?) now be considered a
regression fix, ie candidate for 32.final, testing that it does no harm
to your big machines would be a good thing.  (pretty please?:)

> tbench result has no improvement.

Can you remind me where we stand on tbench?

> > your UDP-U-1k args so I can try it? 
> #taskset -c 0 ./netserver
> #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096
> 
> Pls. check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
> same physical cpu.

Thanks. My little box doesn't have a 15 (darn) so 0,3 will have to do.

> I also run sysbench(oltp)+mysql testing with thread number 14,16,18,20,32,64,128. The average
> number is good. If I compare every single result against 2.6.32-rc5's, I find thread number
> 14,16,18,20,32's result are better than 2.6.32-rc5's, but 64,128's result are worse. 128's is
> the worst.

Hm.  That's disconcerting.  However, that patch isn't going anywhere but
to the bitwolf anyway (diagnostic).  If 1b9508f regresses, that will be
a problem.  With diag, my box also regressed at the tail.  Balancing a
bit seems to help mysql once it starts tripping all over itself, it
improves the decay curve markedly.  1b9508f does brief bursts of newidle
balancing when idle time climbs, which translated to a ~6% improvement
at 256 clients on my little quad.

	-Mike



* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-05  5:20           ` Mike Galbraith
@ 2009-11-05  7:03             ` Mike Galbraith
  2009-11-05  8:57               ` Mike Galbraith
  2009-11-05  7:44             ` Zhang, Yanmin
  1 sibling, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2009-11-05  7:03 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Ingo Molnar, alex.shi, linux-kernel, Peter Zijlstra

On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> 
> > > Can you try the below, and send me 
> > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > result is good and regression disappears.
> 
> Excellent.  Ingo has picked up a version in tip (1b9508f) which has zero
> negative effect on my x264 testcase, and is a win for mysql+oltp through
> the whole test spectrum.  As that may (dunno, Ingo?) now be considered a
> regression fix, ie candidate for 32.final, testing that it does no harm
> to your big machines would be a good thing.  (pretty please?:)

Egad. Size XXL difference on my cheap Q6600 box

git v2.6.32-rc6-26-g91d3f9b
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00     7793073      0    4256.06
 65536           60.00     7780487           4249.18

git v2.6.32-rc6-26-g91d3f9b + 1b9508f
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00     15133547      0    8264.93
 65536           60.00     15131466           8263.80

tip v2.6.32-rc6-1796-gd995f1d
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

 65536    4096   60.00     13998562      0    7645.08
 65536           60.00     13986112           7638.28 (uhoh, tinker time.)
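
The Throughput column in the tables above follows from the message count:
messages times message size times 8 bits, divided by the elapsed time. A
minimal sketch, assuming the displayed 60.00 s elapsed time is exact and that
10^6bits/sec means decimal megabits; the function name is hypothetical:

```c
/*
 * Throughput in 10^6 bits/sec for a UDP stream test, computed from the
 * number of messages, the message size in bytes and the elapsed seconds.
 */
static double udp_throughput_mbps(unsigned long long messages,
				  unsigned int msg_bytes,
				  double elapsed_secs)
{
	double bits = (double)messages * (double)msg_bytes * 8.0;

	return bits / elapsed_secs / 1e6;
}
```

For the first table, udp_throughput_mbps(7793073, 4096, 60.0) gives about
4256.06, and for the second, udp_throughput_mbps(15133547, 4096, 60.0) gives
about 8264.93, both matching the reported send-side figures.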

	-Mike



* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-05  5:20           ` Mike Galbraith
  2009-11-05  7:03             ` Mike Galbraith
@ 2009-11-05  7:44             ` Zhang, Yanmin
  2009-11-05  8:10               ` Mike Galbraith
  1 sibling, 1 reply; 12+ messages in thread
From: Zhang, Yanmin @ 2009-11-05  7:44 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, alex.shi, linux-kernel, Peter Zijlstra

On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> 
> > > Can you try the below, and send me 
> > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > result is good and regression disappears.
> 
> Excellent.  Ingo has picked up a version in tip (1b9508f) which has zero
> negative effect on my x264 testcase, and is a win for mysql+oltp through
> the whole test spectrum.  As that may (dunno, Ingo?) now be considered a
> regression fix, ie candidate for 32.final, testing that it does no harm
> to your big machines would be a good thing.  (pretty please?:)
I tested the latest -tip kernel, which includes commit 1b9508f.
Compared with 2.6.31, netperf loopback UDP-U-4k has about a 2% regression.

The sysbench(oltp)+mysql result is pretty good, about a 2% improvement over
2.6.31's.



> 
> > tbench result has no improvement.
> 
> Can you remind me where we stand on tbench?
I run tbench by starting CPU_NUM*2 tbench clients without cpu binding.
Compared with 2.6.31, tbench has about a 6% regression with 2.6.32-rc1 on Nehalem.
Mostly, it's caused by SD_PREFER_LOCAL, and Peter has already disabled the flag for
MC and cpu domains. Your patch disables it for the node domain.
With the current -tip kernel, tbench has about a 3% regression on one Nehalem, and
less than 1% on another Nehalem.

With a pure 2.6.32-rc6 kernel, the tbench result has about a 3~6% regression on
Nehalem, compared with 2.6.32-rc5's. So some patches in -tip haven't been merged
upstream.

> 
> > > your UDP-U-1k args so I can try it? 
> > #taskset -c 0 ./netserver
> > #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096
> > 
> > Pls. check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
> > same physical cpu.
> 
> Thanks. My little box doesn't have a 15 (darn) so 0,3 will have to do.
Sorry. I copied it from the output of "ps -ef", so a couple of ',' were lost. The right netperf command
line is:
#taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096


> 
> > I also run sysbench(oltp)+mysql testing with thread number 14,16,18,20,32,64,128. The average
> > number is good. If I compare every single result against 2.6.32-rc5's, I find thread number
> > 14,16,18,20,32's result are better than 2.6.32-rc5's, but 64,128's result are worse. 128's is
> > the worst.
> 
> Hm.  That's disconcerting.  However, that patch isn't going anywhere but
> to the bitwolf anyway (diagnostic).  If 1b9508f regresses, that will be
> a problem.  With diag, my box also regressed at the tail.  Balancing a
> bit seems to help mysql once it starts tripping all over itself, it
> improves the decay curve markedly.  1b9508f does brief bursts of newidle
> balancing when idle time climbs, which translated to a ~6% improvement
> at 256 clients on my little quad.
> 
> 	-Mike
> 



* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-05  7:44             ` Zhang, Yanmin
@ 2009-11-05  8:10               ` Mike Galbraith
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2009-11-05  8:10 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Ingo Molnar, alex.shi, linux-kernel, Peter Zijlstra

On Thu, 2009-11-05 at 15:44 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> > On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> > 
> > > > Can you try the below, and send me 
> > > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > > result is good and regression disappears.
> > 
> > Excellent.  Ingo has picked up a version in tip (1b9508f) which has zero
> > negative effect on my x264 testcase, and is a win for mysql+oltp through
> > the whole test spectrum.  As that may (dunno, Ingo?) now be considered a
> > regression fix, ie candidate for 32.final, testing that it does no harm
> > to your big machines would be a good thing.  (pretty please?:)
> I tested the latest tips kernel which includes commit 1b9508f.
> Comparing with 2.6.31, netperf loopback UDP-U-4k has about 2% regression.

Ok, thanks for testing.  That could well be a1f84a3, that needs a bit of
fiddling.

> sysbench(oltp)+mysql result is pretty good, about 2% improvement than
> 2.6.31's.

Cool, a progression for a change :)
 
> > > tbench result has no improvement.
> > 
> > Can you remind me where we stand on tbench?
> I run tbench by starting CPU_NUM*2 tbench clients without cpu binding.
> Comparing with 2.6.31, tbench has about 6% regression with 2.6.32-rc1 on Nehalem.
> Mostly, it's caused by SD_PREFER_LOCAL and Peter already disables the flag for
> MC and cpu domains. Your patch disables it for node domain.
> With the current tips kernel, tbench has about 3% regression on 1 Nehalem, and
> less than 1% on another Nehalem.

Ok, we're not looking too bad, but still something there to go after.

> With pure 2.6.32-rc6 kernel, tbench result has about 3~6% regression on Nehalem
> , comparing with 2.6.32-rc5's. So some patches in tips haven't been merged into
> upstream.
 
> > > > your UDP-U-1k args so I can try it? 
> > > #taskset -c 0 ./netserver
> > > #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096
> > > 
> > > Pls. check /proc/cpuinfo to make sure cpu 0 and cpu 15 are not in the
> > > same physical cpu.
> > 
> > Thanks. My little box doesn't have a 15 (darn) so 0,3 will have to do.
> Sorry. I copy it from the output of "ps -ef", so a couple of ',' are lost. The right netperf command
> line is:
> #taskset -c 15 ./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 -- -P 12384,12888 -s 32768 -S 32768 -m 4096

Thanks.  (-i and -I have always given me trouble on my little boxen. I
usually just let it do its thing without them, and repeat a lot;)

	-Mike



* Re: UDP-U stream performance regression on 32-rc1 kernel
  2009-11-05  7:03             ` Mike Galbraith
@ 2009-11-05  8:57               ` Mike Galbraith
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2009-11-05  8:57 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Ingo Molnar, alex.shi, linux-kernel, Peter Zijlstra

On Thu, 2009-11-05 at 08:03 +0100, Mike Galbraith wrote:
> On Thu, 2009-11-05 at 06:20 +0100, Mike Galbraith wrote:
> > On Thu, 2009-11-05 at 10:20 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-11-04 at 13:07 +0100, Mike Galbraith wrote:
> > 
> > > > Can you try the below, and send me 
> > > I tested it on Nehalem machine against the latest tips kernel. netperf loopback
> > > result is good and regression disappears.
> > 
> > Excellent.  Ingo has picked up a version in tip (1b9508f) which has zero
> > negative effect on my x264 testcase, and is a win for mysql+oltp through
> > the whole test spectrum.  As that may (dunno, Ingo?) now be considered a
> > regression fix, ie candidate for 32.final, testing that it does no harm
> > to your big machines would be a good thing.  (pretty please?:)
> 
> Egad. Size XXL difference on my cheap Q6600 box

Ingo, ignore my "eek!" reaction (for now).

I'm trying to test the pull lineup with as many benchmarks as I can fit
in, but methinks I'm screwing this one (unfamiliar) up :-/

> git v2.6.32-rc6-26-g91d3f9b
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
>  65536    4096   60.00     7793073      0    4256.06
>  65536           60.00     7780487           4249.18
> 
> git v2.6.32-rc6-26-g91d3f9b + 1b9508f
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
>  65536    4096   60.00     15133547      0    8264.93
>  65536           60.00     15131466           8263.80
> 
> tip v2.6.32-rc6-1796-gd995f1d
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
> 
>  65536    4096   60.00     13998562      0    7645.08
>  65536           60.00     13986112           7638.28 (uhoh, tinker time.)
> 
> 	-Mike


