linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] sched: Pass affine target cpu into wake_affine
@ 2010-01-04  9:03 Lin Ming
  2010-01-04  9:25 ` Peter Zijlstra
  2010-01-05  2:48 ` Lin Ming
  0 siblings, 2 replies; 16+ messages in thread
From: Lin Ming @ 2010-01-04  9:03 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra; +Cc: lkml, Zhang, Yanmin

commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
Author: Lin Ming <ming.m.lin@intel.com>
Date:   Mon Jan 4 14:14:50 2010 +0800

    sched: Pass affine target cpu into wake_affine
    
    Since commit a1f84a3 (sched: Check for an idle shared cache in select_task_rq_fair()),
    the affine target may be adjusted to any idle cpu in the cache-sharing domains
    instead of the current cpu.
    But wake_affine still uses the current cpu to calculate load, which is wrong.

    This patch passes the affine cpu into wake_affine.
    
    Signed-off-by: Lin Ming <ming.m.lin@intel.com>
---
 kernel/sched_fair.c |   29 ++++++++++++++---------------
 1 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..0a6fa39 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1237,11 +1237,11 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
 
 #endif
 
-static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
+static int wake_affine(struct sched_domain *sd, struct task_struct *p, int affine_cpu, int sync)
 {
 	struct task_struct *curr = current;
-	unsigned long this_load, load;
-	int idx, this_cpu, prev_cpu;
+	unsigned long affine_load, load;
+	int idx, prev_cpu;
 	unsigned long tl_per_task;
 	unsigned int imbalance;
 	struct task_group *tg;
@@ -1249,10 +1249,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	int balanced;
 
 	idx	  = sd->wake_idx;
-	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu, idx);
-	this_load = target_load(this_cpu, idx);
+	affine_load = target_load(affine_cpu, idx);
 
 	if (sync) {
 	       if (sched_feat(SYNC_LESS) &&
@@ -1275,7 +1274,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		tg = task_group(current);
 		weight = current->se.load.weight;
 
-		this_load += effective_load(tg, this_cpu, -weight, -weight);
+		affine_load += effective_load(tg, affine_cpu, -weight, -weight);
 		load += effective_load(tg, prev_cpu, 0, -weight);
 	}
 
@@ -1285,16 +1284,16 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
 
 	/*
-	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
-	 * due to the sync cause above having dropped this_load to 0, we'll
+	 * In low-load situations, where prev_cpu is idle and affine_cpu is idle
+	 * due to the sync cause above having dropped affine_load to 0, we'll
 	 * always have an imbalance, but there's really nothing you can do
 	 * about that, so that's good too.
 	 *
 	 * Otherwise check if either cpus are near enough in load to allow this
-	 * task to be woken on this_cpu.
+	 * task to be woken on affine_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
+	balanced = !affine_load ||
+		100*(affine_load + effective_load(tg, affine_cpu, weight, weight)) <=
 		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
 
 	/*
@@ -1306,11 +1305,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		return 1;
 
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
+	tl_per_task = cpu_avg_load_per_task(affine_cpu);
 
 	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+	    (affine_load <= load &&
+	     affine_load + target_load(prev_cpu, idx) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -1544,7 +1543,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			update_shares(tmp);
 	}
 
-	if (affine_sd && wake_affine(affine_sd, p, sync))
+	if (affine_sd && wake_affine(affine_sd, p, cpu, sync))
 		return cpu;
 
 	while (sd) {
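
To make the mismatch concrete, here is a small standalone model (not kernel code; helper names, loads and the imbalance_pct value are invented) of wake_affine()'s "balanced" test with the effective_load()/group-scheduling terms dropped.  It shows how judging the idle sibling that select_idle_sibling() picked, instead of the busy waking cpu, can flip the decision:

/*
 * Standalone model of wake_affine()'s "balanced" check, group-scheduling
 * terms dropped.  Not kernel code; all values below are invented.
 */
#include <stdio.h>

static int balanced(unsigned long target_load, unsigned long prev_load,
		    unsigned int imbalance_pct)
{
	/* mirrors: imbalance = 100 + (sd->imbalance_pct - 100) / 2 */
	unsigned int imbalance = 100 + (imbalance_pct - 100) / 2;

	/* mirrors: !this_load || 100*this_load <= imbalance*load */
	return !target_load || 100 * target_load <= imbalance * prev_load;
}

int main(void)
{
	unsigned long prev_load = 2048;		/* task's previous cpu     */
	unsigned long waker_load = 3072;	/* waking cpu, fairly busy */
	unsigned long sibling_load = 0;		/* idle cache sibling      */

	/* judging the waking cpu (pre-patch behaviour) says "don't pull" */
	printf("evaluate waking cpu:   %d\n", balanced(waker_load, prev_load, 125));
	/* judging the cpu the task would actually land on says "pull"   */
	printf("evaluate idle sibling: %d\n", balanced(sibling_load, prev_load, 125));
	return 0;
}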



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-04  9:25 ` Peter Zijlstra
@ 2010-01-04  9:12   ` Lin Ming
  2010-01-04  9:32     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Lin Ming @ 2010-01-04  9:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, lkml, Zhang, Yanmin

On Mon, 2010-01-04 at 17:25 +0800, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > Author: Lin Ming <ming.m.lin@intel.com>
> > Date:   Mon Jan 4 14:14:50 2010 +0800
> > 
> >     sched: Pass affine target cpu into wake_affine
> >     
> >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> >     instead of current cpu.
> >     But wake_affine still use current cpu to calculate load which is wrong.
> >     
> >     This patch passes affine cpu into wake_affine.
> >     
> 
> Does this at all help with that regression?

No.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-04  9:03 [RFC PATCH] sched: Pass affine target cpu into wake_affine Lin Ming
@ 2010-01-04  9:25 ` Peter Zijlstra
  2010-01-04  9:12   ` Lin Ming
  2010-01-05  2:48 ` Lin Ming
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2010-01-04  9:25 UTC (permalink / raw)
  To: Lin Ming; +Cc: Mike Galbraith, lkml, Zhang, Yanmin

On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> Author: Lin Ming <ming.m.lin@intel.com>
> Date:   Mon Jan 4 14:14:50 2010 +0800
> 
>     sched: Pass affine target cpu into wake_affine
>     
>     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
>     the affine target maybe adjusted to any idle cpu in cache sharing domains
>     instead of current cpu.
>     But wake_affine still use current cpu to calculate load which is wrong.
>     
>     This patch passes affine cpu into wake_affine.
>     

Does this at all help with that regression?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-04  9:12   ` Lin Ming
@ 2010-01-04  9:32     ` Peter Zijlstra
  2010-01-04 10:59       ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2010-01-04  9:32 UTC (permalink / raw)
  To: Lin Ming; +Cc: Mike Galbraith, lkml, Zhang, Yanmin

On Mon, 2010-01-04 at 17:12 +0800, Lin Ming wrote:
> On Mon, 2010-01-04 at 17:25 +0800, Peter Zijlstra wrote:
> > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > Author: Lin Ming <ming.m.lin@intel.com>
> > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > 
> > >     sched: Pass affine target cpu into wake_affine
> > >     
> > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > >     instead of current cpu.
> > >     But wake_affine still use current cpu to calculate load which is wrong.
> > >     
> > >     This patch passes affine cpu into wake_affine.
> > >     
> > 
> > Does this at all help with that regression?
> 
> No.

crap :/

The change does look sensible though.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-04  9:32     ` Peter Zijlstra
@ 2010-01-04 10:59       ` Mike Galbraith
  2010-01-04 11:07         ` Lin Ming
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2010-01-04 10:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Lin Ming, lkml, Zhang, Yanmin

On Mon, 2010-01-04 at 10:32 +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 17:12 +0800, Lin Ming wrote:
> > On Mon, 2010-01-04 at 17:25 +0800, Peter Zijlstra wrote:
> > > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > > Author: Lin Ming <ming.m.lin@intel.com>
> > > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > > 
> > > >     sched: Pass affine target cpu into wake_affine
> > > >     
> > > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > > >     instead of current cpu.
> > > >     But wake_affine still use current cpu to calculate load which is wrong.
> > > >     
> > > >     This patch passes affine cpu into wake_affine.
> > > >     
> > > 
> > > Does this at all help with that regression?
> > 
> > No.
> 
> crap :/
> 
> The change does look sensible though.

I piddled with all kinds of ways to get around calling wake_affine()
entirely, and/or calling it with the affine candidate, to no avail.  The best
result was always to do the silly-looking thing, namely test the current
cpu for the wake-affine decision, but slip in the shared-cache cpu.

I bet the below helps, though there will still be cache misses, so there
will still be pain for extreme switchers.  Question is whether the
ramp-up gain is worth it.  I think yes, since it's up to 100%.  Would be
most excellent to find a way to know in advance when the cost will be
too high, and then not go there.  Same applies for doing the affinity
decision every time for extreme switchers.  It's expensive for those,
especially so when they're pinned, but pays in the general case.

Anyway...

PREFER_SIBLING is set at the CPU domain level if you don't have power
saving set, so you get to eat cache misses for each cpu, whether it's
sharing a cache or not as you traverse.  Lots of CPUs, LOTS of pain.

not-signed-off

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 57e6357..5b81156 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
+				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..8fe7ee8 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1508,7 +1508,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			 * If there's an idle sibling in this domain, make that
 			 * the wake_affine target instead of the current cpu.
 			 */
-			if (tmp->flags & SD_PREFER_SIBLING)
+			if (tmp->flags & SD_SHARE_PKG_RESOURCES)
 				target = select_idle_sibling(p, tmp, target);
 
 			if (target >= 0) {
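
As a rough, self-contained illustration of what the flag swap buys on a big box (the level names, spans and flag placement below only mirror the description above and are not read from a real machine): SD_SHARE_PKG_RESOURCES marks the levels whose cpus actually share a cache, while SD_PREFER_SIBLING also sits on the all-socket CPU level, so gating the idle-sibling scan on the former bounds the walk to the cache domain:

/*
 * Toy model of how far the idle-sibling scan can reach depending on the
 * gating flag.  The topology is hypothetical; this is not kernel code.
 */
#include <stdio.h>

#define SD_SHARE_PKG_RESOURCES	0x1
#define SD_PREFER_SIBLING	0x2

struct level {
	const char	*name;
	int		span;		/* cpus covered at this level */
	unsigned int	flags;
};

int main(void)
{
	struct level levels[] = {
		{ "SIBLING (SMT)",	 2, SD_SHARE_PKG_RESOURCES },
		{ "MC (package)",	 8, SD_SHARE_PKG_RESOURCES },
		{ "CPU (all sockets)",	64, SD_PREFER_SIBLING },
	};
	unsigned int gates[] = { SD_PREFER_SIBLING, SD_SHARE_PKG_RESOURCES };
	const char *names[]  = { "SD_PREFER_SIBLING", "SD_SHARE_PKG_RESOURCES" };
	int g, i;

	for (g = 0; g < 2; g++) {
		int reach = 0;

		/* walk bottom-up, as select_task_rq_fair() walks domains,
		 * and note the widest level the scan may cover */
		for (i = 0; i < 3; i++)
			if (levels[i].flags & gates[g])
				reach = levels[i].span;
		printf("gate on %-22s -> scan up to %2d cpus\n", names[g], reach);
	}
	return 0;
}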



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-04 10:59       ` Mike Galbraith
@ 2010-01-04 11:07         ` Lin Ming
  0 siblings, 0 replies; 16+ messages in thread
From: Lin Ming @ 2010-01-04 11:07 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Mon, 2010-01-04 at 18:59 +0800, Mike Galbraith wrote:
> On Mon, 2010-01-04 at 10:32 +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-04 at 17:12 +0800, Lin Ming wrote:
> > > On Mon, 2010-01-04 at 17:25 +0800, Peter Zijlstra wrote:
> > > > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > > > Author: Lin Ming <ming.m.lin@intel.com>
> > > > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > > > 
> > > > >     sched: Pass affine target cpu into wake_affine
> > > > >     
> > > > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > > > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > > > >     instead of current cpu.
> > > > >     But wake_affine still use current cpu to calculate load which is wrong.
> > > > >     
> > > > >     This patch passes affine cpu into wake_affine.
> > > > >     
> > > > 
> > > > Does this at all help with that regression?
> > > 
> > > No.
> > 
> > crap :/
> > 
> > The change does look sensible though.
> 
> I piddled with all kinds of ways to get around calling wake_affine()
> entirely, and/or calling it with the affine candidate to no avail.  Best
> result was always to do the silly looking thing, namely test the current
> cpu for wake affine decision, but slip in the shared cache cpu.
> 
> I bet the below helps, though there will still be cache misses, so there
> will still be pain for extreme switchers.  Question is whether the
> ramp-up gain is worth it.  I think yes, since it's up to 100%.  Would be
> most excellent to find a way to know in advance when the cost will be
> too high, and then not go there.  Same applies for doing the affinity
> decision every time for extreme switchers.  It's expensive for those,
> especially so when they're pinned, but pays in the general case.
> 
> Anyway...
> 
> PREFER_SIBLING is set at the CPU domain level if you don't have power
> saving set, so you get to eat cache misses for each cpu, whether it's
> sharing a cache or not as you traverse.  Lots of CPUs, LOTS of pain.
> 
> not-signed-off

Nice.
I did a quick test and it does fix the regression.
And actually, it achieves a +30% improvement for the quick test with this
patch applied to 2.6.33-rc2, compared with 2.6.32.

I'll do more tests to confirm the improvement.

Thanks,
Lin Ming

> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 57e6357..5b81156 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
>  				| 1*SD_WAKE_AFFINE			\
>  				| 1*SD_SHARE_CPUPOWER			\
>  				| 0*SD_POWERSAVINGS_BALANCE		\
> -				| 0*SD_SHARE_PKG_RESOURCES		\
> +				| 1*SD_SHARE_PKG_RESOURCES		\
>  				| 0*SD_SERIALIZE			\
>  				| 0*SD_PREFER_SIBLING			\
>  				,					\
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 42ac3c9..8fe7ee8 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1508,7 +1508,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
>  			 * If there's an idle sibling in this domain, make that
>  			 * the wake_affine target instead of the current cpu.
>  			 */
> -			if (tmp->flags & SD_PREFER_SIBLING)
> +			if (tmp->flags & SD_SHARE_PKG_RESOURCES)
>  				target = select_idle_sibling(p, tmp, target);
>  
>  			if (target >= 0) {
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-04  9:03 [RFC PATCH] sched: Pass affine target cpu into wake_affine Lin Ming
  2010-01-04  9:25 ` Peter Zijlstra
@ 2010-01-05  2:48 ` Lin Ming
  2010-01-05  3:44   ` Mike Galbraith
  1 sibling, 1 reply; 16+ messages in thread
From: Lin Ming @ 2010-01-05  2:48 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> Author: Lin Ming <ming.m.lin@intel.com>
> Date:   Mon Jan 4 14:14:50 2010 +0800
> 
>     sched: Pass affine target cpu into wake_affine
>     
>     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
>     the affine target maybe adjusted to any idle cpu in cache sharing domains
>     instead of current cpu.
>     But wake_affine still use current cpu to calculate load which is wrong.
>     
>     This patch passes affine cpu into wake_affine.
>     
>     Signed-off-by: Lin Ming <ming.m.lin@intel.com>

Mike,

Any comment of this patch?

Thanks,
Lin Ming

> ---
>  kernel/sched_fair.c |   29 ++++++++++++++---------------
>  1 files changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 42ac3c9..0a6fa39 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1237,11 +1237,11 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>  
>  #endif
>  
> -static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
> +static int wake_affine(struct sched_domain *sd, struct task_struct *p, int affine_cpu, int sync)
>  {
>  	struct task_struct *curr = current;
> -	unsigned long this_load, load;
> -	int idx, this_cpu, prev_cpu;
> +	unsigned long affine_load, load;
> +	int idx, prev_cpu;
>  	unsigned long tl_per_task;
>  	unsigned int imbalance;
>  	struct task_group *tg;
> @@ -1249,10 +1249,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	int balanced;
>  
>  	idx	  = sd->wake_idx;
> -	this_cpu  = smp_processor_id();
>  	prev_cpu  = task_cpu(p);
>  	load	  = source_load(prev_cpu, idx);
> -	this_load = target_load(this_cpu, idx);
> +	affine_load = target_load(affine_cpu, idx);
>  
>  	if (sync) {
>  	       if (sched_feat(SYNC_LESS) &&
> @@ -1275,7 +1274,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  		tg = task_group(current);
>  		weight = current->se.load.weight;
>  
> -		this_load += effective_load(tg, this_cpu, -weight, -weight);
> +		affine_load += effective_load(tg, affine_cpu, -weight, -weight);
>  		load += effective_load(tg, prev_cpu, 0, -weight);
>  	}
>  
> @@ -1285,16 +1284,16 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
>  
>  	/*
> -	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
> -	 * due to the sync cause above having dropped this_load to 0, we'll
> +	 * In low-load situations, where prev_cpu is idle and affine_cpu is idle
> +	 * due to the sync cause above having dropped affine_load to 0, we'll
>  	 * always have an imbalance, but there's really nothing you can do
>  	 * about that, so that's good too.
>  	 *
>  	 * Otherwise check if either cpus are near enough in load to allow this
> -	 * task to be woken on this_cpu.
> +	 * task to be woken on affine_cpu.
>  	 */
> -	balanced = !this_load ||
> -		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
> +	balanced = !affine_load ||
> +		100*(affine_load + effective_load(tg, affine_cpu, weight, weight)) <=
>  		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
>  
>  	/*
> @@ -1306,11 +1305,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  		return 1;
>  
>  	schedstat_inc(p, se.nr_wakeups_affine_attempts);
> -	tl_per_task = cpu_avg_load_per_task(this_cpu);
> +	tl_per_task = cpu_avg_load_per_task(affine_cpu);
>  
>  	if (balanced ||
> -	    (this_load <= load &&
> -	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
> +	    (affine_load <= load &&
> +	     affine_load + target_load(prev_cpu, idx) <= tl_per_task)) {
>  		/*
>  		 * This domain has SD_WAKE_AFFINE and
>  		 * p is cache cold in this domain, and
> @@ -1544,7 +1543,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
>  			update_shares(tmp);
>  	}
>  
> -	if (affine_sd && wake_affine(affine_sd, p, sync))
> +	if (affine_sd && wake_affine(affine_sd, p, cpu, sync))
>  		return cpu;
>  
>  	while (sd) {
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-05  2:48 ` Lin Ming
@ 2010-01-05  3:44   ` Mike Galbraith
  2010-01-05  6:43     ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2010-01-05  3:44 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Tue, 2010-01-05 at 10:48 +0800, Lin Ming wrote:
> On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > Author: Lin Ming <ming.m.lin@intel.com>
> > Date:   Mon Jan 4 14:14:50 2010 +0800
> > 
> >     sched: Pass affine target cpu into wake_affine
> >     
> >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> >     instead of current cpu.
> >     But wake_affine still use current cpu to calculate load which is wrong.
> >     
> >     This patch passes affine cpu into wake_affine.
> >     
> >     Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> 
> Mike,
> 
> Any comment of this patch?

The patch definitely looks like the right thing to do, but when I tried
this, it didn't work out well.  Since I can't seem to recall precise
details, I'll let my box either remind me or give its ack.

	-Mike


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-05  3:44   ` Mike Galbraith
@ 2010-01-05  6:43     ` Mike Galbraith
  2010-01-05 11:49       ` Mike Galbraith
  2010-01-07  8:45       ` Lin Ming
  0 siblings, 2 replies; 16+ messages in thread
From: Mike Galbraith @ 2010-01-05  6:43 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Tue, 2010-01-05 at 04:44 +0100, Mike Galbraith wrote:
> On Tue, 2010-01-05 at 10:48 +0800, Lin Ming wrote:
> > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > Author: Lin Ming <ming.m.lin@intel.com>
> > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > 
> > >     sched: Pass affine target cpu into wake_affine
> > >     
> > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > >     instead of current cpu.
> > >     But wake_affine still use current cpu to calculate load which is wrong.
> > >     
> > >     This patch passes affine cpu into wake_affine.
> > >     
> > >     Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> > 
> > Mike,
> > 
> > Any comment of this patch?
> 
> The patch definitely looks like the right thing to do, but when I tried
> this, it didn't work out well.  Since I can't seem to recall precise
> details, I'll let my box either remind me or give it's ack.

Unfortunately, box reminded me.  mysql+oltp peak throughput with
nr_clients == nr_cpus

tip   37012.34
tip+  33025.83
          .892

We really only want to check for shared cache on ramp-up and/or longish
intermission.  Once there's enough work to go around, interleaving is a
big problem for these synchronous tasks.  Doing the silly thing gets us
the ramp-up gain without too much pain, though there is definitely pain
for very fast switchers.

Looking always costs you a cache miss; not looking costs you throughput
on ramp/intermission.  Damned if you do, damned if you don't.

	-Mike


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-05  6:43     ` Mike Galbraith
@ 2010-01-05 11:49       ` Mike Galbraith
  2010-01-07  8:45       ` Lin Ming
  1 sibling, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2010-01-05 11:49 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Tue, 2010-01-05 at 07:43 +0100, Mike Galbraith wrote:
> On Tue, 2010-01-05 at 04:44 +0100, Mike Galbraith wrote:
> > On Tue, 2010-01-05 at 10:48 +0800, Lin Ming wrote:
> > > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > > Author: Lin Ming <ming.m.lin@intel.com>
> > > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > > 
> > > >     sched: Pass affine target cpu into wake_affine
> > > >     
> > > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > > >     instead of current cpu.
> > > >     But wake_affine still use current cpu to calculate load which is wrong.
> > > >     
> > > >     This patch passes affine cpu into wake_affine.
> > > >     
> > > >     Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> > > 
> > > Mike,
> > > 
> > > Any comment of this patch?
> > 
> > The patch definitely looks like the right thing to do, but when I tried
> > this, it didn't work out well.  Since I can't seem to recall precise
> > details, I'll let my box either remind me or give it's ack.
> 
> Unfortunately, box reminded me.  mysql+oltp peak throughput with
> nr_clients == nr_cpus
> 
> tip   37012.34
> tip+  33025.83
>           .892
> 
> We really only want to check for shared cache on ramp-up and/or longish
> intermission.  Once there's enough work to go around, interleaving is a
> big problem for these synchronous tasks.  Doing the silly thing gets us
> the ramp-up gain without too much pain, though there is definitely pain
> for very fast switchers.
> 
> Looking always costs you a cache miss, not looking costs you throughput
> on ramp/intermission.  Damned if you do, damned if you don't.

FWIW, I'm almost tempted to submit the sched_fair.c bit of the below
even though it costs almost 2% of mysql+oltp peak.  Notice that the TCP
numbers went from erratic to stable in the second series of three .33
runs (sched_fair.c bits added), along with other microbench improvements.

These bits also gave tbench a little boost.  They cut wakeup overhead a bit,
which everything appreciates, but still deliver an instant affine cpu when
it counts the most.

Reference numbers are virgin 2.6.31.9.

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
marge     Linux 2.6.31. 2853 2923 1132 2829.3 4761.9 1235.0 1234.4 4472 1683.
marge     Linux 2.6.31. 2839 2921 1141 2846.5 4779.8 1242.5 1235.9 4455 1684.
marge     Linux 2.6.31. 2838 2935 751. 2838.5 4820.0 1243.6 1235.0 4472 1684.

marge     Linux 2.6.33- 3070 5167 2936 2819.3 4772.9 1231.7 1228.2 4381 1681.
marge     Linux 2.6.33- 3033 5047 2013 2803.0 4745.5 1355.3 1236.5 4461 1665.
marge     Linux 2.6.33- 3061 5176 1145 2800.9 4737.6 1237.6 1233.1 4404 1685.

marge     Linux 2.6.33- 3084 5173 2917 2813.7 4788.5 1340.8 1349.0 4460 1760.
marge     Linux 2.6.33- 3079 5152 2928 2839.2 4795.6 1328.6 1316.7 4438 1752.
marge     Linux 2.6.33- 3082 5173 2924 2808.1 4811.4 1348.6 1326.0 4479 1772.

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 57e6357..5b81156 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
+				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched.c b/kernel/sched.c
index 22c14eb..427ebf3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2380,10 +2380,11 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 
 	smp_wmb();
 	rq = orig_rq = task_rq_lock(p, &flags);
-	update_rq_clock(rq);
 	if (!(p->state & state))
 		goto out;
 
+	update_rq_clock(rq);
+
 	if (p->se.on_rq)
 		goto out_running;
 
@@ -2414,7 +2415,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state,
 		set_task_cpu(p, cpu);
 
 	rq = __task_rq_lock(p);
-	update_rq_clock(rq);
+
+	if (cpu != orig_cpu)
+		update_rq_clock(rq);
 
 	WARN_ON(p->state != TASK_WAKING);
 	cpu = task_cpu(p);
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..20f58ec 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1453,11 +1453,14 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 	int want_affine = 0;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
+	int ramp;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (sched_feat(AFFINE_WAKEUPS) &&
-		    cpumask_test_cpu(cpu, &p->cpus_allowed))
+		    cpumask_test_cpu(cpu, &p->cpus_allowed)) {
 			want_affine = 1;
+			ramp = this_rq()->nr_running == 1;
+		}
 		new_cpu = prev_cpu;
 	}
 
@@ -1508,8 +1511,11 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			 * If there's an idle sibling in this domain, make that
 			 * the wake_affine target instead of the current cpu.
 			 */
-			if (tmp->flags & SD_PREFER_SIBLING)
+			if (ramp && tmp->flags & SD_SHARE_PKG_RESOURCES) {
 				target = select_idle_sibling(p, tmp, target);
+				if (target >= 0)
+					ramp++;
+			}
 
 			if (target >= 0) {
 				if (tmp->flags & SD_WAKE_AFFINE) {
@@ -1544,7 +1550,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			update_shares(tmp);
 	}
 
-	if (affine_sd && wake_affine(affine_sd, p, sync))
+	if (affine_sd && (ramp > 1 || wake_affine(affine_sd, p, sync)))
 		return cpu;
 
 	while (sd) {
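
Condensed into a standalone sketch, the decision flow the sched_fair.c hunks add looks roughly like the function below; the helpers are placeholders for this_rq()->nr_running, select_idle_sibling() and wake_affine(), and the fallback to prev_cpu stands in for the full domain-balancing path.  The idle-sibling scan, and the skip of wake_affine() when it finds one, only happen while the waker's runqueue holds a single task, i.e. during ramp-up or after an intermission.  The kernel/sched.c hunk is independent of that: it defers update_rq_clock() until after the state check and skips the second update when the task stayed on the same cpu.

/*
 * Standalone sketch of the "ramp" gating above.  The helpers are
 * placeholders with hard-wired return values, not kernel interfaces.
 */
#include <stdbool.h>
#include <stdio.h>

static int waker_rq_nr_running(void)        { return 1; }	/* waker alone */
static int find_idle_cache_sibling(int cpu) { return 3; }	/* -1 if none  */
static bool wake_affine_says_pull(int cpu)  { return false; }

static int pick_wake_cpu(int waking_cpu, int prev_cpu)
{
	/* only look for an idle cache sibling on ramp-up/intermission */
	bool ramp = waker_rq_nr_running() == 1;

	if (ramp) {
		int sibling = find_idle_cache_sibling(waking_cpu);

		/* a found sibling short-circuits the wake_affine() check */
		if (sibling >= 0)
			return sibling;
	}

	/* otherwise the usual affine-vs-previous load comparison decides */
	return wake_affine_says_pull(waking_cpu) ? waking_cpu : prev_cpu;
}

int main(void)
{
	printf("wake target: cpu%d\n", pick_wake_cpu(0, 5));
	return 0;
}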



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-05  6:43     ` Mike Galbraith
  2010-01-05 11:49       ` Mike Galbraith
@ 2010-01-07  8:45       ` Lin Ming
  2010-01-07  9:15         ` Peter Zijlstra
                           ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Lin Ming @ 2010-01-07  8:45 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Tue, 2010-01-05 at 14:43 +0800, Mike Galbraith wrote:
> On Tue, 2010-01-05 at 04:44 +0100, Mike Galbraith wrote:
> > On Tue, 2010-01-05 at 10:48 +0800, Lin Ming wrote:
> > > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > > Author: Lin Ming <ming.m.lin@intel.com>
> > > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > > 
> > > >     sched: Pass affine target cpu into wake_affine
> > > >     
> > > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > > >     instead of current cpu.
> > > >     But wake_affine still use current cpu to calculate load which is wrong.
> > > >     
> > > >     This patch passes affine cpu into wake_affine.
> > > >     
> > > >     Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> > > 
> > > Mike,
> > > 
> > > Any comment of this patch?
> > 
> > The patch definitely looks like the right thing to do, but when I tried
> > this, it didn't work out well.  Since I can't seem to recall precise
> > details, I'll let my box either remind me or give it's ack.
> 
> Unfortunately, box reminded me.  mysql+oltp peak throughput with
> nr_clients == nr_cpus

Did you test with your vmark regression fix patch also applied?

I tested on the 2 machines below with both patches applied and the
oltp (sysbench+mysql) data looks good.
Tigerton x86_64 machine: 16cpus(4P/4Cores), 40G mem
IA64 machine: 32cpus(4P/4Cores/HT), 16G mem

Compared with upstream 2.6.33-rc2, IA64 improves ~15% and Tigerton
improves ~3%.

The 2 patches are merged as below,

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 57e6357..5b81156 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
+				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..cbf4bd2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1237,11 +1237,11 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
 
 #endif
 
-static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
+static int wake_affine(struct sched_domain *sd, struct task_struct *p, int affine_cpu, int sync)
 {
 	struct task_struct *curr = current;
-	unsigned long this_load, load;
-	int idx, this_cpu, prev_cpu;
+	unsigned long affine_load, load;
+	int idx, prev_cpu;
 	unsigned long tl_per_task;
 	unsigned int imbalance;
 	struct task_group *tg;
@@ -1249,10 +1249,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	int balanced;
 
 	idx	  = sd->wake_idx;
-	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu, idx);
-	this_load = target_load(this_cpu, idx);
+	affine_load = target_load(affine_cpu, idx);
 
 	if (sync) {
 	       if (sched_feat(SYNC_LESS) &&
@@ -1275,7 +1274,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		tg = task_group(current);
 		weight = current->se.load.weight;
 
-		this_load += effective_load(tg, this_cpu, -weight, -weight);
+		affine_load += effective_load(tg, affine_cpu, -weight, -weight);
 		load += effective_load(tg, prev_cpu, 0, -weight);
 	}
 
@@ -1285,16 +1284,16 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
 
 	/*
-	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
-	 * due to the sync cause above having dropped this_load to 0, we'll
+	 * In low-load situations, where prev_cpu is idle and affine_cpu is idle
+	 * due to the sync cause above having dropped affine_load to 0, we'll
 	 * always have an imbalance, but there's really nothing you can do
 	 * about that, so that's good too.
 	 *
 	 * Otherwise check if either cpus are near enough in load to allow this
-	 * task to be woken on this_cpu.
+	 * task to be woken on affine_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
+	balanced = !affine_load ||
+		100*(affine_load + effective_load(tg, affine_cpu, weight, weight)) <=
 		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
 
 	/*
@@ -1306,11 +1305,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		return 1;
 
 	schedstat_inc(p, se.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
+	tl_per_task = cpu_avg_load_per_task(affine_cpu);
 
 	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
+	    (affine_load <= load &&
+	     affine_load + target_load(prev_cpu, idx) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -1508,7 +1507,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			 * If there's an idle sibling in this domain, make that
 			 * the wake_affine target instead of the current cpu.
 			 */
-			if (tmp->flags & SD_PREFER_SIBLING)
+			if (tmp->flags & SD_SHARE_PKG_RESOURCES)
 				target = select_idle_sibling(p, tmp, target);
 
 			if (target >= 0) {
@@ -1544,7 +1543,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			update_shares(tmp);
 	}
 
-	if (affine_sd && wake_affine(affine_sd, p, sync))
+	if (affine_sd && wake_affine(affine_sd, p, cpu, sync))
 		return cpu;
 
 	while (sd) {




^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-07  8:45       ` Lin Ming
@ 2010-01-07  9:15         ` Peter Zijlstra
  2010-01-07  9:33         ` Mike Galbraith
  2010-01-07 13:14         ` Mike Galbraith
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2010-01-07  9:15 UTC (permalink / raw)
  To: Lin Ming; +Cc: Mike Galbraith, lkml, Zhang, Yanmin

On Thu, 2010-01-07 at 16:45 +0800, Lin Ming wrote:

> I tested on below 2 machines with the 2 patches both applied and the
> oltp(sysbench+mysql) data shows good.
> Tigerton x86_64 machine: 16cpus(4P/4Cores), 40G mem
> IA64 machine: 32cpus(4P/4Cores/HT), 16G mem
> 
> Compared with upstream 2.6.33-rc2, IA64 improves ~15% and Tigerton
> improves ~3%.

I rather like your patch, so I'll queue both up.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-07  8:45       ` Lin Ming
  2010-01-07  9:15         ` Peter Zijlstra
@ 2010-01-07  9:33         ` Mike Galbraith
  2010-01-07 13:14         ` Mike Galbraith
  2 siblings, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2010-01-07  9:33 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Thu, 2010-01-07 at 16:45 +0800, Lin Ming wrote:
> On Tue, 2010-01-05 at 14:43 +0800, Mike Galbraith wrote:
> > On Tue, 2010-01-05 at 04:44 +0100, Mike Galbraith wrote:
> > > On Tue, 2010-01-05 at 10:48 +0800, Lin Ming wrote:
> > > > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > > > Author: Lin Ming <ming.m.lin@intel.com>
> > > > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > > > 
> > > > >     sched: Pass affine target cpu into wake_affine
> > > > >     
> > > > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > > > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > > > >     instead of current cpu.
> > > > >     But wake_affine still use current cpu to calculate load which is wrong.
> > > > >     
> > > > >     This patch passes affine cpu into wake_affine.
> > > > >     
> > > > >     Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> > > > 
> > > > Mike,
> > > > 
> > > > Any comment of this patch?
> > > 
> > > The patch definitely looks like the right thing to do, but when I tried
> > > this, it didn't work out well.  Since I can't seem to recall precise
> > > details, I'll let my box either remind me or give it's ack.
> > 
> > Unfortunately, box reminded me.  mysql+oltp peak throughput with
> > nr_clients == nr_cpus
> 
> Did you test with your vmark regression fix patch also applied?

Yeah.  Delta between tested kernels was your patch.

	-Mike


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-07  8:45       ` Lin Ming
  2010-01-07  9:15         ` Peter Zijlstra
  2010-01-07  9:33         ` Mike Galbraith
@ 2010-01-07 13:14         ` Mike Galbraith
  2010-01-08  2:38           ` Lin Ming
  2 siblings, 1 reply; 16+ messages in thread
From: Mike Galbraith @ 2010-01-07 13:14 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Thu, 2010-01-07 at 16:45 +0800, Lin Ming wrote:
> On Tue, 2010-01-05 at 14:43 +0800, Mike Galbraith wrote:
> > On Tue, 2010-01-05 at 04:44 +0100, Mike Galbraith wrote:
> > > On Tue, 2010-01-05 at 10:48 +0800, Lin Ming wrote:
> > > > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > > > Author: Lin Ming <ming.m.lin@intel.com>
> > > > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > > > 
> > > > >     sched: Pass affine target cpu into wake_affine
> > > > >     
> > > > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > > > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > > > >     instead of current cpu.
> > > > >     But wake_affine still use current cpu to calculate load which is wrong.
> > > > >     
> > > > >     This patch passes affine cpu into wake_affine.
> > > > >     
> > > > >     Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> > > > 
> > > > Mike,
> > > > 
> > > > Any comment of this patch?
> > > 
> > > The patch definitely looks like the right thing to do, but when I tried
> > > this, it didn't work out well.  Since I can't seem to recall precise
> > > details, I'll let my box either remind me or give it's ack.
> > 
> > Unfortunately, box reminded me.  mysql+oltp peak throughput with
> > nr_clients == nr_cpus
> 
> Did you test with your vmark regression fix patch also applied?

Below is a complete retest.  Mind testing my hacklet?  I bet a nickel
it'll work at least as well as yours on your beefy boxen.

Everything is a trade, but I wonder what your patch puts on the table
that mine doesn't.  Mine trades a bit of peak for better ramp just as
yours does (not as much pain, for more gain on my box), but cuts out
overhead when there's a very good chance that sage advice is unneeded,
and when a return on the investment is unlikely.  It also tosses
opportunities that might have worked out with some load, but I'm not
seeing numbers that justify the pain.  The big gain is the ramp.

The tail I don't care much about.  When mysql starts jamming up, tossing
in balancing always extends the tail.  Turn newidle loose, and it'll get
considerably better... at the expense of just about everything else.

(all have vmark regression fix)

tip = v2.6.33-rc3-260-gadd8174
tip+ = pass affine target
tip++ = ramp

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10097.67   19850.62   36652.15   36175.93   35131.83   33968.09   32264.10   28861.89   25264.55
             10360.76   19969.69   37217.48   36679.43   35670.86   34281.49   32575.91   28424.81   24415.42
             10254.75   19732.79   37122.05   36523.65   35500.15   34181.83   32508.23   28182.73   23319.44
tip avg      10237.72   19851.03   36997.22   36459.67   35434.28   34143.80   32449.41   28489.81   24333.13

tip+         10994.71   20056.54   32689.38   36210.83   35372.91   34277.60   32629.49   28264.63   26220.13
             11025.81   20084.65   32709.84   36671.23   35789.21   34602.03   32849.03   29198.10   25902.69
             11002.07   20148.40   32257.20   36627.57   35859.69   35859.69   32871.66   29160.72   26346.97
tip+ avg     11007.53   20096.53   32552.14   36503.21   35673.93   34913.10   32783.39   28874.48   26156.59
vs tip          1.075      1.012       .879      1.001      1.006      1.022      1.010      1.013      1.074


tip++        10841.88   20578.16   36161.14   36330.32   35552.39   34178.19   32181.05   27447.47   25213.32
             11101.92   20912.30   36471.23   36850.12   35749.46   34518.61   32921.50   28669.84   24672.39
             11116.54   20899.96   36553.23   36853.72   35859.07   34572.91   32887.71   28518.16   25535.70
tip++ avg    11020.11   20796.80   36395.20   36678.05   35720.30   34423.23   32663.42   28211.82   25140.47
vs tip          1.076      1.047       .983      1.005      1.008      1.008      1.006       .990      1.033
vs tip+         1.001      1.034      1.118      1.004      1.001       .985       .996       .977       .961
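
(For reference, the "vs" rows are just ratios of the per-client averages; a quick check of one entry, values copied from the table above:)

/* Sanity check of one "vs tip" entry; values copied from the table. */
#include <stdio.h>

int main(void)
{
	double tip_avg_1client  = 10237.72;	/* tip,  1 client */
	double tipp_avg_1client = 11007.53;	/* tip+, 1 client */

	printf("tip+ vs tip @ 1 client: %.3f\n",
	       tipp_avg_1client / tip_avg_1client);	/* prints 1.075 */
	return 0;
}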

(combo pack)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 57e6357..5b81156 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -99,7 +99,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
+				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
 				| 0*SD_PREFER_SIBLING			\
 				,					\
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 42ac3c9..1f9cc7a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1450,14 +1450,16 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
-	int want_affine = 0;
+	int want_affine = 0, ramp = 0;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (sched_feat(AFFINE_WAKEUPS) &&
-		    cpumask_test_cpu(cpu, &p->cpus_allowed))
+		    cpumask_test_cpu(cpu, &p->cpus_allowed)) {
 			want_affine = 1;
+			ramp = this_rq()->nr_running == 1;
+		}
 		new_cpu = prev_cpu;
 	}
 
@@ -1508,8 +1510,11 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			 * If there's an idle sibling in this domain, make that
 			 * the wake_affine target instead of the current cpu.
 			 */
-			if (tmp->flags & SD_PREFER_SIBLING)
+			if (ramp && tmp->flags & SD_SHARE_PKG_RESOURCES) {
 				target = select_idle_sibling(p, tmp, target);
+				if (target >= 0)
+					ramp++;
+			}
 
 			if (target >= 0) {
 				if (tmp->flags & SD_WAKE_AFFINE) {
@@ -1544,7 +1549,7 @@ static int select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flag
 			update_shares(tmp);
 	}
 
-	if (affine_sd && wake_affine(affine_sd, p, sync))
+	if (affine_sd && (ramp > 1 || wake_affine(affine_sd, p, sync)))
 		return cpu;
 
 	while (sd) {



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-07 13:14         ` Mike Galbraith
@ 2010-01-08  2:38           ` Lin Ming
  2010-01-08  3:34             ` Mike Galbraith
  0 siblings, 1 reply; 16+ messages in thread
From: Lin Ming @ 2010-01-08  2:38 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Thu, 2010-01-07 at 21:14 +0800, Mike Galbraith wrote:
> On Thu, 2010-01-07 at 16:45 +0800, Lin Ming wrote:
> > On Tue, 2010-01-05 at 14:43 +0800, Mike Galbraith wrote:
> > > On Tue, 2010-01-05 at 04:44 +0100, Mike Galbraith wrote:
> > > > On Tue, 2010-01-05 at 10:48 +0800, Lin Ming wrote:
> > > > > On Mon, 2010-01-04 at 17:03 +0800, Lin Ming wrote:
> > > > > > commit a03ecf08d7bbdd979d81163ea13d194fe21ad339
> > > > > > Author: Lin Ming <ming.m.lin@intel.com>
> > > > > > Date:   Mon Jan 4 14:14:50 2010 +0800
> > > > > > 
> > > > > >     sched: Pass affine target cpu into wake_affine
> > > > > >     
> > > > > >     Since commit a1f84a3(sched: Check for an idle shared cache in select_task_rq_fair()),
> > > > > >     the affine target maybe adjusted to any idle cpu in cache sharing domains
> > > > > >     instead of current cpu.
> > > > > >     But wake_affine still use current cpu to calculate load which is wrong.
> > > > > >     
> > > > > >     This patch passes affine cpu into wake_affine.
> > > > > >     
> > > > > >     Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> > > > > 
> > > > > Mike,
> > > > > 
> > > > > Any comment of this patch?
> > > > 
> > > > The patch definitely looks like the right thing to do, but when I tried
> > > > this, it didn't work out well.  Since I can't seem to recall precise
> > > > details, I'll let my box either remind me or give it's ack.
> > > 
> > > Unfortunately, box reminded me.  mysql+oltp peak throughput with
> > > nr_clients == nr_cpus
> > 
> > Did you test with your vmark regression fix patch also applied?
> 
> Below is a complete retest.  Mind testing my hacklet?  I bet a nickle
> it'll work at least as well as yours on your beefy boxen.

I tested your hacklet on below 2 machines as before.

Tigerton x86_64 machine: 16cpus(4P/4Cores), 40G mem
IA64 machine: 32cpus(4P/4Cores/HT), 16G mem

Test1: vmark regression fix patch + pass affine target
Test2: this hacklet

Compared with upstream 2.6.33-rc2, 
Test1: Tigerton +3%, IA64 +15%
Test2: Tigerton +3%, IA64 +10%

Test2 also improves on IA64, although not as much as test1.

I also tested tbench; this hacklet does not help there.

Lin Ming


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] sched: Pass affine target cpu into wake_affine
  2010-01-08  2:38           ` Lin Ming
@ 2010-01-08  3:34             ` Mike Galbraith
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Galbraith @ 2010-01-08  3:34 UTC (permalink / raw)
  To: Lin Ming; +Cc: Peter Zijlstra, lkml, Zhang, Yanmin

On Fri, 2010-01-08 at 10:38 +0800, Lin Ming wrote:
> On Thu, 2010-01-07 at 21:14 +0800, Mike Galbraith wrote:

> > Below is a complete retest.  Mind testing my hacklet?  I bet a nickle
> > it'll work at least as well as yours on your beefy boxen.
> 
> I tested your hacklet on below 2 machines as before.
> 
> Tigerton x86_64 machine: 16cpus(4P/4Cores), 40G mem
> IA64 machine: 32cpus(4P/4Cores/HT), 16G mem
> 
> Test1: vmark regression fix patch + pass affine target
> Test2: this hacklet
> 
> Compared with upstream 2.6.33-rc2, 
> Test1: Tigerton +3%, IA64 +15%
> Test2: Tigerton +3%, IA64 +10%
> 
> The test2 also improves on IA64, although not as good as test1.

Dang, I owe you a nickel.  Thanks a bunch for giving it a go.

Interesting result.  It doesn't really make much sense that you see a
peak gain while I see a peak loss.  Radical behavior difference :-/

I think the ramp-only (and harder) approach is safer, but we'll see.

	-Mike


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2010-01-08  3:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-04  9:03 [RFC PATCH] sched: Pass affine target cpu into wake_affine Lin Ming
2010-01-04  9:25 ` Peter Zijlstra
2010-01-04  9:12   ` Lin Ming
2010-01-04  9:32     ` Peter Zijlstra
2010-01-04 10:59       ` Mike Galbraith
2010-01-04 11:07         ` Lin Ming
2010-01-05  2:48 ` Lin Ming
2010-01-05  3:44   ` Mike Galbraith
2010-01-05  6:43     ` Mike Galbraith
2010-01-05 11:49       ` Mike Galbraith
2010-01-07  8:45       ` Lin Ming
2010-01-07  9:15         ` Peter Zijlstra
2010-01-07  9:33         ` Mike Galbraith
2010-01-07 13:14         ` Mike Galbraith
2010-01-08  2:38           ` Lin Ming
2010-01-08  3:34             ` Mike Galbraith
