linux-kernel.vger.kernel.org archive mirror
* sched: Avoid SMT siblings in select_idle_sibling() if possible
@ 2011-11-15  9:46 Peter Zijlstra
  2011-11-16  1:14 ` Suresh Siddha
  2011-11-18 23:40 ` [tip:sched/core] sched: Avoid SMT siblings in select_idle_sibling() if possible tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-15  9:46 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar; +Cc: Paul Turner, Suresh Siddha, Mike Galbraith

Subject: sched: Avoid SMT siblings in select_idle_sibling() if possible
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 10 Nov 2011 13:01:10 +0100

Avoid select_idle_sibling() picking a sibling thread if there's
an idle core that shares cache.
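
As an illustration (the CPU numbering is an assumption for the example,
not taken from the changelog): take a 4-core/8-thread package with SMT
pairs {0,4} {1,5} {2,6} {3,7} all sharing cache, the waker and the
task's previous cpu both on core 0 (so the old code stopped its search
at the SIBLING domain), CPU 0 busy, CPU 4 idle, and core 1 (CPUs 1,5)
fully idle:

        before: the SIBLING scan finds the idle thread       -> cpu 4
        after:  SMT is skipped on the first pass, so the
                fully idle core is preferred                 -> cpu 1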

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-smlz1ah4jwood10f7eml9gzk@git.kernel.org
---
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2326,7 +2326,8 @@ static int select_idle_sibling(struct ta
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
-	int i;
+	struct sched_group *sg;
+	int i, smt = 0;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
+again:
 	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
+			continue;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+			if (!smt) {
+				smt = 1;
+				goto again;
 			}
+			break;
 		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-15  9:46 sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
@ 2011-11-16  1:14 ` Suresh Siddha
  2011-11-16  9:24   ` Mike Galbraith
  2011-11-18 23:40 ` [tip:sched/core] sched: Avoid SMT siblings in select_idle_sibling() if possible tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 89+ messages in thread
From: Suresh Siddha @ 2011-11-16  1:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Paul Turner, Mike Galbraith

On Tue, 2011-11-15 at 01:46 -0800, Peter Zijlstra wrote:
> @@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
>  	 * Otherwise, iterate the domains and find an elegible idle cpu.
>  	 */
>  	rcu_read_lock();
> +again:
>  	for_each_domain(target, sd) {
> -		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
> -			break;
> +		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> +			continue;
>  
> -		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> -			if (idle_cpu(i)) {
> -				target = i;
> -				break;
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> +			if (!smt) {
> +				smt = 1;
> +				goto again;
>  			}
> +			break;
>  		}

It looks like you will be checking the core domain twice (with smt == 0
and smt == 1) if there are no idle siblings.
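
Roughly, on a box where some domain above the core (MC) domain lacks
SD_SHARE_PKG_RESOURCES (the domain names below are only illustrative):

        pass 1 (smt == 0):
                SIBLING: skipped (SD_SHARE_CPUPOWER set)
                MC:      core groups scanned, no fully idle core found
                CPU:     !SD_SHARE_PKG_RESOURCES -> smt = 1, goto again
        pass 2 (smt == 1):
                SIBLING: thread groups scanned, no idle thread found
                MC:      core groups scanned a second time
                CPU:     break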

How about this patch which is more self explanatory?
---

Avoid select_idle_sibling() picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   54 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..cb7a5ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,25 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd)
+		if (!(sd->flags & flag))
+			return sd->child;
+	return NULL;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2268,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2289,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16  1:14 ` Suresh Siddha
@ 2011-11-16  9:24   ` Mike Galbraith
  2011-11-16 18:37     ` Suresh Siddha
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-16  9:24 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:

> How about this patch which is more self explanatory?

Actually, after further testing/reading, it looks to me like both of
these patches have a problem.  They'll never select SMT siblings (so a
skip SIBLING should accomplish the same).

This patch didn't select an idle core either though, where Peter's did.

Tested by pinning a hog to cores 1-3, then starting up an unpinned
tbench pair.  Peter's patch didn't do the BadThing (as in bad for
TCP_RR/tbench) in that case, but should have.

> +		sg = sd->groups;
> +		do {
> +			if (!cpumask_intersects(sched_group_cpus(sg),
> +						tsk_cpus_allowed(p)))
> +				goto next;
>  
> +			for_each_cpu(i, sched_group_cpus(sg)) {
> +				if (!idle_cpu(i))
> +					goto next;

Say target is CPU0.  Groups are 0,4 1,5 2,6 3,7.  0-3 are first CPUs
encountered in MC groups, all were busy.  At SIBLING level, the only
group is 0,4.  First encountered CPU of sole group is busy target, so
we're done.. so we return busy target.

> +			target = cpumask_first_and(sched_group_cpus(sg),
> +					tsk_cpus_allowed(p));

At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

	-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16  9:24   ` Mike Galbraith
@ 2011-11-16 18:37     ` Suresh Siddha
  2011-11-17  1:59       ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Suresh Siddha @ 2011-11-16 18:37 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:
> On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:
> 
> > How about this patch which is more self explanatory?
> 
> Actually, after further testing/reading, it looks to me like both of
> these patches have a problem.  They'll never select SMT siblings (so a
> skip SIBLING should accomplish the same).
> 
> This patch didn't select an idle core either though, where Peter's did.
> 
> Tested by pinning a hog to cores 1-3, then starting up an unpinned
> tbench pair.  Peter's patch didn't do the BadThing (as in bad for
> TCP_RR/tbench) in that case, but should have.
> 
> > +		sg = sd->groups;
> > +		do {
> > +			if (!cpumask_intersects(sched_group_cpus(sg),
> > +						tsk_cpus_allowed(p)))
> > +				goto next;
> >  
> > +			for_each_cpu(i, sched_group_cpus(sg)) {
> > +				if (!idle_cpu(i))
> > +					goto next;
> 
> Say target is CPU0.  Groups are 0,4 1,5 2,6 3,7.  0-3 are first CPUs
> encountered in MC groups, all were busy.  At SIBLING level, the only
> group is 0,4.  First encountered CPU of sole group is busy target, so
> we're done.. so we return busy target.
> 
> > +			target = cpumask_first_and(sched_group_cpus(sg),
> > +					tsk_cpus_allowed(p));
> 
> At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

Mike, At the sibling level, domain span will be 0,4 which is 0x5. But
there are two individual groups. First group just contains cpu0 and the
second group contains cpu4.

So if cpu0 is busy, we will check the next group to see if it is idle
(which is cpu4 in your example). So we will return cpu-4.
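
For a hypothetical 0/4 SMT pair the layout being described is roughly:

        SIBLING domain of cpu0:  span   = { 0, 4 }
                                 groups = { 0 } -> { 4 } -> back to { 0 }

i.e. at this level every thread is a group of its own.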

It should be ok. Isn't it?

thanks,
suresh



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16 18:37     ` Suresh Siddha
@ 2011-11-17  1:59       ` Mike Galbraith
  2011-11-17 15:38         ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-17  1:59 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:

> > At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.
> 
> Mike, At the sibling level, domain span will be 0,4 which is 0x5. But
> there are two individual groups. First group just contains cpu0 and the
> second group contains cpu4.
> 
> So if cpu0 is busy, we will check the next group to see if it is idle
> (which is cpu4 in your example). So we will return cpu-4.

Oh duh, right.

> It should be ok. Isn't it?

Nope, wasn't ok.  I'll double check today though, and bend, spindle
and/or mutilate as required.

	-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17  1:59       ` Mike Galbraith
@ 2011-11-17 15:38         ` Mike Galbraith
  2011-11-17 15:56           ` Peter Zijlstra
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-17 15:38 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:

> > It should be ok. Isn't it?
> 
> Nope, wasn't ok.

Because you can't get to again: with a single E5620 package, it having
only SIBLING and MC domains.
 
again:
        for_each_domain(target, sd) {
                if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
                        continue;

                if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
                        if (!smt) {
                                smt = 1;
                                goto again;



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:38         ` Mike Galbraith
@ 2011-11-17 15:56           ` Peter Zijlstra
  2011-11-17 16:38             ` Mike Galbraith
  2011-11-17 19:08             ` Suresh Siddha
  0 siblings, 2 replies; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-17 15:56 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> 
> > > It should be ok. Isn't it?
> > 
> > Nope, wasn't ok.
> 
> Because you can't get to again: with a single E5620 package, it having
> only SIBLING and MC domains.
>  
> again:
>         for_each_domain(target, sd) {
>                 if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
>                         continue;
> 
>                 if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
>                         if (!smt) {
>                                 smt = 1;
>                                 goto again;
> 

D'0h, indeed.. 

Something like the below maybe, although I'm certain it all can be
written much nicer indeed.

---
 kernel/sched/fair.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..340e62e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
 		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
 			continue;
 
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
+		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
+			break;
+
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
-		}
 
 		sg = sd->groups;
 		do {
@@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+	if (!smt) {
+		smt = 1;
+		goto again;
+	}
 done:
 	rcu_read_unlock();
 




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:56           ` Peter Zijlstra
@ 2011-11-17 16:38             ` Mike Galbraith
  2011-11-17 17:36               ` Suresh Siddha
  2011-11-17 19:08             ` Suresh Siddha
  1 sibling, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-17 16:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote:
> > On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> > > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> > 
> > > > It should be ok. Isn't it?
> > > 
> > > Nope, wasn't ok.
> > 
> > Because you can't get to again: with a single E5620 package, it having
> > only SIBLING and MC domains.
> >  
> > again:
> >         for_each_domain(target, sd) {
> >                 if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> >                         continue;
> > 
> >                 if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> >                         if (!smt) {
> >                                 smt = 1;
> >                                 goto again;
> > 
> 
> D'0h, indeed.. 
> 
> Something like the below maybe, although I'm certain it all can be
> written much nicer indeed.

I'll give it a go.

Squabbling with bouncing buddies in an isolated and otherwise idle
cpuset ate my day.
 
> ---
>  kernel/sched/fair.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cd3b642..340e62e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
>  			continue;
>  
> -		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> -			if (!smt) {
> -				smt = 1;
> -				goto again;
> -			}
> +		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
> +			break;
> +
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
>  			break;
> -		}
>  
>  		sg = sd->groups;
>  		do {
> @@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  			sg = sg->next;
>  		} while (sg != sd->groups);
>  	}
> +	if (!smt) {
> +		smt = 1;
> +		goto again;
> +	}
>  done:
>  	rcu_read_unlock();
>  
> 
> 




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 16:38             ` Mike Galbraith
@ 2011-11-17 17:36               ` Suresh Siddha
  2011-11-18 15:14                 ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Suresh Siddha @ 2011-11-17 17:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > Something like the below maybe, although I'm certain it all can be
> > written much nicer indeed.
> 
> I'll give it a go.
> 
> Squabbling with bouncing buddies in an isolated and otherwise idle
> cpuset ate my day.
>  

Well, looks like I managed to have a similar issue in my patch too.
Anyway, here is the updated, cleaned-up version of the patch ;)
---

Avoid select_idle_sibling() picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   57 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..020483a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2271,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2292,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:56           ` Peter Zijlstra
  2011-11-17 16:38             ` Mike Galbraith
@ 2011-11-17 19:08             ` Suresh Siddha
  2011-11-18 15:12               ` Peter Zijlstra
  2011-12-06  9:49               ` [tip:sched/core] sched: Clean up domain traversal in select_idle_sibling() tip-bot for Suresh Siddha
  1 sibling, 2 replies; 89+ messages in thread
From: Suresh Siddha @ 2011-11-17 19:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 07:56 -0800, Peter Zijlstra wrote:
> D'0h, indeed.. 
> 
> Something like the below maybe, although I'm certain it all can be
> written much nicer indeed.
> 

Peter, I just noticed that the -tip tree has the original proposed patch
and the new sched/ directory. So updated my cleanup patch accordingly.
Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched: cleanup domain traversal in select_idle_sibling

Instead of going through the scheduler domain hierarchy multiple times
(for giving priority to an idle core over an idle SMT sibling in a busy
core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
flag and traverse the domain hierarchy down till we find an idle group.

This cleanup also addresses an issue reported by Mike where the recent
changes returned the busy thread even in the presence of an idle SMT
sibling in single socket platforms.
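
Sketched for a single-socket SMT box (domain names illustrative), the
resulting walk is:

        MC       (highest domain with SD_SHARE_PKG_RESOURCES):
                 scan the core groups, take a fully idle core if any
        SIBLING  (sd->child, only if no idle core was found):
                 scan the per-thread groups, take an idle thread if any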

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c  |   38 +++++++++++++++++++++++++-------------
 kernel/sched/sched.h |    2 ++
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..537e16a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,6 +2642,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2651,7 +2673,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i, smt = 0;
+	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2671,19 +2693,9 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-again:
-	for_each_domain(target, sd) {
-		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
-			continue;
-
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
-			break;
-		}
 
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
 			if (!cpumask_intersects(sched_group_cpus(sg),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c2e7802..8715055 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -501,6 +501,8 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 19:08             ` Suresh Siddha
@ 2011-11-18 15:12               ` Peter Zijlstra
  2011-11-18 15:26                 ` Mike Galbraith
  2011-12-06  9:49               ` [tip:sched/core] sched: Clean up domain traversal in select_idle_sibling() tip-bot for Suresh Siddha
  1 sibling, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-18 15:12 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote:
> 
> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Subject: sched: cleanup domain traversal in select_idle_sibling
> 
> Instead of going through the scheduler domain hierarchy multiple times
> (for giving priority to an idle core over an idle SMT sibling in a busy
> core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
> flag and traverse the domain hierarchy down till we find an idle group.
> 
> This cleanup also addresses an issue reported by Mike where the recent
> changes returned the busy thread even in the presence of an idle SMT
> sibling in single socket platforms.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Cc: Mike Galbraith <efault@gmx.de> 

Thanks Suresh!


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 17:36               ` Suresh Siddha
@ 2011-11-18 15:14                 ` Mike Galbraith
  2011-11-18 15:17                   ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Mike Galbraith
                                     ` (6 more replies)
  0 siblings, 7 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:14 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 09:36 -0800, Suresh Siddha wrote:
> On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> > On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > > Something like the below maybe, although I'm certain it all can be
> > > written much nicer indeed.
> > 
> > I'll give it a go.
> > 
> > Squabbling with bouncing buddies in an isolated and otherwise idle
> > cpuset ate my day.
> >  
> 
> Well looks like I managed to have the similar issue in my patch too.
> Anyways here is the updated cleaned up version of the patch ;)

Works fine.  However, unpinned buddies bounce more than with virgin
mainline.  I tried doing it differently (mikie in numbers below), and it
worked for a single unbound pair, but badly hurt multiple unbound pairs.

---
 kernel/sched_fair.c |   10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

Index: linux-3.0-tip/kernel/sched_fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched_fair.c
+++ linux-3.0-tip/kernel/sched_fair.c
@@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
 		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
 			if (idle_cpu(i)) {
 				target = i;
+				if (sd->flags & SD_SHARE_CPUPOWER)
+					continue;
 				break;
 			}
 		}
-
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
 	}
 	rcu_read_unlock();
 
 

mikie2 is your patch + twiddles I'll post as a reply to this post.
  
kernel v3.2-rc1-306-g7f80850

TTWU_QUEUE off (skews results), test in cpuset 1-3,5-7

Test1: one unbound TCP_RR pair, three runs

virgin      66611.73      71376.00      61297.09                       avg 66428.27   1.000
suresh      68488.88      68412.48      68149.73 (bounce)                  68350.36   1.028
mikie       75925.91      75851.63      74617.29 (bounce--)                75464.94   1.136
mikie2      71403.39      71396.73      72258.91 NO_SIBLING_LIMIT_SYNC     71686.34   1.079
mikie2     139210.06     140485.95     140189.95 SIBLING_LIMIT_SYNC       139961.98   2.106


Test2: one unbound TCP_RR pair plus 2 unbound hogs, three runs

virgin      87108.59      88737.30      87383.98                       avg 87743.29  1.000
suresh      84281.24      84725.07      84823.57                           84931.93   .967
mikie       87850.37      86081.73      85789.49                           86573.86   .986
mikie2      92613.79      92022.95      92014.26 NO_SIBLING_LIMIT_SYNC     92217.00  1.050
mikie2     134682.16     133497.30     133584.48 SIBLING_LIMIT_SYNC


Test3: three unbound TCP_RR pairs, single run

virgin      55246.99      55138.67      55248.95                       avg 55211.53  1.000
suresh      53141.24      53165.45      53224.71                           53177.13   .963
mikie       47627.14      47361.68      47389.41                           47459.41   .859
mikie2      57969.49      57704.79      58218.14 NO_SIBLING_LIMIT_SYNC     57964.14  1.049
mikie2     132205.11     133726.94     133706.09 SIBLING_LIMIT_SYNC       133212.71  2.412


Test4: three bound TCP_RR pairs, single run

virgin     130073.67     130202.02     131666.48                      avg 130647.39  1.000
suresh     129805.98     128058.25     128709.77                          128858.00   .986
mikie      125597.11     127260.39     127208.73                          126688.74   .969
mikie2     135441.58     134961.89     137162.00                          135855.15  1.039


Test5: drop shield, tbench 8

virgin     2118.26 MB/sec  1.000
suresh     2036.32 MB/sec   .961
mikie      2051.18 MB/sec   .968
mikie2     2125.21 MB/sec  1.003  (hohum, all within tbench jitter)

Problem reference: select_idle_sibling() = painful L2 misses with westmere.

Identical configs, nohz=off NO_TTWU_QUEUE,
processor.max_cstate=0 intel_idle.max_cstate=0
turbo-boost off (so both are now plain 2.4GHz boxen)

single bound TCP_RR pair
              E5620      Q6600  bound
           90196.84   42517.96  3->0
           92654.92   43946.50  3->1
           91735.26   95274.10  3->2
          129394.55   95266.83  3->3
           89127.98             3->4
           91303.15             3->5
           91345.85             3->6
           74141.88             3->7  huh?.. load is synchronous!




* [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles
  2011-11-18 15:14                 ` Mike Galbraith
@ 2011-11-18 15:17                   ` Mike Galbraith
  2011-11-18 15:35                     ` Peter Zijlstra
  2011-11-18 15:39                     ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Hillf Danton
  2011-11-18 15:18                   ` [patch 2/6] sched: convert rq->avg_idle to rq->avg_event Mike Galbraith
                                     ` (5 subsequent siblings)
  6 siblings, 2 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:17 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner


rt.nr_cpus_allowed is always available, use it to bail from select_task_rq()
when only one cpu can be used, and save some cycles for pinned tasks.

taskset -c 3 pipe-test

   PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
------------------------------------------------------------------------------------------------

             Virgin                                    Patched
             samples  pcnt function                    samples  pcnt function
             _______ _____ ___________________________ _______ _____ ___________________________

             2880.00 10.2% __schedule                  3136.00 11.3% __schedule
             1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
             1458.00  5.2% system_call                 1534.00  5.5% system_call
             1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
             1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
             1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
             1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
              872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
              687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
              682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
              643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
              617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
              612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
              596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
              542.00  1.9% sysret_check                 559.00  2.0% sysret_check
              467.00  1.7% fget_light                   472.00  1.7% fget_light
              462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
              437.00  1.5% vfs_write                    442.00  1.6% vfs_write
              431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
*             413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
              386.00  1.4% update_curr                  402.00  1.4% update_curr
              385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
              377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
              369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
              360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
              342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched_fair.c |    3 +++
 kernel/sched_rt.c   |    3 +++
 2 files changed, 6 insertions(+)

Index: linux-3.2.git/kernel/sched_fair.c
===================================================================
--- linux-3.2.git.orig/kernel/sched_fair.c
+++ linux-3.2.git/kernel/sched_fair.c
@@ -2340,6 +2340,9 @@ select_task_rq_fair(struct task_struct *
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
+	if (p->rt.nr_cpus_allowed < 2)
+		return prev_cpu;
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
Index: linux-3.2.git/kernel/sched_rt.c
===================================================================
--- linux-3.2.git.orig/kernel/sched_rt.c
+++ linux-3.2.git/kernel/sched_rt.c
@@ -999,6 +999,9 @@ select_task_rq_rt(struct task_struct *p,
 
 	cpu = task_cpu(p);
 
+	if (p->rt.nr_cpus_allowed < 2)
+		goto out;
+
 	/* For anything but wake ups, just return the task_cpu */
 	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
 		goto out;




* [patch 2/6] sched: convert rq->avg_idle to rq->avg_event
  2011-11-18 15:14                 ` Mike Galbraith
  2011-11-18 15:17                   ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Mike Galbraith
@ 2011-11-18 15:18                   ` Mike Galbraith
  2011-11-18 15:19                   ` [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
                                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:18 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner


We update rq->clock only at points of interest to the scheduler.
Using this distance has the same effect as measuring idle time
for idle_balance() throttling, and allows other uses as well.
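
For reference, update_avg() is the existing 1/8-weight filter, so each
rq->clock update does roughly

        avg_event += (delta - avg_event) / 8

making rq->avg_event an exponentially weighted average of the time
between scheduler clock updates.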

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched.c       |   32 +++++++++++---------------------
 kernel/sched_debug.c |    2 +-
 kernel/sched_fair.c  |    8 ++------
 3 files changed, 14 insertions(+), 28 deletions(-)

Index: linux-3.2.git/kernel/sched.c
===================================================================
--- linux-3.2.git.orig/kernel/sched.c
+++ linux-3.2.git/kernel/sched.c
@@ -656,8 +656,7 @@ struct rq {
 
 	u64 rt_avg;
 	u64 age_stamp;
-	u64 idle_stamp;
-	u64 avg_idle;
+	u64 avg_event;
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
@@ -789,6 +788,14 @@ static inline struct task_group *task_gr
 
 #endif /* CONFIG_CGROUP_SCHED */
 
+static void update_avg(u64 *avg, u64 sample)
+{
+#ifdef CONFIG_SMP
+	s64 diff = sample - *avg;
+	*avg += diff >> 3;
+#endif
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta);
 
 static void update_rq_clock(struct rq *rq)
@@ -801,6 +808,7 @@ static void update_rq_clock(struct rq *r
 	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
 	rq->clock += delta;
 	update_rq_clock_task(rq, delta);
+	update_avg(&rq->avg_event, delta);
 }
 
 /*
@@ -2593,12 +2601,6 @@ int select_task_rq(struct task_struct *p
 
 	return cpu;
 }
-
-static void update_avg(u64 *avg, u64 sample)
-{
-	s64 diff = sample - *avg;
-	*avg += diff >> 3;
-}
 #endif
 
 static void
@@ -2664,17 +2666,6 @@ ttwu_do_wakeup(struct rq *rq, struct tas
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
-
-	if (rq->idle_stamp) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
 #endif
 }
 
@@ -8294,8 +8285,7 @@ void __init sched_init(void)
 		rq->push_cpu = 0;
 		rq->cpu = i;
 		rq->online = 0;
-		rq->idle_stamp = 0;
-		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->avg_event = 0;
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
Index: linux-3.2.git/kernel/sched_debug.c
===================================================================
--- linux-3.2.git.orig/kernel/sched_debug.c
+++ linux-3.2.git/kernel/sched_debug.c
@@ -290,7 +290,7 @@ static void print_cpu(struct seq_file *m
 	P(sched_count);
 	P(sched_goidle);
 #ifdef CONFIG_SMP
-	P64(avg_idle);
+	P64(avg_event);
 #endif
 
 	P(ttwu_count);
Index: linux-3.2.git/kernel/sched_fair.c
===================================================================
--- linux-3.2.git.orig/kernel/sched_fair.c
+++ linux-3.2.git/kernel/sched_fair.c
@@ -4190,9 +4190,7 @@ static void idle_balance(int this_cpu, s
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 
-	this_rq->idle_stamp = this_rq->clock;
-
-	if (this_rq->avg_idle < sysctl_sched_migration_cost)
+	if (this_rq->avg_event < sysctl_sched_migration_cost)
 		return;
 
 	/*
@@ -4218,10 +4216,8 @@ static void idle_balance(int this_cpu, s
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
-			this_rq->idle_stamp = 0;
+		if (pulled_task)
 			break;
-		}
 	}
 	rcu_read_unlock();
 




* [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-18 15:14                 ` Mike Galbraith
  2011-11-18 15:17                   ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Mike Galbraith
  2011-11-18 15:18                   ` [patch 2/6] sched: convert rq->avg_idle to rq->avg_event Mike Galbraith
@ 2011-11-18 15:19                   ` Mike Galbraith
  2011-11-18 15:36                     ` Peter Zijlstra
  2011-11-18 15:20                   ` [patch 4/6] sched: ratelimit select_idle_sibling()for sync wakeups Mike Galbraith
                                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:19 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner


Entering nohz at high frequency eats cycles; ratelimit it
just as we do idle_balance(), and for the same reason.
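
(With the default sysctl_sched_migration_cost of 500000 ns -- an
assumption about the baseline tree -- sched_needs_cpu() below keeps
the tick whenever rq->avg_event says scheduler events arrive more
often than roughly every 0.5 ms.)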

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 include/linux/sched.h    |    2 ++
 kernel/sched.c           |    7 +++++++
 kernel/time/tick-sched.c |    2 +-
 3 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-3.2.git/include/linux/sched.h
===================================================================
--- linux-3.2.git.orig/include/linux/sched.h
+++ linux-3.2.git/include/linux/sched.h
@@ -1979,8 +1979,10 @@ static inline void idle_task_exit(void)
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
+extern int sched_needs_cpu(int cpu);
 #else
 static inline void wake_up_idle_cpu(int cpu) { }
+static inline int sched_needs_cpu(int cpu) { return 0; }
 #endif
 
 extern unsigned int sysctl_sched_latency;
Index: linux-3.2.git/kernel/sched.c
===================================================================
--- linux-3.2.git.orig/kernel/sched.c
+++ linux-3.2.git/kernel/sched.c
@@ -1419,6 +1419,13 @@ static inline bool got_nohz_idle_kick(vo
 	return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
 }
 
+#ifdef CONFIG_SMP
+int sched_needs_cpu(int cpu)
+{
+	return cpu_rq(cpu)->avg_event < sysctl_sched_migration_cost;
+}
+#endif
+
 #else /* CONFIG_NO_HZ */
 
 static inline bool got_nohz_idle_kick(void)
Index: linux-3.2.git/kernel/time/tick-sched.c
===================================================================
--- linux-3.2.git.orig/kernel/time/tick-sched.c
+++ linux-3.2.git/kernel/time/tick-sched.c
@@ -352,7 +352,7 @@ void tick_nohz_stop_sched_tick(int inidl
 	} while (read_seqretry(&xtime_lock, seq));
 
 	if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
-	    arch_needs_cpu(cpu)) {
+	    arch_needs_cpu(cpu) || sched_needs_cpu(cpu)) {
 		next_jiffies = last_jiffies + 1;
 		delta_jiffies = 1;
 	} else {




* [patch 4/6] sched: ratelimit select_idle_sibling()for sync wakeups
  2011-11-18 15:14                 ` Mike Galbraith
                                     ` (2 preceding siblings ...)
  2011-11-18 15:19                   ` [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
@ 2011-11-18 15:20                   ` Mike Galbraith
  2011-11-18 15:22                   ` [patch 5/6] sched: save some hrtick_start_fair cycles Mike Galbraith
                                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:20 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner


select_idle_sibling() injures synchronous loads horribly
on processors with L3 (ala westmere) when called at high
frequency.  Cut it off at 40 KHz (25usec event rate) when
waking sync, in lieu of an inter-domain cache penalty.
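
The cutoff below works out to NSEC_PER_SEC / 40000 = 25000 ns, i.e. a
sync wakeup skips the idle-sibling scan whenever rq->avg_event (the
average time between scheduler clock updates from patch 2/6) is under
25 usec.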

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched_fair.c     |   20 ++++++++++++++++++--
 kernel/sched_features.h |    6 ++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

Index: linux-3.2.git/kernel/sched_fair.c
===================================================================
--- linux-3.2.git.orig/kernel/sched_fair.c
+++ linux-3.2.git/kernel/sched_fair.c
@@ -2264,6 +2264,21 @@ static inline struct sched_domain *highe
 }
 
 /*
+ * select_idle_sibling() injures synchronous loads horribly
+ * on processors with L3 (ala westmere) when called at high
+ * frequency.  Cut it off at 40 KHz (25usec event rate) when
+ * waking sync, in lieu of an inter-domain cache penalty.
+ */
+#define SIBLING_SYNC_CUTOFF_NS (NSEC_PER_SEC/40000UL)
+
+static int idle_sibling_limit(int target, int sync)
+{
+	if (!sync || !sched_feat(SIBLING_LIMIT_SYNC))
+		return 0;
+	return cpu_rq(target)->avg_event < SIBLING_SYNC_CUTOFF_NS;
+}
+
+/*
  * Try and locate an idle CPU in the sched_domain.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
@@ -2400,9 +2415,10 @@ select_task_rq_fair(struct task_struct *
 
 	if (affine_sd) {
 		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
+			new_cpu = cpu;
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		if (!idle_sibling_limit(new_cpu, sync))
+			new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}
 
Index: linux-3.2.git/kernel/sched_features.h
===================================================================
--- linux-3.2.git.orig/kernel/sched_features.h
+++ linux-3.2.git/kernel/sched_features.h
@@ -67,3 +67,9 @@ SCHED_FEAT(NONTASK_POWER, 1)
 SCHED_FEAT(TTWU_QUEUE, 1)
 
 SCHED_FEAT(FORCE_SD_OVERLAP, 0)
+
+/*
+ * Restrict the frequency at which select_idle_sibling() may be called
+ * for synchronous wakeups.
+ */
+SCHED_FEAT(SIBLING_LIMIT_SYNC, 1)




* [patch 5/6] sched: save some hrtick_start_fair cycles
  2011-11-18 15:14                 ` Mike Galbraith
                                     ` (3 preceding siblings ...)
  2011-11-18 15:20                   ` [patch 4/6] sched: ratelimit select_idle_sibling()for sync wakeups Mike Galbraith
@ 2011-11-18 15:22                   ` Mike Galbraith
  2011-11-18 15:23                   ` [patch 6/6] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
  2012-02-20 14:41                   ` sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
  6 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:22 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner


hrtick_start_fair() shows up in profiles even when disabled.

v3.0.6

taskset -c 3 pipe-test

   PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
------------------------------------------------------------------------------------------------

             Virgin                                    Patched
             samples  pcnt function                    samples  pcnt function
             _______ _____ ___________________________ _______ _____ ___________________________

             2880.00 10.2% __schedule                  3136.00 11.3% __schedule
             1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
             1458.00  5.2% system_call                 1534.00  5.5% system_call
             1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
             1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
             1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
             1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
              872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
              687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
              682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
              643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
              617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
              612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
              596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
              542.00  1.9% sysret_check                 559.00  2.0% sysret_check
              467.00  1.7% fget_light                   472.00  1.7% fget_light
              462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
              437.00  1.5% vfs_write                    442.00  1.6% vfs_write
              431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
              413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
              386.00  1.4% update_curr                  402.00  1.4% update_curr
              385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
              377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
              369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
              360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
*             342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched_fair.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-3.2.git/kernel/sched_fair.c
===================================================================
--- linux-3.2.git.orig/kernel/sched_fair.c
+++ linux-3.2.git/kernel/sched_fair.c
@@ -1853,7 +1853,7 @@ static void hrtick_start_fair(struct rq
 
 	WARN_ON(task_rq(p) != rq);
 
-	if (hrtick_enabled(rq) && cfs_rq->nr_running > 1) {
+	if (cfs_rq->nr_running > 1) {
 		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 		s64 delta = slice - ran;
@@ -1884,7 +1884,7 @@ static void hrtick_update(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 
-	if (curr->sched_class != &fair_sched_class)
+	if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
 		return;
 
 	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
@@ -2643,7 +2643,8 @@ static struct task_struct *pick_next_tas
 	} while (cfs_rq);
 
 	p = task_of(se);
-	hrtick_start_fair(rq, p);
+	if (hrtick_enabled(rq))
+		hrtick_start_fair(rq, p);
 
 	return p;
 }




* [patch 6/6] sched: set skip_clock_update in yield_task_fair()
  2011-11-18 15:14                 ` Mike Galbraith
                                     ` (4 preceding siblings ...)
  2011-11-18 15:22                   ` [patch 5/6] sched: save some hrtick_start_fair cycles Mike Galbraith
@ 2011-11-18 15:23                   ` Mike Galbraith
  2012-02-20 14:41                   ` sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
  6 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:23 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner


This is another case where we are on our way to schedule(),
so we can save a useless clock update and the resulting microscopic
vruntime update.

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched.c      |    8 +++++++-
 kernel/sched_fair.c |    6 ++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-3.2.git/kernel/sched.c
===================================================================
--- linux-3.2.git.orig/kernel/sched.c
+++ linux-3.2.git/kernel/sched.c
@@ -5838,7 +5838,13 @@ again:
 		 */
 		if (preempt && rq != p_rq)
 			resched_task(p_rq->curr);
-	}
+	} else
+		/*
+		 * We might have set it in task_yield_fair(), but are
+		 * not going to schedule(), so don't want to skip
+		 * the next update.
+		 */
+		rq->skip_clock_update = 0;
 
 out:
 	double_rq_unlock(rq, p_rq);
Index: linux-3.2.git/kernel/sched_fair.c
===================================================================
--- linux-3.2.git.orig/kernel/sched_fair.c
+++ linux-3.2.git/kernel/sched_fair.c
@@ -2688,6 +2688,12 @@ static void yield_task_fair(struct rq *r
 		 * Update run-time statistics of the 'current'.
 		 */
 		update_curr(cfs_rq);
+		/*
+		 * Tell update_rq_clock() that we've just updated,
+		 * so we don't do microscopic update in schedule()
+		 * and double the fastpath cost.
+		 */
+		 rq->skip_clock_update = 1;
 	}
 
 	set_skip_buddy(se);




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-18 15:12               ` Peter Zijlstra
@ 2011-11-18 15:26                 ` Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Fri, 2011-11-18 at 16:12 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote:
> > 
> > From: Suresh Siddha <suresh.b.siddha@intel.com>
> > Subject: sched: cleanup domain traversal in select_idle_sibling
> > 
> > Instead of going through the scheduler domain hierarchy multiple times
> > (for giving priority to an idle core over an idle SMT sibling in a busy
> > core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
> > flag and traverse the domain hierarchy down till we find an idle group.
> > 
> > This cleanup also addresses an issue reported by Mike where the recent
> > changes returned the busy thread even in the presence of an idle SMT
> > sibling in single socket platforms.
> > 
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > Cc: Mike Galbraith <efault@gmx.de> 
> 
> Thanks Suresh!

Tested-by: Mike Galbraith <efault@gmx.de>




* Re: [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles
  2011-11-18 15:17                   ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Mike Galbraith
@ 2011-11-18 15:35                     ` Peter Zijlstra
  2011-11-18 17:34                       ` Mike Galbraith
  2011-11-22 14:17                       ` Mike Galbraith
  2011-11-18 15:39                     ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Hillf Danton
  1 sibling, 2 replies; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-18 15:35 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Fri, 2011-11-18 at 16:17 +0100, Mike Galbraith wrote:
>  kernel/sched_fair.c |    3 +++
>  kernel/sched_rt.c   |    3 +++

I hate to be the bringer of bad news, but you're going to have to respin
all this; these files no longer exist ;-)

Please update your tip/master.

Also, < 2 ? nr_cpus_allowed being 0 is somewhat of a problem, so the
only remaining option is == 1.


* Re: [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-18 15:19                   ` [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
@ 2011-11-18 15:36                     ` Peter Zijlstra
  2011-11-18 17:42                       ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-18 15:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Van De Ven, Arjan

On Fri, 2011-11-18 at 16:19 +0100, Mike Galbraith wrote:
> Entering nohz at high frequency eats cycles, ratelimit it
> just as we do idle_balance(), and for the same reason.
> 
> Signed-off-by: Mike Galbraith <efault@gmx.de>

Arjan, can we do something smart with the idle governor which already
guestimates the idle duration for us?

> ---
>  include/linux/sched.h    |    2 ++
>  kernel/sched.c           |    7 +++++++
>  kernel/time/tick-sched.c |    2 +-
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> Index: linux-3.2.git/include/linux/sched.h
> ===================================================================
> --- linux-3.2.git.orig/include/linux/sched.h
> +++ linux-3.2.git/include/linux/sched.h
> @@ -1979,8 +1979,10 @@ static inline void idle_task_exit(void)
>  
>  #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
>  extern void wake_up_idle_cpu(int cpu);
> +extern int sched_needs_cpu(int cpu);
>  #else
>  static inline void wake_up_idle_cpu(int cpu) { }
> +static inline void sched_needs_cpu(int cpu) { }
>  #endif
>  
>  extern unsigned int sysctl_sched_latency;
> Index: linux-3.2.git/kernel/sched.c
> ===================================================================
> --- linux-3.2.git.orig/kernel/sched.c
> +++ linux-3.2.git/kernel/sched.c
> @@ -1419,6 +1419,13 @@ static inline bool got_nohz_idle_kick(vo
>  	return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
>  }
>  
> +#ifdef CONFIG_SMP
> +int sched_needs_cpu(int cpu)
> +{
> +	return cpu_rq(cpu)->avg_event < sysctl_sched_migration_cost;
> +}
> +#endif
> +
>  #else /* CONFIG_NO_HZ */
>  
>  static inline bool got_nohz_idle_kick(void)
> Index: linux-3.2.git/kernel/time/tick-sched.c
> ===================================================================
> --- linux-3.2.git.orig/kernel/time/tick-sched.c
> +++ linux-3.2.git/kernel/time/tick-sched.c
> @@ -352,7 +352,7 @@ void tick_nohz_stop_sched_tick(int inidl
>  	} while (read_seqretry(&xtime_lock, seq));
>  
>  	if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
> -	    arch_needs_cpu(cpu)) {
> +	    arch_needs_cpu(cpu) || sched_needs_cpu(cpu)) {
>  		next_jiffies = last_jiffies + 1;
>  		delta_jiffies = 1;
>  	} else {
> 
> 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles
  2011-11-18 15:17                   ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Mike Galbraith
  2011-11-18 15:35                     ` Peter Zijlstra
@ 2011-11-18 15:39                     ` Hillf Danton
  1 sibling, 0 replies; 89+ messages in thread
From: Hillf Danton @ 2011-11-18 15:39 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Fri, Nov 18, 2011 at 11:17 PM, Mike Galbraith <efault@gmx.de> wrote:
>
> rt.nr_cpus_allowed is always available, use it to bail from select_task_rq()
> when only one cpu can be used, and saves some cycles for pinned tasks.
>
> taskset -c 3 pipe-test
>
>   PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
> ------------------------------------------------------------------------------------------------
>
>             Virgin                                    Patched
>             samples  pcnt function                    samples  pcnt function
>             _______ _____ ___________________________ _______ _____ ___________________________
>
>             2880.00 10.2% __schedule                  3136.00 11.3% __schedule
>             1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
>             1458.00  5.2% system_call                 1534.00  5.5% system_call
>             1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
>             1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
>             1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
>             1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
>              872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
>              687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
>              682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
>              643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
>              617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
>              612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
>              596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
>              542.00  1.9% sysret_check                 559.00  2.0% sysret_check
>              467.00  1.7% fget_light                   472.00  1.7% fget_light
>              462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
>              437.00  1.5% vfs_write                    442.00  1.6% vfs_write
>              431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
> *             413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
>              386.00  1.4% update_curr                  402.00  1.4% update_curr
>              385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
>              377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
>              369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
>              360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
>              342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
>

Acked-by: Hillf Danton <dhillf@gmail.com>

> ---
>  kernel/sched_fair.c |    3 +++
>  kernel/sched_rt.c   |    3 +++
>  2 files changed, 6 insertions(+)
>
> Index: linux-3.2.git/kernel/sched_fair.c
> ===================================================================
> --- linux-3.2.git.orig/kernel/sched_fair.c
> +++ linux-3.2.git/kernel/sched_fair.c
> @@ -2340,6 +2340,9 @@ select_task_rq_fair(struct task_struct *
>        int want_sd = 1;
>        int sync = wake_flags & WF_SYNC;
>
> +       if (p->rt.nr_cpus_allowed < 2)
> +               return prev_cpu;
> +
>        if (sd_flag & SD_BALANCE_WAKE) {
>                if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
>                        want_affine = 1;
> Index: linux-3.2.git/kernel/sched_rt.c
> ===================================================================
> --- linux-3.2.git.orig/kernel/sched_rt.c
> +++ linux-3.2.git/kernel/sched_rt.c
> @@ -999,6 +999,9 @@ select_task_rq_rt(struct task_struct *p,
>
>        cpu = task_cpu(p);
>
> +       if (p->rt.nr_cpus_allowed < 2)
> +               goto out;
> +
>        /* For anything but wake ups, just return the task_cpu */
>        if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
>                goto out;
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
>
>

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles
  2011-11-18 15:35                     ` Peter Zijlstra
@ 2011-11-18 17:34                       ` Mike Galbraith
  2011-11-22 14:17                       ` Mike Galbraith
  1 sibling, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 17:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Fri, 2011-11-18 at 16:35 +0100, Peter Zijlstra wrote:
> On Fri, 2011-11-18 at 16:17 +0100, Mike Galbraith wrote:
> >  kernel/sched_fair.c |    3 +++
> >  kernel/sched_rt.c   |    3 +++
> 
> I hate to be the bringer of bad news, but you're going to have to respin
> all this, these files no longer exist ;-)

No biggie, will do.

I'm just showing what was in the kernel to go with the numbers.

The nohz thing I dislike, but the problem isn't going away.  Don't like
the select_idle_sibling() ratelimit as is either, but I don't have a
better solution to this two faced little bugger.

> Please update your tip/master.

Yeah, these were master branch.  Was twiddling stable as usual first, as
you then don't have to worry about stuff changing (especially unrelated
stuff that still affects you) day to day, and munging results.

> Also, < 2 ? nr_cpus_allowed being 0 is somewhat of a problem, so the
> only remaining option is == 1.

True, but see select_task_rq_rt() :)

	-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-18 15:36                     ` Peter Zijlstra
@ 2011-11-18 17:42                       ` Mike Galbraith
  2011-11-19  0:51                         ` Van De Ven, Arjan
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-18 17:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Van De Ven, Arjan

On Fri, 2011-11-18 at 16:36 +0100, Peter Zijlstra wrote:
> On Fri, 2011-11-18 at 16:19 +0100, Mike Galbraith wrote:
> > Entering nohz at high frequency eats cycles, ratelimit it
> > just as we do idle_balance(), and for the same reason.
> > 
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
> 
> Arjan, can we do something smart with the idle governor which already
> guestimates the idle duration for us?

That would be great.  Won't help my Intel Q6600 though, the intel_idle
governor doesn't support it.  Other idle_governor gizmos really need to
get smarter too.

	-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [tip:sched/core] sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-15  9:46 sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
  2011-11-16  1:14 ` Suresh Siddha
@ 2011-11-18 23:40 ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 89+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-11-18 23:40 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, efault, peterz,
	suresh.b.siddha, tglx, mingo

Commit-ID:  4dcfe1025b513c2c1da5bf5586adb0e80148f612
Gitweb:     http://git.kernel.org/tip/4dcfe1025b513c2c1da5bf5586adb0e80148f612
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 10 Nov 2011 13:01:10 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 16 Nov 2011 08:43:43 +0100

sched: Avoid SMT siblings in select_idle_sibling() if possible

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

This fixes SMT balancing in the increasingly common case where there's
a shared cache core available to balance to.

Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1321350377.1421.55.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++--------------
 1 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 7e51b5b..ba0e1f4 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2326,7 +2326,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
-	int i;
+	struct sched_group *sg;
+	int i, smt = 0;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
+again:
 	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
+			continue;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+			if (!smt) {
+				smt = 1;
+				goto again;
 			}
+			break;
 		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* RE: [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-18 17:42                       ` Mike Galbraith
@ 2011-11-19  0:51                         ` Van De Ven, Arjan
  2011-11-19  4:15                           ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Van De Ven, Arjan @ 2011-11-19  0:51 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra
  Cc: Siddha, Suresh B, linux-kernel, Ingo Molnar, Paul Turner

> > Arjan, can we do something smart with the idle governor which already
> > guestimates the idle duration for us?
> 
> That would be great.  Won't help my Intel Q6600 though, the intel_idle
> governor doesn't support it.  Other idle_governor gizmos really need to
> get smarter too.


you're confusing the governor (menu) and the low level hardware driver.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* RE: [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-19  0:51                         ` Van De Ven, Arjan
@ 2011-11-19  4:15                           ` Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-19  4:15 UTC (permalink / raw)
  To: Van De Ven, Arjan
  Cc: Peter Zijlstra, Siddha, Suresh B, linux-kernel, Ingo Molnar, Paul Turner

On Fri, 2011-11-18 at 16:51 -0800, Van De Ven, Arjan wrote:
> > > Arjan, can we do something smart with the idle governor which already
> > > guestimates the idle duration for us?
> > 
> > That would be great.  Won't help my Intel Q6600 though, the intel_idle
> > governor doesn't support it.  Other idle_governor gizmos really need to
> > get smarter too.
> 
> 
> you're confusing the governor (menu) and the low level hardware driver.

Ok, I just saw that intel_idle_probe() says go away.

[    2.479695] intel_idle: MWAIT substates: 0x20
[    2.484144] intel_idle: does not run on family 6 model 15

	-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles
  2011-11-18 15:35                     ` Peter Zijlstra
  2011-11-18 17:34                       ` Mike Galbraith
@ 2011-11-22 14:17                       ` Mike Galbraith
  2011-11-22 14:18                         ` [patch 1/7] " Mike Galbraith
                                           ` (6 more replies)
  1 sibling, 7 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Fri, 2011-11-18 at 16:35 +0100, Peter Zijlstra wrote:
> On Fri, 2011-11-18 at 16:17 +0100, Mike Galbraith wrote:
> >  kernel/sched_fair.c |    3 +++
> >  kernel/sched_rt.c   |    3 +++
> 
> I hate to be the bringer of bad news, but you're going to have to respin
> all this, these files no longer exist ;-)
> 
> Please update your tip/master.

Ok, finally did that, here they come.

	-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [patch 1/7] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles
  2011-11-22 14:17                       ` Mike Galbraith
@ 2011-11-22 14:18                         ` Mike Galbraith
  2011-12-06  9:50                           ` [tip:sched/core] sched: Use " tip-bot for Mike Galbraith
  2011-11-22 14:20                         ` [patch 2/7] sched: save some hrtick_start_fair cycles Mike Galbraith
                                           ` (5 subsequent siblings)
  6 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner


rt.nr_cpus_allowed is always available; use it to bail from select_task_rq()
when only one cpu can be used, saving some cycles for pinned tasks.

taskset -c 3 pipe-test

   PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
------------------------------------------------------------------------------------------------

             Virgin                                    Patched
             samples  pcnt function                    samples  pcnt function
             _______ _____ ___________________________ _______ _____ ___________________________

             2880.00 10.2% __schedule                  3136.00 11.3% __schedule
             1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
             1458.00  5.2% system_call                 1534.00  5.5% system_call
             1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
             1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
             1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
             1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
              872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
              687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
              682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
              643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
              617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
              612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
              596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
              542.00  1.9% sysret_check                 559.00  2.0% sysret_check
              467.00  1.7% fget_light                   472.00  1.7% fget_light
              462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
              437.00  1.5% vfs_write                    442.00  1.6% vfs_write
              431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
*             413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
              386.00  1.4% update_curr                  402.00  1.4% update_curr
              385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
              377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
              369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
              360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
              342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched/fair.c |    3 +++
 kernel/sched/rt.c   |    3 +++
 2 files changed, 6 insertions(+)

Index: linux-3.0-tip/kernel/sched/fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/fair.c
+++ linux-3.0-tip/kernel/sched/fair.c
@@ -2730,6 +2730,9 @@ select_task_rq_fair(struct task_struct *
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
+	if (p->rt.nr_cpus_allowed == 1)
+		return prev_cpu;
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
Index: linux-3.0-tip/kernel/sched/rt.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/rt.c
+++ linux-3.0-tip/kernel/sched/rt.c
@@ -1200,6 +1200,9 @@ select_task_rq_rt(struct task_struct *p,
 
 	cpu = task_cpu(p);
 
+	if (p->rt.nr_cpus_allowed == 1)
+		goto out;
+
 	/* For anything but wake ups, just return the task_cpu */
 	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
 		goto out;




^ permalink raw reply	[flat|nested] 89+ messages in thread

* [patch 2/7] sched: save some hrtick_start_fair cycles
  2011-11-22 14:17                       ` Mike Galbraith
  2011-11-22 14:18                         ` [patch 1/7] " Mike Galbraith
@ 2011-11-22 14:20                         ` Mike Galbraith
  2011-12-06 20:20                           ` [tip:sched/core] sched: Save " tip-bot for Mike Galbraith
  2011-11-22 14:21                         ` [patch 3/7] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
                                           ` (4 subsequent siblings)
  6 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner


hrtick_start_fair() shows up in profiles even when disabled.

v3.0.6

taskset -c 3 pipe-test

   PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
------------------------------------------------------------------------------------------------

             Virgin                                    Patched
             samples  pcnt function                    samples  pcnt function
             _______ _____ ___________________________ _______ _____ ___________________________

             2880.00 10.2% __schedule                  3136.00 11.3% __schedule
             1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
             1458.00  5.2% system_call                 1534.00  5.5% system_call
             1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
             1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
             1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
             1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
              872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
              687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
              682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
              643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
              617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
              612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
              596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
              542.00  1.9% sysret_check                 559.00  2.0% sysret_check
              467.00  1.7% fget_light                   472.00  1.7% fget_light
              462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
              437.00  1.5% vfs_write                    442.00  1.6% vfs_write
              431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
              413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
              386.00  1.4% update_curr                  402.00  1.4% update_curr
              385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
              377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
              369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
              360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
*             342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched/fair.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-3.0-tip/kernel/sched/fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/fair.c
+++ linux-3.0-tip/kernel/sched/fair.c
@@ -2135,7 +2135,7 @@ static void hrtick_start_fair(struct rq
 
 	WARN_ON(task_rq(p) != rq);
 
-	if (hrtick_enabled(rq) && cfs_rq->nr_running > 1) {
+	if (cfs_rq->nr_running > 1) {
 		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 		s64 delta = slice - ran;
@@ -2166,7 +2166,7 @@ static void hrtick_update(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 
-	if (curr->sched_class != &fair_sched_class)
+	if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
 		return;
 
 	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
@@ -3017,7 +3017,8 @@ static struct task_struct *pick_next_tas
 	} while (cfs_rq);
 
 	p = task_of(se);
-	hrtick_start_fair(rq, p);
+	if (hrtick_enabled(rq))
+		hrtick_start_fair(rq, p);
 
 	return p;
 }



^ permalink raw reply	[flat|nested] 89+ messages in thread

* [patch 3/7] sched: set skip_clock_update in yield_task_fair()
  2011-11-22 14:17                       ` Mike Galbraith
  2011-11-22 14:18                         ` [patch 1/7] " Mike Galbraith
  2011-11-22 14:20                         ` [patch 2/7] sched: save some hrtick_start_fair cycles Mike Galbraith
@ 2011-11-22 14:21                         ` Mike Galbraith
  2011-11-23 11:53                           ` Peter Zijlstra
                                             ` (2 more replies)
  2011-11-22 14:22                         ` [patch 4/7] sched: convert rq->avg_idle to rq->avg_event Mike Galbraith
                                           ` (3 subsequent siblings)
  6 siblings, 3 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner


This is another case where we are on our way to schedule(),
so we can save a useless clock update and the resulting
microscopic vruntime update.

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched/core.c |    8 +++++++-
 kernel/sched/fair.c |    6 ++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-3.0-tip/kernel/sched/core.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/core.c
+++ linux-3.0-tip/kernel/sched/core.c
@@ -4547,7 +4547,13 @@ again:
 		 */
 		if (preempt && rq != p_rq)
 			resched_task(p_rq->curr);
-	}
+	} else
+		/*
+		 * We might have set it in task_yield_fair(), but are
+		 * not going to schedule(), so don't want to skip
+		 * the next update.
+		 */
+		rq->skip_clock_update = 0;
 
 out:
 	double_rq_unlock(rq, p_rq);
Index: linux-3.0-tip/kernel/sched/fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/fair.c
+++ linux-3.0-tip/kernel/sched/fair.c
@@ -3062,6 +3062,12 @@ static void yield_task_fair(struct rq *r
 		 * Update run-time statistics of the 'current'.
 		 */
 		update_curr(cfs_rq);
+		/*
+		 * Tell update_rq_clock() that we've just updated,
+		 * so we don't do microscopic update in schedule()
+		 * and double the fastpath cost.
+		 */
+		 rq->skip_clock_update = 1;
 	}
 
 	set_skip_buddy(se);
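
For reference, the flag set above is consumed by update_rq_clock(); a
simplified sketch of that consumer (it matches the context lines visible
in the clock-update debug patch later in this thread, but is not a literal
copy of any one tree) looks like:

	void update_rq_clock(struct rq *rq)
	{
		s64 delta;

		/* yield_task_fair() already ran update_curr(), so the
		 * clock is fresh enough; skip the redundant update on
		 * the way into schedule(). */
		if (rq->skip_clock_update > 0)
			return;

		delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
		rq->clock += delta;
		update_rq_clock_task(rq, delta);
	}

The saving is exactly the useless clock read and microscopic vruntime
update mentioned in the changelog, once per yield that reaches schedule().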



^ permalink raw reply	[flat|nested] 89+ messages in thread

* [patch 4/7] sched: convert rq->avg_idle to rq->avg_event
  2011-11-22 14:17                       ` Mike Galbraith
                                           ` (2 preceding siblings ...)
  2011-11-22 14:21                         ` [patch 3/7] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
@ 2011-11-22 14:22                         ` Mike Galbraith
  2011-11-23 11:55                           ` Peter Zijlstra
  2011-11-22 14:23                         ` [patch 5/7] sched: ratelimit select_idle_sibling() for sync wakeups Mike Galbraith
                                           ` (2 subsequent siblings)
  6 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner


We update rq->clock only at points of interest to the scheduler.
Using the average distance between those updates has the same effect
as measuring idle time for idle_balance() throttling, and allows
other uses as well.

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched/core.c  |   29 ++++++++++-------------------
 kernel/sched/debug.c |    2 +-
 kernel/sched/fair.c  |    8 ++------
 kernel/sched/sched.h |    3 +--
 4 files changed, 14 insertions(+), 28 deletions(-)

Index: linux-3.0-tip/kernel/sched/core.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/core.c
+++ linux-3.0-tip/kernel/sched/core.c
@@ -107,6 +107,14 @@ void start_bandwidth_timer(struct hrtime
 DEFINE_MUTEX(sched_domains_mutex);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
+static void update_avg(u64 *avg, u64 sample)
+{
+#ifdef CONFIG_SMP
+	s64 diff = sample - *avg;
+	*avg += diff >> 3;
+#endif
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta);
 
 void update_rq_clock(struct rq *rq)
@@ -119,6 +127,7 @@ void update_rq_clock(struct rq *rq)
 	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
 	rq->clock += delta;
 	update_rq_clock_task(rq, delta);
+	update_avg(&rq->avg_event, delta);
 }
 
 /*
@@ -1287,12 +1296,6 @@ int select_task_rq(struct task_struct *p
 
 	return cpu;
 }
-
-static void update_avg(u64 *avg, u64 sample)
-{
-	s64 diff = sample - *avg;
-	*avg += diff >> 3;
-}
 #endif
 
 static void
@@ -1358,17 +1361,6 @@ ttwu_do_wakeup(struct rq *rq, struct tas
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)
 		p->sched_class->task_woken(rq, p);
-
-	if (rq->idle_stamp) {
-		u64 delta = rq->clock - rq->idle_stamp;
-		u64 max = 2*sysctl_sched_migration_cost;
-
-		if (delta > max)
-			rq->avg_idle = max;
-		else
-			update_avg(&rq->avg_idle, delta);
-		rq->idle_stamp = 0;
-	}
 #endif
 }
 
@@ -6835,8 +6827,7 @@ void __init sched_init(void)
 		rq->push_cpu = 0;
 		rq->cpu = i;
 		rq->online = 0;
-		rq->idle_stamp = 0;
-		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->avg_event = 0;
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
Index: linux-3.0-tip/kernel/sched/debug.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/debug.c
+++ linux-3.0-tip/kernel/sched/debug.c
@@ -292,7 +292,7 @@ static void print_cpu(struct seq_file *m
 	P(sched_count);
 	P(sched_goidle);
 #ifdef CONFIG_SMP
-	P64(avg_idle);
+	P64(avg_event);
 #endif
 
 	P(ttwu_count);
Index: linux-3.0-tip/kernel/sched/fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/fair.c
+++ linux-3.0-tip/kernel/sched/fair.c
@@ -4605,9 +4605,7 @@ void idle_balance(int this_cpu, struct r
 	int pulled_task = 0;
 	unsigned long next_balance = jiffies + HZ;
 
-	this_rq->idle_stamp = this_rq->clock;
-
-	if (this_rq->avg_idle < sysctl_sched_migration_cost)
+	if (this_rq->avg_event < sysctl_sched_migration_cost)
 		return;
 
 	/*
@@ -4633,10 +4631,8 @@ void idle_balance(int this_cpu, struct r
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
-			this_rq->idle_stamp = 0;
+		if (pulled_task)
 			break;
-		}
 	}
 	rcu_read_unlock();
 
Index: linux-3.0-tip/kernel/sched/sched.h
===================================================================
--- linux-3.0-tip.orig/kernel/sched/sched.h
+++ linux-3.0-tip/kernel/sched/sched.h
@@ -426,8 +426,7 @@ struct rq {
 
 	u64 rt_avg;
 	u64 age_stamp;
-	u64 idle_stamp;
-	u64 avg_idle;
+	u64 avg_event;
 #endif
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
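
For reference, the "*avg += diff >> 3" in update_avg() above is an
exponential moving average with weight 1/8: each sample pulls the average
one eighth of the way toward itself.  A minimal standalone sketch of the
same arithmetic (plain userspace C; the sample deltas are invented purely
for illustration):

	#include <stdio.h>
	#include <stdint.h>

	/* Same arithmetic as the kernel's update_avg(): each new sample
	 * contributes 1/8 of its difference from the current average. */
	static void update_avg(uint64_t *avg, uint64_t sample)
	{
		int64_t diff = sample - *avg;

		*avg += diff >> 3;
	}

	int main(void)
	{
		uint64_t avg = 0;
		/* hypothetical inter-event deltas, in nanoseconds */
		uint64_t deltas[] = { 100000, 100000, 100000, 20000, 20000, 20000 };
		unsigned int i;

		for (i = 0; i < sizeof(deltas) / sizeof(deltas[0]); i++) {
			update_avg(&avg, deltas[i]);
			printf("sample %u: delta=%llu avg=%llu\n", i,
			       (unsigned long long)deltas[i],
			       (unsigned long long)avg);
		}
		return 0;
	}

The shift keeps the update branch-free and cheap enough to run on every
rq clock update.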



^ permalink raw reply	[flat|nested] 89+ messages in thread

* [patch 5/7] sched: ratelimit select_idle_sibling() for sync wakeups
  2011-11-22 14:17                       ` Mike Galbraith
                                           ` (3 preceding siblings ...)
  2011-11-22 14:22                         ` [patch 4/7] sched: convert rq->avg_idle to rq->avg_event Mike Galbraith
@ 2011-11-22 14:23                         ` Mike Galbraith
  2011-11-22 14:24                         ` [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
  2011-11-22 14:26                         ` [patch 7/7] sched: only use TTWU_QUEUE when waker/wakee CPUs do not share top level cache Mike Galbraith
  6 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:23 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner


select_idle_sibling() injures synchronous loads horribly
on processors with L3 (ala westmere) when called at high
frequency.  Cut it off at 40 kHz (25usec event rate) when
waking sync, in lieu of an inter-domain cache penalty.

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched/fair.c     |   20 ++++++++++++++++++--
 kernel/sched/features.h |    6 ++++++
 2 files changed, 24 insertions(+), 2 deletions(-)

Index: linux-3.0-tip/kernel/sched/fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/fair.c
+++ linux-3.0-tip/kernel/sched/fair.c
@@ -2643,6 +2643,21 @@ find_idlest_cpu(struct sched_group *grou
 }
 
 /*
+ * select_idle_sibling() injures synchronous loads horribly
+ * on processors with L3 (ala westmere) when called at high
+ * frequency.  Cut it off at 40 kHz (25usec event rate) when
+ * waking sync, in lieu of an inter-domain cache penalty.
+ */
+#define SIBLING_SYNC_CUTOFF_NS (NSEC_PER_SEC/40000UL)
+
+static int idle_sibling_limit(int target, int sync)
+{
+	if (!sync || !sched_feat(SIBLING_LIMIT_SYNC))
+		return 0;
+	return cpu_rq(target)->avg_event < SIBLING_SYNC_CUTOFF_NS;
+}
+
+/*
  * Try and locate an idle CPU in the sched_domain.
  */
 static int select_idle_sibling(struct task_struct *p, int target)
@@ -2790,9 +2805,10 @@ select_task_rq_fair(struct task_struct *
 
 	if (affine_sd) {
 		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
-			prev_cpu = cpu;
+			new_cpu = cpu;
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		if (!idle_sibling_limit(new_cpu, sync))
+			new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}
 
Index: linux-3.0-tip/kernel/sched/features.h
===================================================================
--- linux-3.0-tip.orig/kernel/sched/features.h
+++ linux-3.0-tip/kernel/sched/features.h
@@ -68,3 +68,9 @@ SCHED_FEAT(TTWU_QUEUE, 1)
 
 SCHED_FEAT(FORCE_SD_OVERLAP, 0)
 SCHED_FEAT(RT_RUNTIME_SHARE, 1)
+
+/*
+ * Restrict the frequency at which select_idle_sibling() may be called
+ * for synchronous wakeups.
+ */
+SCHED_FEAT(SIBLING_LIMIT_SYNC, 1)
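
For reference, the cutoff arithmetic behind the changelog's 40 kHz figure
works out as:

	NSEC_PER_SEC / 40000UL = 1000000000 / 40000 = 25000 ns = 25 usec

so idle_sibling_limit() suppresses the sibling search for a sync wakeup
whenever rq->avg_event on the target has dropped below 25 usec, i.e. the
runqueue is seeing scheduler events at more than roughly 40 thousand per
second.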



^ permalink raw reply	[flat|nested] 89+ messages in thread

* [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-22 14:17                       ` Mike Galbraith
                                           ` (4 preceding siblings ...)
  2011-11-22 14:23                         ` [patch 5/7] sched: ratelimit select_idle_sibling() for sync wakeups Mike Galbraith
@ 2011-11-22 14:24                         ` Mike Galbraith
  2011-11-23 11:57                           ` Peter Zijlstra
  2011-11-23 12:35                           ` Mike Galbraith
  2011-11-22 14:26                         ` [patch 7/7] sched: only use TTWU_QUEUE when waker/wakee CPUs do not share top level cache Mike Galbraith
  6 siblings, 2 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner


Entering nohz at high frequency eats cycles; ratelimit it
just as we do idle_balance(), and for the same reason.

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 include/linux/sched.h    |    2 ++
 kernel/sched/core.c      |    7 +++++++
 kernel/time/tick-sched.c |    2 +-
 3 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-3.0-tip/include/linux/sched.h
===================================================================
--- linux-3.0-tip.orig/include/linux/sched.h
+++ linux-3.0-tip/include/linux/sched.h
@@ -1988,8 +1988,10 @@ static inline void idle_task_exit(void)
 
 #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
 extern void wake_up_idle_cpu(int cpu);
+extern int sched_needs_cpu(int cpu);
 #else
 static inline void wake_up_idle_cpu(int cpu) { }
+static inline int sched_needs_cpu(int cpu) { return 0; }
 #endif
 
 extern unsigned int sysctl_sched_latency;
Index: linux-3.0-tip/kernel/sched/core.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/core.c
+++ linux-3.0-tip/kernel/sched/core.c
@@ -587,6 +587,13 @@ static inline bool got_nohz_idle_kick(vo
 	return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
 }
 
+#ifdef CONFIG_SMP
+int sched_needs_cpu(int cpu)
+{
+	return cpu_rq(cpu)->avg_event < sysctl_sched_migration_cost;
+}
+#endif
+
 #else /* CONFIG_NO_HZ */
 
 static inline bool got_nohz_idle_kick(void)
Index: linux-3.0-tip/kernel/time/tick-sched.c
===================================================================
--- linux-3.0-tip.orig/kernel/time/tick-sched.c
+++ linux-3.0-tip/kernel/time/tick-sched.c
@@ -352,7 +352,7 @@ void tick_nohz_stop_sched_tick(int inidl
 	} while (read_seqretry(&xtime_lock, seq));
 
 	if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
-	    arch_needs_cpu(cpu)) {
+	    arch_needs_cpu(cpu) || sched_needs_cpu(cpu)) {
 		next_jiffies = last_jiffies + 1;
 		delta_jiffies = 1;
 	} else {
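
To put a number on the threshold used above: the check is

	rq->avg_event < sysctl_sched_migration_cost   (default 500000 ns = 0.5 ms)

so sched_needs_cpu() keeps the tick alive for another jiffy
(next_jiffies = last_jiffies + 1) whenever the CPU's average inter-event
interval is under half a millisecond, on the theory that it will be needed
again almost immediately.  A system with a tuned sched_migration_cost will
of course see a different cutoff.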



^ permalink raw reply	[flat|nested] 89+ messages in thread

* [patch 7/7] sched: only use TTWU_QUEUE when waker/wakee CPUs do not share top level cache
  2011-11-22 14:17                       ` Mike Galbraith
                                           ` (5 preceding siblings ...)
  2011-11-22 14:24                         ` [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
@ 2011-11-22 14:26                         ` Mike Galbraith
  2011-11-23 12:08                           ` Peter Zijlstra
  6 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-22 14:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner


TTWU_QUEUE IPI overhead was measured to be as much as 13% of netperf TCP_RR
overhead when waking to a shared cache.  Don't IPI unless we're waking cross
cache, where it can be a winner.

Suggested by Peterz, tested by /me.

Signed-off-by: Mike Galbraith <efault@gmx.de>

---
 kernel/sched/core.c |   24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

Index: linux-3.0-tip/kernel/sched/core.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched/core.c
+++ linux-3.0-tip/kernel/sched/core.c
@@ -1479,12 +1479,34 @@ static int ttwu_activate_remote(struct t
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 #endif /* CONFIG_SMP */
 
+static int ttwu_share_cache(int this_cpu, int cpu)
+{
+#ifndef CONFIG_X86
+	struct sched_domain *sd;
+	int ret = 0;
+
+	rcu_read_lock();
+	for_each_domain(this_cpu, sd) {
+		if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
+			continue;
+
+		ret = (sd->flags & SD_SHARE_PKG_RESOURCES);
+		break;
+	}
+	rcu_read_unlock();
+
+	return ret;
+#else
+	return per_cpu(cpu_llc_id, this_cpu) == per_cpu(cpu_llc_id, cpu);
+#endif
+}
+
 static void ttwu_queue(struct task_struct *p, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+	if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* sync clocks x-cpu */
 		ttwu_queue_remote(p, cpu);
 		return;



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 3/7] sched: set skip_clock_update in yield_task_fair()
  2011-11-22 14:21                         ` [patch 3/7] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
@ 2011-11-23 11:53                           ` Peter Zijlstra
  2011-11-23 12:06                             ` Mike Galbraith
  2011-11-23 14:48                           ` Peter Zijlstra
  2011-12-06  9:51                           ` [tip:sched/core] sched: Set " tip-bot for Mike Galbraith
  2 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-23 11:53 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-22 at 15:21 +0100, Mike Galbraith wrote:
> +++ linux-3.0-tip/kernel/sched/core.c
> @@ -4547,7 +4547,13 @@ again: 

Could you add:

QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"


to your quiltrc ?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 4/7] sched: convert rq->avg_idle to rq->avg_event
  2011-11-22 14:22                         ` [patch 4/7] sched: convert rq->avg_idle to rq->avg_event Mike Galbraith
@ 2011-11-23 11:55                           ` Peter Zijlstra
  2011-11-23 12:09                             ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-23 11:55 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-22 at 15:22 +0100, Mike Galbraith wrote:
> We update rq->clock only at points of interest to the scheduler.
> Using this distance has the same effect as measuring idle time
> for idle_balance() throttling, and allows other uses as well. 

I'm not sure I follow, suppose we're happily context switching away, how
is the avg distance between context switches related to idle time?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-22 14:24                         ` [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
@ 2011-11-23 11:57                           ` Peter Zijlstra
  2011-11-23 12:35                           ` Mike Galbraith
  1 sibling, 0 replies; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-23 11:57 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-22 at 15:24 +0100, Mike Galbraith wrote:
> Entering nohz at high frequency eats cycles, ratelimit it
> just as we do idle_balance(), and for the same reason.

ideally we should have some per-arch information on how costly it
actually is to enter/leave nohz.

Also, can you confirm, or otherwise run this by whoemever reported that
power-usage regression the last time we tried this?


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 3/7] sched: set skip_clock_update in yield_task_fair()
  2011-11-23 11:53                           ` Peter Zijlstra
@ 2011-11-23 12:06                             ` Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-23 12:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-23 at 12:53 +0100, Peter Zijlstra wrote:
> On Tue, 2011-11-22 at 15:21 +0100, Mike Galbraith wrote:
> > +++ linux-3.0-tip/kernel/sched/core.c
> > @@ -4547,7 +4547,13 @@ again: 
> 
> Could you add:
> 
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
> 
> 
> to your quiltrc ?

Done, I now have a ~/.quiltrc with one whole line in it.

	Thanks,

	-Mike



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 7/7] sched: only use TTWU_QUEUE when waker/wakee CPUs do not share top level cache
  2011-11-22 14:26                         ` [patch 7/7] sched: only use TTWU_QUEUE when waker/wakee CPUs do not share top level cache Mike Galbraith
@ 2011-11-23 12:08                           ` Peter Zijlstra
  0 siblings, 0 replies; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-23 12:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Chris Mason, Dave Kleikamp


I changed this a little, no point in iterating the domains when
SD_SHARE_PKG_RESOURCES fails.

---
Subject: sched: Only use TTWU_QUEUE when waker/wakee CPUs do not share top level cache
From: Mike Galbraith <mgalbraith@suse.de>
Date: Tue, 22 Nov 2011 15:26:25 +0100

TTWU_QUEUE IPI overhead was measured to be as much as 13% of netperf TCP_RR
overhead when waking to a shared cache. Don't IPI unless we're waking cross
cache, where it can be a winner.

Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1321971985.6855.26.camel@marge.simson.net
---
 kernel/sched/core.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -1480,12 +1480,36 @@ static int ttwu_activate_remote(struct t
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 #endif /* CONFIG_SMP */
 
+static int ttwu_share_cache(int this_cpu, int cpu)
+{
+#ifndef CONFIG_X86
+	struct sched_domain *sd;
+	int ret = 0;
+
+	rcu_read_lock();
+	for_each_domain(this_cpu, sd) {
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
+			break;
+
+		if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
+			ret = 1;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
+#else
+	return per_cpu(cpu_llc_id, this_cpu) == per_cpu(cpu_llc_id, cpu);
+#endif
+}
+
 static void ttwu_queue(struct task_struct *p, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
 #if defined(CONFIG_SMP)
-	if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+	if (sched_feat(TTWU_QUEUE) && !ttwu_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* sync clocks x-cpu */
 		ttwu_queue_remote(p, cpu);
 		return;
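
A quick way to see which CPUs the x86 shortcut above will treat as
cache-sharing is to read the last-level-cache topology out of sysfs.  A
small userspace helper, assuming index3 is the LLC on the machine at hand
(true for the Westmere-class boxes discussed in this thread, but worth
double-checking elsewhere):

	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		char path[128], buf[256];
		int cpu = argc > 1 ? atoi(argv[1]) : 0;
		FILE *f;

		/* index3 is assumed to be the L3 / last-level cache */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
			 cpu);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("cpu%d shares its LLC with cpus %s", cpu, buf);
		fclose(f);
		return 0;
	}

Wakeups between CPUs on the same shared_cpu_list skip the TTWU_QUEUE IPI
with this patch and queue the task directly; everything else still takes
the remote-queue path.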


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 4/7] sched: convert rq->avg_idle to rq->avg_event
  2011-11-23 11:55                           ` Peter Zijlstra
@ 2011-11-23 12:09                             ` Mike Galbraith
  2011-11-23 12:27                               ` Peter Zijlstra
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-23 12:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-23 at 12:55 +0100, Peter Zijlstra wrote:
> On Tue, 2011-11-22 at 15:22 +0100, Mike Galbraith wrote:
> > We update rq->clock only at points of interest to the scheduler.
> > Using this distance has the same effect as measuring idle time
> > for idle_balance() throttling, and allows other uses as well. 
> 
> I'm not sure I follow, suppose we're happily context switching away, how
> is the avg distance between context switches related to idle time?

Average idle time can't be larger.

	-Mike 



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 4/7] sched: convert rq->avg_idle to rq->avg_event
  2011-11-23 12:09                             ` Mike Galbraith
@ 2011-11-23 12:27                               ` Peter Zijlstra
  2011-11-23 12:57                                 ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-23 12:27 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-23 at 13:09 +0100, Mike Galbraith wrote:
> On Wed, 2011-11-23 at 12:55 +0100, Peter Zijlstra wrote:
> > On Tue, 2011-11-22 at 15:22 +0100, Mike Galbraith wrote:
> > > We update rq->clock only at points of interest to the scheduler.
> > > Using this distance has the same effect as measuring idle time
> > > for idle_balance() throttling, and allows other uses as well. 
> > 
> > I'm not sure I follow, suppose we're happily context switching away, how
> > is the avg distance between context switches related to idle time?
> 
> Average idle time can't be larger.

True :-)

But it can be _much MUCH_ smaller. So the value is a fair upper limit on
the idle time, but has no relation to the actual idle duration.

Now this value seems to be used in 5 to throttle select_idle_sibling(),
which is again something unrelated to actual idle duration, but also
unrelated to the avg update_rq_clock() interval.

In patch 6 we use this value to guestimate if we should enter nohz;
since it's a wild overestimation of the actual idle duration it'll be
less effective and might not help much.

Also, patch 6's use of sched_migration_cost to reflect the nohz
enter/exit cost is somewhat iffy, but that's another issue.


Now I'm not saying this all isn't worth it, just saying my brain is
having difficulty seeing how it all makes sense :-)

Anyway, I picked up 1,2,3,7 and will give the missing patches another
stare a bit later.
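
To make the upper-bound point concrete with invented numbers: a CPU that
is 100% busy but context switches every 100 usec updates rq->clock about
every 100 usec, so avg_event settles near 100 usec even though the actual
idle time is zero.  The value does cap how much idle time could have fit
between events, but a small avg_event only says the CPU is busy with
something, not how long it really sat idle.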

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting
  2011-11-22 14:24                         ` [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
  2011-11-23 11:57                           ` Peter Zijlstra
@ 2011-11-23 12:35                           ` Mike Galbraith
  1 sibling, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-23 12:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner, Norbert Preining

Added CC of the person who reported that the last nohz ratelimit patch
caused increased power consumption.

Norbert: can you please test this patch to make sure it's not going to
cause you the same trouble the last one did?  I see no ill effects with
this patch, but then I didn't with the last either.

	TIA,

	-Mike

On Tue, 2011-11-22 at 15:24 +0100, Mike Galbraith wrote:
> Entering nohz at high frequency eats cycles, ratelimit it
> just as we do idle_balance(), and for the same reason.
> 
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> 
> ---
>  include/linux/sched.h    |    2 ++
>  kernel/sched/core.c      |    7 +++++++
>  kernel/time/tick-sched.c |    2 +-
>  3 files changed, 10 insertions(+), 1 deletion(-)
> 
> Index: linux-3.0-tip/include/linux/sched.h
> ===================================================================
> --- linux-3.0-tip.orig/include/linux/sched.h
> +++ linux-3.0-tip/include/linux/sched.h
> @@ -1988,8 +1988,10 @@ static inline void idle_task_exit(void)
>  
>  #if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
>  extern void wake_up_idle_cpu(int cpu);
> +extern int sched_needs_cpu(int cpu);
>  #else
>  static inline void wake_up_idle_cpu(int cpu) { }
> +static inline void sched_needs_cpu(int cpu) { }
>  #endif
>  
>  extern unsigned int sysctl_sched_latency;
> Index: linux-3.0-tip/kernel/sched/core.c
> ===================================================================
> --- linux-3.0-tip.orig/kernel/sched/core.c
> +++ linux-3.0-tip/kernel/sched/core.c
> @@ -587,6 +587,13 @@ static inline bool got_nohz_idle_kick(vo
>  	return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
>  }
>  
> +#ifdef CONFIG_SMP
> +int sched_needs_cpu(int cpu)
> +{
> +	return cpu_rq(cpu)->avg_event < sysctl_sched_migration_cost;
> +}
> +#endif
> +
>  #else /* CONFIG_NO_HZ */
>  
>  static inline bool got_nohz_idle_kick(void)
> Index: linux-3.0-tip/kernel/time/tick-sched.c
> ===================================================================
> --- linux-3.0-tip.orig/kernel/time/tick-sched.c
> +++ linux-3.0-tip/kernel/time/tick-sched.c
> @@ -352,7 +352,7 @@ void tick_nohz_stop_sched_tick(int inidl
>  	} while (read_seqretry(&xtime_lock, seq));
>  
>  	if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) ||
> -	    arch_needs_cpu(cpu)) {
> +	    arch_needs_cpu(cpu) || sched_needs_cpu(cpu)) {
>  		next_jiffies = last_jiffies + 1;
>  		delta_jiffies = 1;
>  	} else {
> 
> 



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 4/7] sched: convert rq->avg_idle to rq->avg_event
  2011-11-23 12:27                               ` Peter Zijlstra
@ 2011-11-23 12:57                                 ` Mike Galbraith
  2011-11-23 14:21                                   ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-23 12:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-23 at 13:27 +0100, Peter Zijlstra wrote:
> On Wed, 2011-11-23 at 13:09 +0100, Mike Galbraith wrote:
> > On Wed, 2011-11-23 at 12:55 +0100, Peter Zijlstra wrote:
> > > On Tue, 2011-11-22 at 15:22 +0100, Mike Galbraith wrote:
> > > > We update rq->clock only at points of interest to the scheduler.
> > > > Using this distance has the same effect as measuring idle time
> > > > for idle_balance() throttling, and allows other uses as well. 
> > > 
> > > I'm not sure I follow, suppose we're happily context switching away, how
> > > is the avg distance between context switches related to idle time?
> > 
> > Average idle time can't be larger.
> 
> True :-)
> 
> But it can be _much MUCH_ smaller. So the value is a fair upper limit on
> the idle time, but has no relation to the actual idle duration.

Yup, none.

> Now this value seems to be used in 5 to throttle select_idle_sibling(),
> which is again something unrelated to actual idle duration, but also
> unrelated to the avg update_rq_clock() interval.

Yup.  Clock update is a thing we do at sched event-of-interest, so this
seemed like a good spot to create a multi-purpose number.

> In patch 6 we use this value to guestimate if we should enter nohz,
> since its a wild over estimation of the actual idle duration it'll be
> less effective and might not hard much.
> 
> Also, patch 6's use of sched_migration_cost to reflect the nohz
> enter/exit cost is somewhat iffy, but that's another issue.

Yup.

> Now I'm not saying this all isn't worth it, just saying my brain is
> having difficulty seeing how it all makes sense :-)

They make sense only in that one cheap number generator bandaids three
owies.  It's fugly but effective :)

	-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 4/7] sched: convert rq->avg_idle to rq->avg_event
  2011-11-23 12:57                                 ` Mike Galbraith
@ 2011-11-23 14:21                                   ` Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-23 14:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-23 at 13:57 +0100, Mike Galbraith wrote:
> On Wed, 2011-11-23 at 13:27 +0100, Peter Zijlstra wrote:

> > Now I'm not saying this all isn't worth it, just saying my brain is
> > having difficulty seeing how it all makes sense :-)
> 
> They make sense only in that one cheap number generator bandaids three
> owies.  It's fugly but effective :)

Addendum:

That number represents scheduler busyness.  If we're "this" busy, it's
not worth entering nohz, there's unlikely to be enough overlap to be
worth going after at the expense of L2 misses for L3 equipped CPUs, and
we can't afford to futz around with load balancing just now.

"this" is arbitrary, but in the select_idle_sibling() case at least,
it's impossible to know what any wakee will do with the CPU, so you're
stuck with arbitrary no matter what you use to shut the thing off.

	-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 3/7] sched: set skip_clock_update in yield_task_fair()
  2011-11-22 14:21                         ` [patch 3/7] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
  2011-11-23 11:53                           ` Peter Zijlstra
@ 2011-11-23 14:48                           ` Peter Zijlstra
  2011-11-24  3:50                             ` Mike Galbraith
  2011-12-06  9:51                           ` [tip:sched/core] sched: Set " tip-bot for Mike Galbraith
  2 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-23 14:48 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-22 at 15:21 +0100, Mike Galbraith wrote:
> This is another case where we are on our way to schedule(),
> so can save a useless clock update and resulting microscopic
> vruntime update.
> 

Everytime I see one of these skip_clock_update thingies I get the idea
we should do something smarter, because its most fragile and getting it
wrong results in a subtle trainwreck..

Would something like the below help in validating this stuff?


---
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -113,12 +113,27 @@ void update_rq_clock(struct rq *rq)
 {
 	s64 delta;
 
-	if (rq->skip_clock_update > 0)
+	if (rq->skip_clock_update > 0) {
+		/*
+		 * We skipped an update even though we had a full context
+		 * switch in between, badness.
+		 */
+		if (rq->clock_update_stamp != rq->sched_switch)
+			trace_printk("lost an update!!\n");
 		return;
+	}
+
+	/*
+	 * We're updating the clock even though we didn't switch yet.
+	 */
+	if (rq->clock_update_stamp == rq->sched_switch)
+		trace_printk("too many updates!!\n");
 
 	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
 	rq->clock += delta;
 	update_rq_clock_task(rq, delta);
+
+	rq->clock_update_stamp = rq->sched_switch;
 }
 
 /*
@@ -1922,6 +1937,9 @@ static void finish_task_switch(struct rq
 #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
 	local_irq_enable();
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
+#ifdef CONFIG_SCHED_DEBUG
+	schedstat_inc(rq, sched_switch);
+#endif
 	finish_lock_switch(rq, prev);
 
 	fire_sched_in_preempt_notifiers(current);
Index: linux-2.6/kernel/sched/sched.h
===================================================================
--- linux-2.6.orig/kernel/sched/sched.h
+++ linux-2.6/kernel/sched/sched.h
@@ -374,6 +374,8 @@ struct rq {
 	unsigned char nohz_balance_kick;
 #endif
 	int skip_clock_update;
+	unsigned int clock_update_stamp;
+
 
 	/* capture load from *all* tasks on this cpu: */
 	struct load_weight load;


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 3/7] sched: set skip_clock_update in yield_task_fair()
  2011-11-23 14:48                           ` Peter Zijlstra
@ 2011-11-24  3:50                             ` Mike Galbraith
  2011-11-24 10:12                               ` Peter Zijlstra
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2011-11-24  3:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-23 at 15:48 +0100, Peter Zijlstra wrote:
> On Tue, 2011-11-22 at 15:21 +0100, Mike Galbraith wrote:
> > This is another case where we are on our way to schedule(),
> > so can save a useless clock update and resulting microscopic
> > vruntime update.
> > 
> 
> Everytime I see one of these skip_clock_update thingies I get the idea
> we should do something smarter, because its most fragile and getting it
> wrong results in a subtle trainwreck..
> 
> Would something like the below help in validating this stuff?

No, because switch itself is irrelevant to fair-beans counting.

> ---
> Index: linux-2.6/kernel/sched/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/core.c
> +++ linux-2.6/kernel/sched/core.c
> @@ -113,12 +113,27 @@ void update_rq_clock(struct rq *rq)
>  {
>  	s64 delta;
>  
> -	if (rq->skip_clock_update > 0)
> +	if (rq->skip_clock_update > 0) {
> +		/*
> +		 * We skipped an update even though we had a full context
> +		 * switch in between, badness.
> +		 */
> +		if (rq->clock_update_stamp != rq->sched_switch)
> +			trace_printk("lost an update!!\n");
>  		return;
> +	}
> +
> +	/*
> +	 * We're updating the clock even though we didn't switch yet.
> +	 */
> +	if (rq->clock_update_stamp == rq->sched_switch)
> +		trace_printk("too many updates!!\n");
>  
>  	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
>  	rq->clock += delta;
>  	update_rq_clock_task(rq, delta);
> +
> +	rq->clock_update_stamp = rq->sched_switch;
>  }
>  
>  /*
> @@ -1922,6 +1937,9 @@ static void finish_task_switch(struct rq
>  #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
>  	local_irq_enable();
>  #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
> +#ifdef CONFIG_SCHED_DEBUG
> +	schedstat_inc(rq, sched_switch);
> +#endif
>  	finish_lock_switch(rq, prev);
>  
>  	fire_sched_in_preempt_notifiers(current);
> Index: linux-2.6/kernel/sched/sched.h
> ===================================================================
> --- linux-2.6.orig/kernel/sched/sched.h
> +++ linux-2.6/kernel/sched/sched.h
> @@ -374,6 +374,8 @@ struct rq {
>  	unsigned char nohz_balance_kick;
>  #endif
>  	int skip_clock_update;
> +	unsigned int clock_update_stamp;
> +
>  
>  	/* capture load from *all* tasks on this cpu: */
>  	struct load_weight load;
> 



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 3/7] sched: set skip_clock_update in yield_task_fair()
  2011-11-24  3:50                             ` Mike Galbraith
@ 2011-11-24 10:12                               ` Peter Zijlstra
  2011-11-25  6:39                                 ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2011-11-24 10:12 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-24 at 04:50 +0100, Mike Galbraith wrote:
> On Wed, 2011-11-23 at 15:48 +0100, Peter Zijlstra wrote:
> > On Tue, 2011-11-22 at 15:21 +0100, Mike Galbraith wrote:
> > > This is another case where we are on our way to schedule(),
> > > so can save a useless clock update and resulting microscopic
> > > vruntime update.
> > > 
> > 
> > Everytime I see one of these skip_clock_update thingies I get the idea
> > we should do something smarter, because its most fragile and getting it
> > wrong results in a subtle trainwreck..
> > 
> > Would something like the below help in validating this stuff?
> 
> No, because switch itself is irrelevant to fair-beans counting.

Hmm, right, but we can keep another counter that is more relevant, like
all schedule(),ttwu() and various other calls, right?

Thing is, is that going to be less fragile than the current crap :-(
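
A rough sketch of that alternative (field and function names are made up
for illustration): bump a per-rq count from the places that actually
consume a fresh clock -- schedule(), ttwu() and friends -- and compare
it against a stamp taken at update time, like the debug patch above but
keyed to clock consumers rather than context switches:

	static inline void rq_clock_consumer(struct rq *rq)
	{
		rq->clock_consumers++;	/* from schedule(), ttwu(), ... */
	}

	/* at the end of update_rq_clock() */
	static inline void rq_clock_stamp(struct rq *rq)
	{
		if (rq->clock_update_stamp == rq->clock_consumers)
			trace_printk("clock updated twice for one event\n");
		rq->clock_update_stamp = rq->clock_consumers;
	}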

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [patch 3/7] sched: set skip_clock_update in yield_task_fair()
  2011-11-24 10:12                               ` Peter Zijlstra
@ 2011-11-25  6:39                                 ` Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2011-11-25  6:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-24 at 11:12 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-24 at 04:50 +0100, Mike Galbraith wrote:
> > On Wed, 2011-11-23 at 15:48 +0100, Peter Zijlstra wrote:
> > > On Tue, 2011-11-22 at 15:21 +0100, Mike Galbraith wrote:
> > > > This is another case where we are on our way to schedule(),
> > > > so can save a useless clock update and resulting microscopic
> > > > vruntime update.
> > > > 
> > > 
> > > Everytime I see one of these skip_clock_update thingies I get the idea
> > > we should do something smarter, because its most fragile and getting it
> > > wrong results in a subtle trainwreck..
> > > 
> > > Would something like the below help in validating this stuff?
> > 
> > No, because switch itself is irrelevant to fair-beans counting.
> 
> Hmm, right, but we can keep another counter that is more relevant, like
> all schedule(),ttwu() and various other calls, right?

Yeah, there's gotta be a way to be smarter about this.  When waking a
gaggle, we'll do useless microscopic updates and preempt checks for each
and every wakeup (not to mention waker can be preempted).  It's one
event per touched rq really, but we treat it as many.  The skip thing
just binds a couple obvious spots in a hacky way.
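
Purely for illustration (nothing like this exists in the tree, names are
hypothetical), the "one event per touched rq" notion could amount to
tagging each wakeup batch and paying for the clock update once per rq:

	static void rq_clock_update_once(struct rq *rq, u64 batch)
	{
		if (rq->clock_batch == batch)
			return;		/* already updated for this batch */
		update_rq_clock(rq);
		rq->clock_batch = batch;
	}

where every wakee in the gaggle passes the same batch id.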

> Thing is, is that going to be less fragile than the current crap :-(

I agree it's fragile.

	-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* [tip:sched/core] sched: Clean up domain traversal in select_idle_sibling()
  2011-11-17 19:08             ` Suresh Siddha
  2011-11-18 15:12               ` Peter Zijlstra
@ 2011-12-06  9:49               ` tip-bot for Suresh Siddha
  1 sibling, 0 replies; 89+ messages in thread
From: tip-bot for Suresh Siddha @ 2011-12-06  9:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, efault, suresh.b.siddha,
	tglx, mingo

Commit-ID:  77e81365e0b7d7479fc444a21cea0cd4def70b45
Gitweb:     http://git.kernel.org/tip/77e81365e0b7d7479fc444a21cea0cd4def70b45
Author:     Suresh Siddha <suresh.b.siddha@intel.com>
AuthorDate: Thu, 17 Nov 2011 11:08:23 -0800
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 08:51:25 +0100

sched: Clean up domain traversal in select_idle_sibling()

Instead of going through the scheduler domain hierarchy multiple times
(for giving priority to an idle core over an idle SMT sibling in a busy
core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
flag and traverse the domain hierarchy down till we find an idle group.

This cleanup also addresses an issue reported by Mike where the recent
changes returned the busy thread even in the presence of an idle SMT
sibling in single socket platforms.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1321556904.15339.25.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/fair.c  |   38 +++++++++++++++++++++++++-------------
 kernel/sched/sched.h |    2 ++
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c62e2b..96a9ece 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2644,6 +2644,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2653,7 +2675,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i, smt = 0;
+	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2673,19 +2695,9 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-again:
-	for_each_domain(target, sd) {
-		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
-			continue;
-
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
-			break;
-		}
 
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
 			if (!cpumask_intersects(sched_group_cpus(sg),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c2e7802..8715055 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -501,6 +501,8 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [tip:sched/core] sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles
  2011-11-22 14:18                         ` [patch 1/7] " Mike Galbraith
@ 2011-12-06  9:50                           ` tip-bot for Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: tip-bot for Mike Galbraith @ 2011-12-06  9:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mgalbraith, hpa, mingo, a.p.zijlstra, efault, tglx, mingo

Commit-ID:  76854c7e8f3f4172fef091e78d88b3b751463ac6
Gitweb:     http://git.kernel.org/tip/76854c7e8f3f4172fef091e78d88b3b751463ac6
Author:     Mike Galbraith <mgalbraith@suse.de>
AuthorDate: Tue, 22 Nov 2011 15:18:24 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 08:51:26 +0100

sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles

rt.nr_cpus_allowed is always available, use it to bail from select_task_rq()
when only one cpu can be used, and saves some cycles for pinned tasks.

See the line marked with '*' below:

  # taskset -c 3 pipe-test

   PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
------------------------------------------------------------------------------------------------

             Virgin                                    Patched
             samples  pcnt function                    samples  pcnt function
             _______ _____ ___________________________ _______ _____ ___________________________

             2880.00 10.2% __schedule                  3136.00 11.3% __schedule
             1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
             1458.00  5.2% system_call                 1534.00  5.5% system_call
             1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
             1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
             1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
             1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
              872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
              687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
              682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
              643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
              617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
              612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
              596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
              542.00  1.9% sysret_check                 559.00  2.0% sysret_check
              467.00  1.7% fget_light                   472.00  1.7% fget_light
              462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
              437.00  1.5% vfs_write                    442.00  1.6% vfs_write
              431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
*             413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
              386.00  1.4% update_curr                  402.00  1.4% update_curr
              385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
              377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
              369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
              360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
              342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/fair.c |    3 +++
 kernel/sched/rt.c   |    3 +++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96a9ece..8e534a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2744,6 +2744,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
+	if (p->rt.nr_cpus_allowed == 1)
+		return prev_cpu;
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
 			want_affine = 1;
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 023b355..58a4844 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1200,6 +1200,9 @@ select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
 
 	cpu = task_cpu(p);
 
+	if (p->rt.nr_cpus_allowed == 1)
+		goto out;
+
 	/* For anything but wake ups, just return the task_cpu */
 	if (sd_flag != SD_BALANCE_WAKE && sd_flag != SD_BALANCE_FORK)
 		goto out;

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [tip:sched/core] sched: Set skip_clock_update in yield_task_fair()
  2011-11-22 14:21                         ` [patch 3/7] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
  2011-11-23 11:53                           ` Peter Zijlstra
  2011-11-23 14:48                           ` Peter Zijlstra
@ 2011-12-06  9:51                           ` tip-bot for Mike Galbraith
  2 siblings, 0 replies; 89+ messages in thread
From: tip-bot for Mike Galbraith @ 2011-12-06  9:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mgalbraith, hpa, mingo, a.p.zijlstra, efault, tglx, mingo

Commit-ID:  916671c08b7808aebec87cc56c85788e665b3c6b
Gitweb:     http://git.kernel.org/tip/916671c08b7808aebec87cc56c85788e665b3c6b
Author:     Mike Galbraith <mgalbraith@suse.de>
AuthorDate: Tue, 22 Nov 2011 15:21:26 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 09:06:24 +0100

sched: Set skip_clock_update in yield_task_fair()

This is another case where we are on our way to schedule(),
so can save a useless clock update and resulting microscopic
vruntime update.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1321971686.6855.18.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/core.c |    7 +++++++
 kernel/sched/fair.c |    6 ++++++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca8fd44..db313c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4547,6 +4547,13 @@ again:
 		 */
 		if (preempt && rq != p_rq)
 			resched_task(p_rq->curr);
+	} else {
+		/*
+		 * We might have set it in task_yield_fair(), but are
+		 * not going to schedule(), so don't want to skip
+		 * the next update.
+		 */
+		rq->skip_clock_update = 0;
 	}
 
 out:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e534a0..81ccb81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3075,6 +3075,12 @@ static void yield_task_fair(struct rq *rq)
 		 * Update run-time statistics of the 'current'.
 		 */
 		update_curr(cfs_rq);
+		/*
+		 * Tell update_rq_clock() that we've just updated,
+		 * so we don't do microscopic update in schedule()
+		 * and double the fastpath cost.
+		 */
+		 rq->skip_clock_update = 1;
 	}
 
 	set_skip_buddy(se);

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [tip:sched/core] sched: Save some hrtick_start_fair cycles
  2011-11-22 14:20                         ` [patch 2/7] sched: save some hrtick_start_fair cycles Mike Galbraith
@ 2011-12-06 20:20                           ` tip-bot for Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: tip-bot for Mike Galbraith @ 2011-12-06 20:20 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mgalbraith, hpa, mingo, a.p.zijlstra, efault, tglx, mingo

Commit-ID:  b39e66eaf9c573f38133e894256caeaf9fd2a528
Gitweb:     http://git.kernel.org/tip/b39e66eaf9c573f38133e894256caeaf9fd2a528
Author:     Mike Galbraith <mgalbraith@suse.de>
AuthorDate: Tue, 22 Nov 2011 15:20:07 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 6 Dec 2011 20:51:20 +0100

sched: Save some hrtick_start_fair cycles

hrtick_start_fair() shows up in profiles even when disabled.

v3.0.6

taskset -c 3 pipe-test

   PerfTop:     997 irqs/sec  kernel:89.5%  exact:  0.0% [1000Hz cycles],  (all, CPU: 3)
------------------------------------------------------------------------------------------------

             Virgin                                    Patched
             samples  pcnt function                    samples  pcnt function
             _______ _____ ___________________________ _______ _____ ___________________________

             2880.00 10.2% __schedule                  3136.00 11.3% __schedule
             1634.00  5.8% pipe_read                   1615.00  5.8% pipe_read
             1458.00  5.2% system_call                 1534.00  5.5% system_call
             1382.00  4.9% _raw_spin_lock_irqsave      1412.00  5.1% _raw_spin_lock_irqsave
             1202.00  4.3% pipe_write                  1255.00  4.5% copy_user_generic_string
             1164.00  4.1% copy_user_generic_string    1241.00  4.5% __switch_to
             1097.00  3.9% __switch_to                  929.00  3.3% mutex_lock
              872.00  3.1% mutex_lock                   846.00  3.0% mutex_unlock
              687.00  2.4% mutex_unlock                 804.00  2.9% pipe_write
              682.00  2.4% native_sched_clock           713.00  2.6% native_sched_clock
              643.00  2.3% system_call_after_swapgs     653.00  2.3% _raw_spin_unlock_irqrestore
              617.00  2.2% sched_clock_local            633.00  2.3% fsnotify
              612.00  2.2% fsnotify                     605.00  2.2% sched_clock_local
              596.00  2.1% _raw_spin_unlock_irqrestore  593.00  2.1% system_call_after_swapgs
              542.00  1.9% sysret_check                 559.00  2.0% sysret_check
              467.00  1.7% fget_light                   472.00  1.7% fget_light
              462.00  1.6% finish_task_switch           461.00  1.7% finish_task_switch
              437.00  1.5% vfs_write                    442.00  1.6% vfs_write
              431.00  1.5% do_sync_write                428.00  1.5% do_sync_write
              413.00  1.5% select_task_rq_fair          404.00  1.5% _raw_spin_lock_irq
              386.00  1.4% update_curr                  402.00  1.4% update_curr
              385.00  1.4% rw_verify_area               389.00  1.4% do_sync_read
              377.00  1.3% _raw_spin_lock_irq           378.00  1.4% vfs_read
              369.00  1.3% do_sync_read                 340.00  1.2% pipe_iov_copy_from_user
              360.00  1.3% vfs_read                     316.00  1.1% __wake_up_sync_key
*             342.00  1.2% hrtick_start_fair            313.00  1.1% __wake_up_common

Signed-off-by: Mike Galbraith <efault@gmx.de>
[ fixed !CONFIG_SCHED_HRTICK borkage ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1321971607.6855.17.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/fair.c  |    7 ++++---
 kernel/sched/sched.h |    7 +++++++
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65a6f8b..4174338 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2137,7 +2137,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 
 	WARN_ON(task_rq(p) != rq);
 
-	if (hrtick_enabled(rq) && cfs_rq->nr_running > 1) {
+	if (cfs_rq->nr_running > 1) {
 		u64 slice = sched_slice(cfs_rq, se);
 		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
 		s64 delta = slice - ran;
@@ -2168,7 +2168,7 @@ static void hrtick_update(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 
-	if (curr->sched_class != &fair_sched_class)
+	if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
 		return;
 
 	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
@@ -3031,7 +3031,8 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 	} while (cfs_rq);
 
 	p = task_of(se);
-	hrtick_start_fair(rq, p);
+	if (hrtick_enabled(rq))
+		hrtick_start_fair(rq, p);
 
 	return p;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91810f0..d88545c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -885,6 +885,13 @@ static inline int hrtick_enabled(struct rq *rq)
 
 void hrtick_start(struct rq *rq, u64 delay);
 
+#else
+
+static inline int hrtick_enabled(struct rq *rq)
+{
+	return 0;
+}
+
 #endif /* CONFIG_SCHED_HRTICK */
 
 #ifdef CONFIG_SMP

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-18 15:14                 ` Mike Galbraith
                                     ` (5 preceding siblings ...)
  2011-11-18 15:23                   ` [patch 6/6] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
@ 2012-02-20 14:41                   ` Peter Zijlstra
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
  2012-02-20 18:14                     ` Mike Galbraith
  6 siblings, 2 replies; 89+ messages in thread
From: Peter Zijlstra @ 2012-02-20 14:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:

> ---
>  kernel/sched_fair.c |   10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> Index: linux-3.0-tip/kernel/sched_fair.c
> ===================================================================
> --- linux-3.0-tip.orig/kernel/sched_fair.c
> +++ linux-3.0-tip/kernel/sched_fair.c
> @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
>  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
>  			if (idle_cpu(i)) {
>  				target = i;
> +				if (sd->flags & SD_SHARE_CPUPOWER)
> +					continue;
>  				break;
>  			}
>  		}
> -
> -		/*
> -		 * Lets stop looking for an idle sibling when we reached
> -		 * the domain that spans the current cpu and prev_cpu.
> -		 */
> -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> -			break;
>  	}
>  	rcu_read_unlock();

Mike, Suresh, did we ever get this sorted? I was looking at
select_idle_sibling() and it looks like we dropped this.

Also, did anybody ever get an answer from a HW guy on why sharing stuff
over SMT threads is so much worse than sharing it over proper cores? Its
not like this workload actually does anything concurrently.

I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 14:41                   ` sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
@ 2012-02-20 15:03                     ` Srivatsa Vaddagiri
  2012-02-20 18:25                       ` Mike Galbraith
  2012-02-20 18:14                     ` Mike Galbraith
  1 sibling, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-20 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]:

> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> 
> > ---
> >  kernel/sched_fair.c |   10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> >  			if (idle_cpu(i)) {
> >  				target = i;
> > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > +					continue;
> >  				break;
> >  			}
> >  		}
> > -
> > -		/*
> > -		 * Lets stop looking for an idle sibling when we reached
> > -		 * the domain that spans the current cpu and prev_cpu.
> > -		 */
> > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > -			break;
> >  	}
> >  	rcu_read_unlock();
> 
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.
> 
> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores? Its
> not like this workload actually does anything concurrently.
> 
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

From a quick scan of that code, it seems to prefer selecting an idle cpu
in the same cache domain (vs selecting prev_cpu in absence of a core
that is fully idle). 

I can give that a try for my benchmark and see how much it helps. My
suspicion is it will not fully solve the problem I have on hand.

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 14:41                   ` sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
@ 2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
                                         ` (2 more replies)
  1 sibling, 3 replies; 89+ messages in thread
From: Mike Galbraith @ 2012-02-20 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 15:41 +0100, Peter Zijlstra wrote: 
> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> 
> > ---
> >  kernel/sched_fair.c |   10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> >  			if (idle_cpu(i)) {
> >  				target = i;
> > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > +					continue;
> >  				break;
> >  			}
> >  		}
> > -
> > -		/*
> > -		 * Lets stop looking for an idle sibling when we reached
> > -		 * the domain that spans the current cpu and prev_cpu.
> > -		 */
> > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > -			break;
> >  	}
> >  	rcu_read_unlock();
> 
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.

I thought this was pretty much sorted.  We want to prefer core over
sibling, because on at least some modern CPUs with L3, siblings suck
rocks.

> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores?

No.  My numbers on westmere indicated to me that siblings do not share
L2, making them fairly worthless.  Hard facts we never got.

> Its
> not like this workload actually does anything concurrently.
> 
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

I really really need to find time to do systematic mainline testing.

Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
Maybe that has changed, but I doubt it.  (general aside: testing with a
bloated distro config is a big mistake)

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
@ 2012-02-20 18:15                       ` Peter Zijlstra
  2012-02-20 19:07                       ` Peter Zijlstra
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 89+ messages in thread
From: Peter Zijlstra @ 2012-02-20 18:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> I thought this was pretty much sorted.  We want to prefer core over
> sibling, because on at least some modern CPUs with L3, siblings suck
> rocks. 

Yeah, I've since figured out how it's supposed to (and supposedly does) work.
Suresh was a bit too clever and forgot to put a comment in since clearly
it was obvious at the time ;-)

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
@ 2012-02-20 18:25                       ` Mike Galbraith
  2012-02-21  0:06                         ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2012-02-20 18:25 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Mon, 2012-02-20 at 20:33 +0530, Srivatsa Vaddagiri wrote: 
> * Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]:
> 
> > On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> > 
> > > ---
> > >  kernel/sched_fair.c |   10 ++--------
> > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > > 
> > > Index: linux-3.0-tip/kernel/sched_fair.c
> > > ===================================================================
> > > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > > +++ linux-3.0-tip/kernel/sched_fair.c
> > > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> > >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> > >  			if (idle_cpu(i)) {
> > >  				target = i;
> > > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > > +					continue;
> > >  				break;
> > >  			}
> > >  		}
> > > -
> > > -		/*
> > > -		 * Lets stop looking for an idle sibling when we reached
> > > -		 * the domain that spans the current cpu and prev_cpu.
> > > -		 */
> > > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > -			break;
> > >  	}
> > >  	rcu_read_unlock();
> > 
> > Mike, Suresh, did we ever get this sorted? I was looking at
> > select_idle_sibling() and it looks like we dropped this.
> > 
> > Also, did anybody ever get an answer from a HW guy on why sharing stuff
> > over SMT threads is so much worse than sharing it over proper cores? Its
> > not like this workload actually does anything concurrently.
> > 
> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> 
> From a quick scan of that code, it seems to prefer selecting an idle cpu
> in the same cache domain (vs selecting prev_cpu in absence of a core
> that is fully idle).

Yes, that was the sole purpose of select_idle_sibling() from square one.
If you can mobilize a CPU without eating cache penalty, this is most
excellent for load ramp-up.  The gain is huge over affine wakeup if
there is any overlap to regain, ie it's not a 100% synchronous load.

> I can give that a try for my benchmark and see how much it helps. My
> suspicion is it will not fully solve the problem I have on hand.

I doubt it will either.  Your problem is when it doesn't succeed, but
you have an idle core available in another domain.  That's a whole
different ball game.  Yeah, you can reap benefit by doing wakeup
balancing, but you'd better look very closely at the cost.  I haven't
been able to do that lately, so dunno what cost is in the here and now,
but it used to be _way_ too expensive to consider, just as unrestricted
idle balancing is, or high frequency load balancing in general is.

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
@ 2012-02-20 19:07                       ` Peter Zijlstra
  2012-02-21  5:43                         ` Mike Galbraith
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2012-02-20 19:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.

Right, I thought I remembered some such, you could see it on wakeup
heavy things like pipe-bench and that java msg passing thing, right?



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:25                       ` Mike Galbraith
@ 2012-02-21  0:06                         ` Srivatsa Vaddagiri
  2012-02-21  6:37                           ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  0:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-20 19:25:43]:

> > I can give that a try for my benchmark and see how much it helps. My
> > suspicion is it will not fully solve the problem I have on hand.
> 
> I doubt it will either.  Your problem is when it doesn't succeed, but
> you have an idle core available in another domain.

fwiw the patch I had sent does a wakeup balance within prev_cpu's
cache_domain (and not outside). It handles the case where we don't have
any idle cpu/core within prev_cpu's cache domain, in which case we look
for next best thing (least loaded cpu). I did see good numbers with that
(for both my benchmark and sysbench).
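
That patch isn't included in this thread; a rough sketch of the idea as
described (the helper name is made up, and the caller is assumed to hold
rcu_read_lock() as select_task_rq_fair() does):

	static int wake_balance_llc(struct task_struct *p, int prev_cpu)
	{
		struct sched_domain *sd;
		unsigned long min_load = ULONG_MAX;
		int cpu, best = prev_cpu;

		sd = highest_flag_domain(prev_cpu, SD_SHARE_PKG_RESOURCES);
		if (!sd)
			return prev_cpu;

		for_each_cpu_and(cpu, sched_domain_span(sd), tsk_cpus_allowed(p)) {
			unsigned long load = cpu_rq(cpu)->load.weight;

			if (idle_cpu(cpu))
				return cpu;

			if (load < min_load) {
				min_load = load;
				best = cpu;
			}
		}

		return best;
	}

i.e. stay within prev_cpu's cache domain, prefer an idle cpu there, and
otherwise settle for the least loaded cpu rather than going outside.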

More on this later in the day ..

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 19:07                       ` Peter Zijlstra
@ 2012-02-21  5:43                         ` Mike Galbraith
  2012-02-21  8:32                           ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2012-02-21  5:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
> 
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

Yeah, it beat up switch heavy buddies pretty bad.

-Mike



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  0:06                         ` Srivatsa Vaddagiri
@ 2012-02-21  6:37                           ` Mike Galbraith
  2012-02-21  8:09                             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2012-02-21  6:37 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:

> fwiw the patch I had sent does a wakeup balance within prev_cpu's
> cache_domain (and not outside).

Yeah, the testing I did was turn on the flag, and measure.  With single
domain scans, maybe select_idle_sibling() could just go away.  It ain't
free either.

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  6:37                           ` Mike Galbraith
@ 2012-02-21  8:09                             ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  8:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-21 07:37:14]:

> On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:
> 
> > fwiw the patch I had sent does a wakeup balance within prev_cpu's
> > cache_domain (and not outside).
> 
> Yeah, the testing I did was turn on the flag, and measure.  With single
> domain scans, maybe select_idle_sibling() could just go away.

Yeah that's what I was thinking. Essentially find_idlest_group/cpu
should accomplish that job pretty much for us ...

> It ain't free either.

Will find out how much difference it makes ..

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  5:43                         ` Mike Galbraith
@ 2012-02-21  8:32                           ` Srivatsa Vaddagiri
  2012-02-21  9:21                             ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  8:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]:

> On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > Maybe that has changed, but I doubt it.
> > 
> > Right, I thought I remembered some such, you could see it on wakeup
> > heavy things like pipe-bench and that java msg passing thing, right?
> 
> Yeah, it beat up switch heavy buddies pretty bad.

Do you have pointer to the java benchmark? Also is pipe-bench the same
as the one described below?

http://freecode.com/projects/pipebench

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  8:32                           ` Srivatsa Vaddagiri
@ 2012-02-21  9:21                             ` Mike Galbraith
  2012-02-21 10:37                               ` Peter Zijlstra
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2012-02-21  9:21 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2012-02-21 at 14:02 +0530, Srivatsa Vaddagiri wrote: 
> * Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]:
> 
> > On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> > > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > > Maybe that has changed, but I doubt it.
> > > 
> > > Right, I thought I remembered some such, you could see it on wakeup
> > > heavy things like pipe-bench and that java msg passing thing, right?
> > 
> > Yeah, it beat up switch heavy buddies pretty bad.
> 
> Do you have pointer to the java benchmark? Also is pipe-bench the same
> as the one described below?

I use vmark, find it _somewhat_ useful.  Not a lovely benchmark, but it
is highly affinity sensitive, and switches heftily.  I don't put much
value on it though, too extreme for me, but it is a ~decent indicator.

There are no doubt _lots_ better than vmark for java stuff.

I toss a variety pack at the box in obsessive-compulsive man mode when
testing.  Which benchmarks doesn't matter much, just need to be wide
spectrum and consistent.

> http://freecode.com/projects/pipebench

No, I use Ingo's pipe-test, but that to measure fastpath overhead.

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  9:21                             ` Mike Galbraith
@ 2012-02-21 10:37                               ` Peter Zijlstra
  2012-02-21 14:58                                 ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2012-02-21 10:37 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Srivatsa Vaddagiri, Suresh Siddha, linux-kernel, Ingo Molnar,
	Paul Turner

On Tue, 2012-02-21 at 10:21 +0100, Mike Galbraith wrote:

> I use vmark, find it _somewhat_ useful.  Not a lovely benchmark, but it
> is highly affinity sensitive, and switches heftily.  I don't put much
> value on it though, too extreme for me, but it is a ~decent indicator.
> 
> There are no doubt _lots_ better than vmark for java stuff.
> 
> I toss a variety pack at the box in obsessive-compulsive man mode when
> testing.  Which benchmarks doesn't matter much, just need to be wide
> spectrum and consistent.

http://www.volano.com/benchmarks.html

> No, I use Ingo's pipe-test, but that to measure fastpath overhead.

http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test.c

Also I think it lives as: perf bench sched pipe

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21 10:37                               ` Peter Zijlstra
@ 2012-02-21 14:58                                 ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-21 11:37:23]:

> http://www.volano.com/benchmarks.html

For some reason, volanomark stops working (client hangs waiting for data
to arrive on socket) in 3rd iteration when run on latest tip. I don't see that 
with 2.6.32 based kernel though. Checking which commit may have caused
this ..

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
  2012-02-20 19:07                       ` Peter Zijlstra
@ 2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
                                           ` (2 more replies)
  2 siblings, 3 replies; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 10:49 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:

> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> 
> I really really need to find time to do systematic mainline testing.
> 
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.  (general aside: testing with a
> bloated distro config is a big mistake)

I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
at SMT/MC domains.

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at 6241cc8)
Java	: OpenJDK 1.6.0_20
Volano  : 2_9_0

Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
SMT/MC domains.

				Average score	std. dev

SD_BALANCE_WAKE disabled	369459.750	4825.046
SD_BALANCE_WAKE enabled		379070.500	379070.5

I am going to try pipe benchmark next. Do you have suggestions for any other 
benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in 
SMT/MC domains?

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
@ 2012-02-23 11:19                         ` Ingo Molnar
  2012-02-23 12:18                           ` Srivatsa Vaddagiri
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
  2012-02-23 11:21                         ` Mike Galbraith
  2 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2012-02-23 11:19 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:
> 
> > > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> > 
> > I really really need to find time to do systematic mainline testing.
> > 
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.  (general aside: testing with a
> > bloated distro config is a big mistake)
> 
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
> at SMT/MC domains.
> 
> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)
> Java	: OpenJDK 1.6.0_20
> Volano  : 2_9_0
> 
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
> 
> 				Average score	std. dev
> 
> SD_BALANCE_WAKE disabled	369459.750	4825.046
> SD_BALANCE_WAKE enabled		379070.500	379070.5
> 
> I am going to try pipe benchmark next. Do you have suggestions 
> for any other benchmarks I should try to see the effect of 
> SD_BALANCE_WAKE turned on in SMT/MC domains?

sysbench is one of the best ones punishing bad scheduler 
balancing mistakes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
@ 2012-02-23 11:20                         ` Srivatsa Vaddagiri
  2012-02-23 11:26                           ` Ingo Molnar
  2012-02-23 11:21                         ` Mike Galbraith
  2 siblings, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 11:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-02-23 16:19:59]:

> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.  (general aside: testing with a
> > bloated distro config is a big mistake)
> 
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE
> at SMT/MC domains.

Ditto with pipe benchmark.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)

"perf bench sched pipe" was run 10 times and average ops/sec score
along with std. dev is noted below.


		SD_BALANCE_WAKE		SD_BALANCE_WAKE
		 disabled		enabled

Avg. score	108984.900 		111844.300 (+2.6%)
std dev		20383.457		21668.604

Note:

SD_BALANCE_WAKE for SMT/MC domain was enabled by writing into appropriate 
flags file present in /proc:

	# cd /proc/sys/kernel/sched_domain/
	# for i in `find . -name flags | grep domain1`; do echo 4671 > $i; done
	# for i in `find . -name flags | grep domain0`; do echo 703 > $i; done

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
@ 2012-02-23 11:21                         ` Mike Galbraith
  2012-02-25  6:54                           ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2012-02-23 11:21 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2012-02-23 at 16:19 +0530, Srivatsa Vaddagiri wrote:

> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
> at SMT/MC domains.

That's an unexpected (but welcome) result.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)
> Java	: OpenJDK 1.6.0_20
> Volano  : 2_9_0
> 
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
> 
> 				Average score	std. dev
> 
> SD_BALANCE_WAKE disabled	369459.750	4825.046
> SD_BALANCE_WAKE enabled		379070.500	379070.5
> 
> I am going to try pipe benchmark next. Do you have suggestions for any other 
> benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in 
> SMT/MC domains?

Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
heavy should tell the tale.

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
@ 2012-02-23 11:26                           ` Ingo Molnar
  2012-02-23 11:32                             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 89+ messages in thread
From: Ingo Molnar @ 2012-02-23 11:26 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> "perf bench sched pipe" was run 10 times and average ops/sec 
> score alongwith std. dev is noted as below.
> 
> 
> 		SD_BALANCE_WAKE		SD_BALANCE_WAKE
> 		 disabled		enabled
> 
> Avg. score	108984.900 		111844.300 (+2.6%)
> std dev		20383.457		21668.604

pro perf tip of the day: did you know that it's possible to run:

  perf stat --repeat 10 --null perf bench sched pipe

and get meaningful, normalized stddev calculations for free:

       5.486491487 seconds time elapsed                                          ( +-  5.50% )

the percentage at the end shows the level of standard deviation.

You can add "--sync" as well, which will cause perf stat to call 
sync() - this gives extra stability of individual iterations and 
makes sure all IO cost is accounted to the right run.
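
Putting the two together for the runs above would be something like:

  perf stat --sync --repeat 10 --null perf bench sched pipe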

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:26                           ` Ingo Molnar
@ 2012-02-23 11:32                             ` Srivatsa Vaddagiri
  2012-02-23 16:17                               ` Ingo Molnar
  0 siblings, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 11:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]:

> pro perf tip of the day: did you know that it's possible to run:
> 
>   perf stat --repeat 10 --null perf bench sched pipe

Ah yes ..had forgotten that! What is the recommended # of iterations 
to run?

> and get meaningful, normalized stddev calculations for free:
> 
>        5.486491487 seconds time elapsed                                          ( +-  5.50% )
> 
> the percentage at the end shows the level of standard deviation.
> 
> You can add "--sync" as well, which will cause perf stat to call 
> sync() - this gives extra stability of individual iterations and 
> makes sure all IO cost is accounted to the right run.

Ok ..thanks for the 'pro' tip!! Will use it in my subsequent runs!

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:19                         ` Ingo Molnar
@ 2012-02-23 12:18                           ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 12:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-02-23 12:19:29]:

> sysbench is one of the best ones for punishing bad scheduler 
> balancing mistakes.

Here are the sysbench OLTP scores on the same machine, i.e.:

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at 6241cc8)
sysbench : 0.4.12

sysbench was run with 16 threads as:

./sysbench --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=500000 --mysql-socket=/var/lib/mysql/mysql.sock --oltp-read-only --mysql-user=root --mysql-password=blah run

sysbench was run 5 times with fs-cache being purged before each run
(echo 3 > /proc/sys/vm/drop_caches).

Average of 5 runs along with % std. dev. is noted for various OLTP stats

				SD_BALANCE_WAKE		SD_BALANCE_WAKE	
				disabled		enabled

transactions (per sec)		4833.826 (+- 0.75%)	4837.354 (+- 1%)
read/write requests (per sec)	67673.580 (+- 0.75%)	67722.960 (+- 1%)
other ops (per sec)		9667.654 (+- 0.75%)	9674.710 (+- 1%)

There is a minor improvement seen when SD_BALANCE_WAKE is enabled at the SMT/MC
domains, and no degradation is observed with it enabled.

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:32                             ` Srivatsa Vaddagiri
@ 2012-02-23 16:17                               ` Ingo Molnar
  0 siblings, 0 replies; 89+ messages in thread
From: Ingo Molnar @ 2012-02-23 16:17 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]:
> 
> > pro perf tip of the day: did you know that it's possible to run:
> > 
> >   perf stat --repeat 10 --null perf bench sched pipe
> 
> Ah yes ..had forgotten that! What is the recommended # of 
> iterations to run?

Well, I start at 3 or 5 and sometimes increase it to get a 
meaningful stddev - i.e. one that is smaller than half of the 
measured effect. (for cases where I expect an improvement and 
want to measure it.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:21                         ` Mike Galbraith
@ 2012-02-25  6:54                           ` Srivatsa Vaddagiri
  2012-02-25  8:30                             ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-25  6:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:

> Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
> > heavy should tell the tale.

Here are some tbench numbers:

Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
Kernel  : tip (HEAD at ebe97fa)
dbench  : v4.0

One tbench server/client pair was run on same host 5 times (with
fs-cache being purged each time) and avg of 5 run for various cases
noted below:

Case A : HT enabled (24 logical CPUs)

Thr'put	 : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put	 : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
Thr'put	 : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Case B : HT disabled (12 logical CPUs)

Thr'put	 : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put	 : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
Thr'put	 : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Observations:

a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess 
  reflects the cost of waking to a cold L2 cache.

b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

IMO we need to detect tbench-type paired wakeups as the synchronous case, in
which case blindly wake the task up on cur_cpu (as the cost of an L2 cache miss
could outweigh the benefit of any reduced scheduling latencies).

IOW, select_task_rq_fair() needs to be given a better hint as to whether the L2
cache has been made warm by someone (an interrupt handler or a producer
task), in which case the (consumer) task needs to be woken in the same L2 cache
domain (i.e. on cur_cpu itself)?
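
To make that concrete, a toy model of the placement decision being described
might look like this (purely illustrative; the hint and helper names are made
up and this is not existing kernel code):

#include <stdbool.h>

/*
 * Toy model: if the wakeup looks like one half of a synchronous
 * producer/consumer pair -- the waker has just warmed the L2 and is
 * about to block -- place the wakee on the waking CPU and accept the
 * latency; otherwise keep the existing prev_cpu / idle-sibling policy.
 */
static int choose_wakeup_cpu(int cur_cpu, int prev_cpu, bool l2_warm_hint,
			     int (*idle_sibling_search)(int cpu))
{
	if (l2_warm_hint)
		return cur_cpu;			/* L2 is warm right here */

	return idle_sibling_search(prev_cpu);	/* today's behaviour */
}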

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-25  6:54                           ` Srivatsa Vaddagiri
@ 2012-02-25  8:30                             ` Mike Galbraith
  2012-02-27 22:11                               ` Suresh Siddha
  0 siblings, 1 reply; 89+ messages in thread
From: Mike Galbraith @ 2012-02-25  8:30 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Sat, 2012-02-25 at 12:24 +0530, Srivatsa Vaddagiri wrote: 
> * Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:
> 
> > Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
> > > heavy should tell the tale.
> 
> Here are some tbench numbers:
> 
> Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
> Kernel  : tip (HEAD at ebe97fa)
> dbench  : v4.0
> 
> One tbench server/client pair was run on same host 5 times (with
> fs-cache being purged each time) and avg of 5 run for various cases
> noted below:
> 
> Case A : HT enabled (24 logical CPUs)
> 
> Thr'put	 : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put	 : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
> Thr'put	 : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> 
> Case B : HT disabled (12 logical CPUs)
> 
> Thr'put	 : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put	 : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
> Thr'put	 : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> 
> Observations:
> 
> a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess 
>   reflects the cost of waking to a cold L2 cache.
> 
> b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

I haven't done a lot of testing, but yeah, the little I have doesn't
show SD_BALANCE_WAKE making much difference on single socket boxen.

> IMO we need to detect tbench-type paired wakeups as the synchronous case, in
> which case blindly wake the task up on cur_cpu (as the cost of an L2 cache miss
> could outweigh the benefit of any reduced scheduling latencies).
> 
> IOW, select_task_rq_fair() needs to be given a better hint as to whether the L2
> cache has been made warm by someone (an interrupt handler or a producer
> task), in which case the (consumer) task needs to be woken in the same L2 cache
> domain (i.e. on cur_cpu itself)?

My less rotund config shows the L2 penalty decidedly more prominently.
We used to have avg_overlap as a synchronous wakeup hint, but it was
broken by preemption and whatnot, got the axe to recover some cycles.  A
reliable and dirt cheap replacement would be a good thing to have.
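
For illustration, a dirt-cheap hint of that sort could look roughly like the
user-space toy model below (names and the threshold are made up; this is not
the old avg_overlap code, just a sketch of the idea):

#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model: each task keeps an EWMA of how long it kept running after
 * waking somebody up.  A consistently tiny overlap on both waker and
 * wakee suggests a synchronous producer/consumer pair, which is better
 * placed on the waker's (cache hot) CPU.
 */
struct wake_hint {
	uint64_t wakeup_stamp;	/* ns timestamp of the last wakeup issued */
	uint64_t avg_overlap;	/* EWMA of run time after that wakeup */
};

static const uint64_t overlap_threshold_ns = 500000;	/* made-up value */

static void note_wakeup(struct wake_hint *waker, uint64_t now_ns)
{
	waker->wakeup_stamp = now_ns;
}

static void note_block(struct wake_hint *waker, uint64_t now_ns)
{
	uint64_t sample = now_ns - waker->wakeup_stamp;

	/* cheap EWMA: avg = 7/8 * avg + 1/8 * sample, integer math only */
	waker->avg_overlap -= waker->avg_overlap >> 3;
	waker->avg_overlap += sample >> 3;
}

static bool wakeup_is_synchronous(const struct wake_hint *waker,
				  const struct wake_hint *wakee)
{
	return waker->avg_overlap < overlap_threshold_ns &&
	       wakee->avg_overlap < overlap_threshold_ns;
}

The update hooks would have to run where the waker issues a wakeup and where
it subsequently blocks, which is the cheap part; the hard part, as noted above,
is keeping the signal meaningful in the face of preemption.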

TCP_RR and tbench are far away from the overlap breakeven point on
E5620, whereas with Q6600s shared L2, you can start converting overlap
into throughput almost immediately. 

2.4 GHz E5620
Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

2.4 GHz Q6600
Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-25  8:30                             ` Mike Galbraith
@ 2012-02-27 22:11                               ` Suresh Siddha
  2012-02-28  5:05                                 ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Suresh Siddha @ 2012-02-27 22:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Paul Turner

On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> My less rotund config shows the L2 penalty decidedly more prominently.
> We used to have avg_overlap as a synchronous wakeup hint, but it was
> broken by preemption and whatnot, got the axe to recover some cycles.  A
> reliable and dirt cheap replacement would be a good thing to have.
> 
> TCP_RR and tbench are far away from the overlap breakeven point on
> E5620, whereas with Q6600s shared L2, you can start converting overlap
> into throughput almost immediately. 
> 
> 2.4 GHz E5620
> Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> 
> 2.4 GHz Q6600
> Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> 

Also, it is not always just about the L2 cache being shared or not, or being
warm vs. cold, etc.  It also depends on the core C-states/P-states.  Waking up
an idle core has a cost, and that cost depends on which core C-state it is in.
And if we ping-pong between cores often, the cpufreq governor will come along
and request a lower core P-state, even though the load was keeping one core or
the other in the socket busy at any given point in time.
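
For illustration, the C-state side of that could be folded into the idle-cpu
choice along these lines (toy model only; the per-cpu exit latencies stand in
for whatever cpuidle would report and are not a real API here):

#include <stdint.h>

/*
 * Toy model: among idle candidate CPUs, prefer the one whose idle state
 * is cheapest to leave.  exit_latency_ns[] stands in for the per-CPU
 * C-state exit latency a real implementation would have to query; a
 * negative value marks a CPU that is not idle at all.
 */
static int pick_shallowest_idle(const int64_t *exit_latency_ns, int nr_cpus)
{
	int64_t best_lat = INT64_MAX;
	int best = -1;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (exit_latency_ns[cpu] < 0)
			continue;		/* busy, not a candidate */
		if (exit_latency_ns[cpu] < best_lat) {
			best_lat = exit_latency_ns[cpu];
			best = cpu;
		}
	}
	return best;				/* -1 if nothing is idle */
}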

thanks,
suresh


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-27 22:11                               ` Suresh Siddha
@ 2012-02-28  5:05                                 ` Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2012-02-28  5:05 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Paul Turner

On Mon, 2012-02-27 at 14:11 -0800, Suresh Siddha wrote: 
> On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> > My less rotund config shows the L2 penalty decidedly more prominently.
> > We used to have avg_overlap as a synchronous wakeup hint, but it was
> > broken by preemption and whatnot, got the axe to recover some cycles.  A
> > reliable and dirt cheap replacement would be a good thing to have.
> > 
> > TCP_RR and tbench are far away from the overlap breakeven point on
> > E5620, whereas with Q6600s shared L2, you can start converting overlap
> > into throughput almost immediately. 
> > 
> > 2.4 GHz E5620
> > Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> > Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> > 
> > 2.4 GHz Q6600
> > Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> > Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> > 
> 
> Also, it is not always just about the L2 cache being shared or not, or being
> warm vs. cold, etc.  It also depends on the core C-states/P-states.  Waking up
> an idle core has a cost, and that cost depends on which core C-state it is in.
> And if we ping-pong between cores often, the cpufreq governor will come along
> and request a lower core P-state, even though the load was keeping one core or
> the other in the socket busy at any given point in time.

Yeah, pinning yields a couple of percent on the Q6600 box, more on the E5620
despite its spiffier gearbox.. likely turbo-boost doing its thing.

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26 18:06           ` Peter Zijlstra
@ 2012-03-27 13:56             ` Mike Galbraith
  0 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2012-03-27 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Suresh Siddha, linux-kernel,
	Paul Turner

On Mon, 2012-03-26 at 20:06 +0200, Peter Zijlstra wrote:

> The biggest regression came from tbench, wasn't that mostly a random
> number generator anyway?

That's dbench ;-)  Tbench does wiggle around more than you'd like
though.  I've never tracked it, but in lots of running the thing, 2-3% you
can forget chasing, and I once bisected a 5% regression to an unrelated
change, reverted and re-applied the suspect a few times in disbelief,
and it tracked.  You can't trust it _too_ much.  I consider it a
reliable indicator that I'd better do more testing.  If others agree,
it's a real regression, if not, it's those damn Elves 'n Gremlins ;-)

-Mike



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26 17:35         ` Srivatsa Vaddagiri
@ 2012-03-26 18:06           ` Peter Zijlstra
  2012-03-27 13:56             ` Mike Galbraith
  0 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2012-03-26 18:06 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Mon, 2012-03-26 at 23:05 +0530, Srivatsa Vaddagiri wrote:
> 
> From what I can tell, the huge improvement in benchmark score is coming from 
> reduced latencies for its VM tasks. 

But if the machine is pegged, latency should not impact throughput (since
there's always work to do), so are you creating extra idle time some
place?

Are you running against lock-inversion in the vcpus? Or that tlb
shootdown issue we had in the gang scheduling thread? Both are typically
busy-wait time, which is of course harder to spot than actual idle
time :/

Then again, reducing latency is good, so I don't object to that per-se,
but that flips the question, why does it regress those other loads?

The biggest regression came from tbench, wasn't that mostly a random
number generator anyway? How stable are those results, do you have a
variance measure on the results?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26  8:36       ` Peter Zijlstra
@ 2012-03-26 17:35         ` Srivatsa Vaddagiri
  2012-03-26 18:06           ` Peter Zijlstra
  0 siblings, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-26 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-03-26 10:36:00]:

> >                 tip     tip + patch 
> > 
> > volano          1       1.29   (29% improvement)
> > sysbench [n3]   1       2      (100% improvement)
> > tbench 1 [n4]   1       1.07   (7% improvement)
> > tbench 8 [n5]   1       1.26   (26% improvement)
> > httperf  [n6]   1       1.05   (5% improvement)
> > Trade           1       1.31   (31% improvement) 
> 
> That smells like there's more to the story, a 100% improvement is too
> much..

Yeah I have rubbed my eyes several times to make sure it's true and ran
the same benchmark (sysbench) again now! I can recreate that ~100%
improvement with the patch even now.

To quickly recap my environment, I have a 16-cpu machine w/ 5 cgroups.
1 cgroup (8192 shares) hosts sysbench inside an 8-vcpu VM while the remaining 4
cgroups (1024 shares each) host 4 cpu hogs each, running on bare metal.
Given this overcommitment, select_idle_sibling() should mostly be a
no-op (i.e. it won't find any idle cores and thus defaults to prev_cpu).
Also, the only tasks that will (sleep and) wake up are the VM tasks.

Since the patch potentially affects (improves) scheduling latencies, I measured 
Sum(se.statistics.wait_sum) for the VM tasks over the benchmark run (5
iterations of sysbench).

tip	    : 987240 ms
tip + patch : 280275 ms 
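
(For anyone wanting to reproduce that number: a rough sketch of one way to sum
se.statistics.wait_sum over a set of PIDs/TIDs is below.  It assumes
CONFIG_SCHEDSTATS and that the field appears under that name in
/proc/<pid>/sched; it is only an illustration of the methodology, not the
exact script used here.)

#include <stdio.h>

int main(int argc, char **argv)
{
	double total_ms = 0.0;

	for (int i = 1; i < argc; i++) {
		char path[64], line[256];
		FILE *f;
		double val;

		snprintf(path, sizeof(path), "/proc/%s/sched", argv[i]);
		f = fopen(path, "r");
		if (!f) {
			perror(path);
			continue;
		}
		/* look for the "se.statistics.wait_sum : <ms>" line */
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, " se.statistics.wait_sum : %lf",
				   &val) == 1) {
				total_ms += val;
				break;
			}
		}
		fclose(f);
	}
	printf("Sum(se.statistics.wait_sum) = %.3f ms\n", total_ms);
	return 0;
}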

I will get more comprehensive perf data shortly and post. 

From what I can tell, the huge improvement in benchmark score is coming from 
reduced latencies for its VM tasks. 

The hard part to figure out (when we are inside select_task_rq_fair()) is
whether any potential improvement in latencies (because of waking up on a
less loaded cpu) will outweigh the cost of potentially more L2-cache misses,
for which IMHO we don't have enough data to make a good decision.

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
  2012-03-26  8:29       ` Peter Zijlstra
@ 2012-03-26  8:36       ` Peter Zijlstra
  2012-03-26 17:35         ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 89+ messages in thread
From: Peter Zijlstra @ 2012-03-26  8:36 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> 
> Now the more interesting VM scenario:
> 
More interesting for whom? I still think virt is an utter waste of
time ;-)

>                 tip     tip + patch
> 
> volano          1       1.29   (29% improvement)
> sysbench [n3]   1       2      (100% improvement)
> tbench 1 [n4]   1       1.07   (7% improvement)
> tbench 8 [n5]   1       1.26   (26% improvement)
> httperf  [n6]   1       1.05   (5% improvement)
> Trade           1       1.31   (31% improvement) 

That smells like there's more to the story, a 100% improvement is too
much..

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
@ 2012-03-26  8:29       ` Peter Zijlstra
  2012-03-26  8:36       ` Peter Zijlstra
  2 siblings, 0 replies; 89+ messages in thread
From: Peter Zijlstra @ 2012-03-26  8:29 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> Anyway, there are degradations as well, considering which I see
> several 
> possibilities:
> 
> 1. Do balance-on-wake for vcpu threads only.

Hell no ;-) we're not going to special case some threads over others.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
>         - Either via sched_domain flags (disable SD_WAKE_AFFINE 
>           at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)

But domain flags are not exported -- except under SCHED_DEBUG and that
sysctl mess.. also SD_flags are not stable.

>         - Or via a new sched_feat(BALANCE_WAKE) tunable 

sched_feat() is not a stable ABI and shouldn't ever be used for anything
but debugging (hence it lives in debugfs and goes away if you disable
SCHED_DEBUG).


I would very much like more information on why things are a loss. Is it
really related to what cpu you pick, or is it the cost of doing the
balance thing?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
@ 2012-03-23  6:38       ` Mike Galbraith
  2012-03-26  8:29       ` Peter Zijlstra
  2012-03-26  8:36       ` Peter Zijlstra
  2 siblings, 0 replies; 89+ messages in thread
From: Mike Galbraith @ 2012-03-23  6:38 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote: 
> * Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:
> 
> > > I did some experiments with volanomark and it does turn out to 
> > > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> > > benchmark that I am dealing with (Trade) benefits from it.
> > 
> > Does volanomark still do yield(), thereby invoking a random 
> > shuffle of thread scheduling and pretty much voluntarily 
> > ejecting itself from most scheduler performance considerations?
> > 
> > If it uses a real locking primitive such as futexes then its 
> > performance matters more.
> 
> Some more interesting results on more recent tip kernel.

Yeah, interesting.  I find myself ever returning to this message, as
gears grind away trying to imagine what's going on in those vcpus.

> Machine : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
> Kernel  : tip (HEAD at ee415e2)
> guest VM : 2.6.18 linux kernel based enterprise guest
> 
> Benchmarks are run in two scenarios:
> 
> 1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
> 2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
>  	 various cgroups) are run on host. Cgroup setup is as below:
> 
> 	/sys (cpu.shares = 1024, hosts all system tasks)
> 	/libvirt (cpu.shares = 20000)
>  	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
> 	/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)
> 
> First BM (bare metal) scenario:
> 
> 		tip	tip + patch
> 
> volano 		1	0.955   (4.5% degradation)
> sysbench [n1] 	1	0.9984  (0.16% degradation)
> tbench 1 [n2]	1	0.9096  (9% degradation)

Those make sense, fast path cycles added.

> Now the more interesting VM scenario:
> 
> 		tip	tip + patch
> 
> volano		1	1.29   (29% improvement)
> sysbench [n3]	1	2      (100% improvement)
> tbench 1 [n4] 	1       1.07   (7% improvement)
> tbench 8 [n5] 	1       1.26   (26% improvement)
> httperf  [n6]	1 	1.05   (5% improvement)
> Trade		1	1.31   (31% improvement)
> 
> Notes:
>  
> n1. sysbench was run with 16 threads.
> n2. tbench was run on localhost with 1 client 
> n3. sysbench was run with 8 threads
> n4. tbench was run on localhost with 1 client
> n5. tbench was run over network with 8 clients
> n6. httperf was run with burst-length of 100 and wsess of 100,500,0
> 
> So the patch seems to be a wholesale win when VCPU threads are waking
> up (in a highly contended environment). One reason could be that any assumption 
> of better cache hits by running (vcpu) threads on their prev_cpu may not
> be fully correct as vcpu threads could represent many different threads
> internally?
> 
> Anyway, there are degradations as well, considering which I see several 
> possibilities:
> 
> 1. Do balance-on-wake for vcpu threads only.

That's what your numbers say to me with this patch.  I'm not getting the
why, but your patch appears to reduce vcpu internal latencies hugely.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
> 	- Either via sched_domain flags (disable SD_WAKE_AFFINE 
>  	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
> 	- Or via a new sched_feat(BALANCE_WAKE) tunable
> 
> Any other thoughts or suggestions for more experiments?

Other than nuke select_idle_sibling() entirely instead, none here.

-Mike


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-06  9:14   ` Ingo Molnar
  2012-03-06 10:03     ` Srivatsa Vaddagiri
@ 2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
                         ` (2 more replies)
  1 sibling, 3 replies; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-22 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> > I did some experiments with volanomark and it does turn out to 
> > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> > benchmark that I am dealing with (Trade) benefits from it.
> 
> Does volanomark still do yield(), thereby invoking a random 
> shuffle of thread scheduling and pretty much voluntarily 
> ejecting itself from most scheduler performance considerations?
> 
> If it uses a real locking primitive such as futexes then its 
> performance matters more.

Some more interesting results on more recent tip kernel.

Machine : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
Kernel  : tip (HEAD at ee415e2)
guest VM : 2.6.18 linux kernel based enterprise guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
 	 various cgroups) are run on host. Cgroup setup is as below:

	/sys (cpu.shares = 1024, hosts all system tasks)
	/libvirt (cpu.shares = 20000)
 	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
	/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)

First BM (bare metal) scenario:

		tip	tip + patch

volano 		1	0.955   (4.5% degradation)
sysbench [n1] 	1	0.9984  (0.16% degradation)
tbench 1 [n2]	1	0.9096  (9% degradation)

Now the more interesting VM scenario:

		tip	tip + patch

volano		1	1.29   (29% improvement)
sysbench [n3]	1	2      (100% improvement)
tbench 1 [n4] 	1       1.07   (7% improvement)
tbench 8 [n5] 	1       1.26   (26% improvement)
httperf  [n6]	1 	1.05   (5% improvement)
Trade		1	1.31   (31% improvement)

Notes:
 
n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client 
n3. sysbench was run with 8 threads
n4. tbench was run on localhost with 1 client
n5. tbench was run over network with 8 clients
n6. httperf was run with burst-length of 100 and wsess of 100,500,0

So the patch seems to be a wholesale win when VCPU threads are waking
up (in a highly contended environment). One reason could be that any assumption 
of better cache hits by running (vcpu) threads on their prev_cpu may not
be fully correct as vcpu threads could represent many different threads
internally?

Anyway, there are degradations as well, considering which I see several 
possibilities:

1. Do balance-on-wake for vcpu threads only.
2. Document tuning possibility to improve performance in virtualized
   environment:
	- Either via sched_domain flags (disable SD_WAKE_AFFINE 
 	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
	- Or via a new sched_feat(BALANCE_WAKE) tunable

Any other thoughts or suggestions for more experiments?


--

Balance threads on wakeup to let them run on the least-loaded CPU in the same
cache domain as their prev_cpu (or cur_cpu if the wake_affine() test obliges).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |    5 ++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2766,7 +2766,10 @@ select_task_rq_fair(struct task_struct *
 			prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-06  9:14   ` Ingo Molnar
@ 2012-03-06 10:03     ` Srivatsa Vaddagiri
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-06 10:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> Does volanomark still do yield(),

I could count 3.5M yield() calls across 16 cpus over a period of 5.4sec,
so yes it does seem to make very good use of yield()!

> thereby invoking a random 
> shuffle of thread scheduling and pretty much voluntarily 
> ejecting itself from most scheduler performance considerations?
> 
> If it uses a real locking primitive such as futexes then its 
> performance matters more.

Good point ..I will gather more data for additional benchmarks
(tbench, sysbench) and post shortly.

- vatsa


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-05 15:24 ` Srivatsa Vaddagiri
@ 2012-03-06  9:14   ` Ingo Molnar
  2012-03-06 10:03     ` Srivatsa Vaddagiri
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 89+ messages in thread
From: Ingo Molnar @ 2012-03-06  9:14 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:
> 
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:

> > > Enabling SD_BALANCE_WAKE used to be decidedly too 
> > > expensive to consider. Maybe that has changed, but I doubt 
> > > it.
> > 
> > Right, I thought I remembered some such, you could see it 
> > on wakeup heavy things like pipe-bench and that java msg 
> > passing thing, right?
> 
> I did some experiments with volanomark and it does turn out to 
> be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> benchmark that I am dealing with (Trade) benefits from it.

Does volanomark still do yield(), thereby invoking a random 
shuffle of thread scheduling and pretty much voluntarily 
ejecting itself from most scheduler performance considerations?

If it uses a real locking primitive such as futexes then its 
performance matters more.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
       [not found] <1329764866.2293.376.camhel@twins>
@ 2012-03-05 15:24 ` Srivatsa Vaddagiri
  2012-03-06  9:14   ` Ingo Molnar
  0 siblings, 1 reply; 89+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-05 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:

> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
> 
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

I did some experiments with volanomark and it does turn out to be
sensitive to SD_BALANCE_WAKE, while the other wake-heavy benchmark that I am
dealing with (Trade) benefits from it.

Normalized results for both benchmarks provided below. 

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at b86148a) 

	   	   Before patch	    After patch

Trade thr'put		1		2.17 (~117% improvement)
volanomark		1		0.8  (20% degradation)


Quick description of benchmarks 
===============================

Trade was run inside a 8-vcpu VM (cgroup). 4 other 4-vcpu VMs running
cpu hogs were also present, leading to this cgroup setup:

	/cgroup/sys (1024 shares - hosts all system tasks)
	/cgroup/libvirt (20000 shares)
	/cgroup/libvirt/qemu/VM1 (8192 cpu shares)
	/cgroup/libvirt/qemu/VM2-5 (1024 shares)

Volanomark server/client programs were run in root cgroup.

The patch essentially does balance-on-wake: it looks for any idle cpu in
the same cache domain as the task's prev_cpu (or cur_cpu if wake_affine obliges),
and failing that, looks for the least loaded cpu. This helps minimize
latencies for the Trade workload (and thus boosts its score). For volanomark,
it seems to hurt because of waking on a colder L2 cache.  The
tradeoff seems to be between latency and cache misses. Short of adding
another tunable, are there better suggestions on how we can address this
sort of tradeoff?

Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |   26 +++++++++++++++++++++-----
 2 files changed, 23 insertions(+), 7 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2638,7 +2638,7 @@ static int select_idle_sibling(struct ta
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
+	int i, some_idle_cpu = -1;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2661,15 +2661,25 @@ static int select_idle_sibling(struct ta
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int skip = 0;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+							 tsk_cpus_allowed(p)) {
+				if (!idle_cpu(i)) {
+					if (some_idle_cpu >= 0)
+						goto next;
+					skip = 1;
+				} else
+					some_idle_cpu = i;
 			}
 
+			if (skip)
+				goto next;
+
 			target = cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
 			goto done;
@@ -2677,6 +2687,9 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	if (some_idle_cpu >= 0)
+		target = some_idle_cpu;
 done:
 	return target;
 }
@@ -2766,7 +2779,10 @@ select_task_rq_fair(struct task_struct *
 			prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2012-03-27 13:56 UTC | newest]

Thread overview: 89+ messages
2011-11-15  9:46 sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
2011-11-16  1:14 ` Suresh Siddha
2011-11-16  9:24   ` Mike Galbraith
2011-11-16 18:37     ` Suresh Siddha
2011-11-17  1:59       ` Mike Galbraith
2011-11-17 15:38         ` Mike Galbraith
2011-11-17 15:56           ` Peter Zijlstra
2011-11-17 16:38             ` Mike Galbraith
2011-11-17 17:36               ` Suresh Siddha
2011-11-18 15:14                 ` Mike Galbraith
2011-11-18 15:17                   ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Mike Galbraith
2011-11-18 15:35                     ` Peter Zijlstra
2011-11-18 17:34                       ` Mike Galbraith
2011-11-22 14:17                       ` Mike Galbraith
2011-11-22 14:18                         ` [patch 1/7] " Mike Galbraith
2011-12-06  9:50                           ` [tip:sched/core] sched: Use " tip-bot for Mike Galbraith
2011-11-22 14:20                         ` [patch 2/7] sched: save some hrtick_start_fair cycles Mike Galbraith
2011-12-06 20:20                           ` [tip:sched/core] sched: Save " tip-bot for Mike Galbraith
2011-11-22 14:21                         ` [patch 3/7] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
2011-11-23 11:53                           ` Peter Zijlstra
2011-11-23 12:06                             ` Mike Galbraith
2011-11-23 14:48                           ` Peter Zijlstra
2011-11-24  3:50                             ` Mike Galbraith
2011-11-24 10:12                               ` Peter Zijlstra
2011-11-25  6:39                                 ` Mike Galbraith
2011-12-06  9:51                           ` [tip:sched/core] sched: Set " tip-bot for Mike Galbraith
2011-11-22 14:22                         ` [patch 4/7] sched: convert rq->avg_idle to rq->avg_event Mike Galbraith
2011-11-23 11:55                           ` Peter Zijlstra
2011-11-23 12:09                             ` Mike Galbraith
2011-11-23 12:27                               ` Peter Zijlstra
2011-11-23 12:57                                 ` Mike Galbraith
2011-11-23 14:21                                   ` Mike Galbraith
2011-11-22 14:23                         ` [patch 5/7] sched: ratelimit select_idle_sibling()for sync wakeups Mike Galbraith
2011-11-22 14:24                         ` [patch 6/7] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
2011-11-23 11:57                           ` Peter Zijlstra
2011-11-23 12:35                           ` Mike Galbraith
2011-11-22 14:26                         ` [patch 7/7] sched: only use TTWU_QUEUE when waker/wakee CPUs do not share top level cache Mike Galbraith
2011-11-23 12:08                           ` Peter Zijlstra
2011-11-18 15:39                     ` [patch 1/6] sched: use rt.nr_cpus_allowed to recover select_task_rq() cycles Hillf Danton
2011-11-18 15:18                   ` [patch 2/6] sched: convert rq->avg_idle to rq->avg_event Mike Galbraith
2011-11-18 15:19                   ` [patch 3/6] sched: use rq->avg_event to resurrect nohz ratelimiting Mike Galbraith
2011-11-18 15:36                     ` Peter Zijlstra
2011-11-18 17:42                       ` Mike Galbraith
2011-11-19  0:51                         ` Van De Ven, Arjan
2011-11-19  4:15                           ` Mike Galbraith
2011-11-18 15:20                   ` [patch 4/6] sched: ratelimit select_idle_sibling()for sync wakeups Mike Galbraith
2011-11-18 15:22                   ` [patch 5/6] sched: save some hrtick_start_fair cycles Mike Galbraith
2011-11-18 15:23                   ` [patch 6/6] sched: set skip_clock_update in yield_task_fair() Mike Galbraith
2012-02-20 14:41                   ` sched: Avoid SMT siblings in select_idle_sibling() if possible Peter Zijlstra
2012-02-20 15:03                     ` Srivatsa Vaddagiri
2012-02-20 18:25                       ` Mike Galbraith
2012-02-21  0:06                         ` Srivatsa Vaddagiri
2012-02-21  6:37                           ` Mike Galbraith
2012-02-21  8:09                             ` Srivatsa Vaddagiri
2012-02-20 18:14                     ` Mike Galbraith
2012-02-20 18:15                       ` Peter Zijlstra
2012-02-20 19:07                       ` Peter Zijlstra
2012-02-21  5:43                         ` Mike Galbraith
2012-02-21  8:32                           ` Srivatsa Vaddagiri
2012-02-21  9:21                             ` Mike Galbraith
2012-02-21 10:37                               ` Peter Zijlstra
2012-02-21 14:58                                 ` Srivatsa Vaddagiri
2012-02-23 10:49                       ` Srivatsa Vaddagiri
2012-02-23 11:19                         ` Ingo Molnar
2012-02-23 12:18                           ` Srivatsa Vaddagiri
2012-02-23 11:20                         ` Srivatsa Vaddagiri
2012-02-23 11:26                           ` Ingo Molnar
2012-02-23 11:32                             ` Srivatsa Vaddagiri
2012-02-23 16:17                               ` Ingo Molnar
2012-02-23 11:21                         ` Mike Galbraith
2012-02-25  6:54                           ` Srivatsa Vaddagiri
2012-02-25  8:30                             ` Mike Galbraith
2012-02-27 22:11                               ` Suresh Siddha
2012-02-28  5:05                                 ` Mike Galbraith
2011-11-17 19:08             ` Suresh Siddha
2011-11-18 15:12               ` Peter Zijlstra
2011-11-18 15:26                 ` Mike Galbraith
2011-12-06  9:49               ` [tip:sched/core] sched: Clean up domain traversal in select_idle_sibling() tip-bot for Suresh Siddha
2011-11-18 23:40 ` [tip:sched/core] sched: Avoid SMT siblings in select_idle_sibling() if possible tip-bot for Peter Zijlstra
     [not found] <1329764866.2293.376.camhel@twins>
2012-03-05 15:24 ` Srivatsa Vaddagiri
2012-03-06  9:14   ` Ingo Molnar
2012-03-06 10:03     ` Srivatsa Vaddagiri
2012-03-22 15:32     ` Srivatsa Vaddagiri
2012-03-23  6:38       ` Mike Galbraith
2012-03-26  8:29       ` Peter Zijlstra
2012-03-26  8:36       ` Peter Zijlstra
2012-03-26 17:35         ` Srivatsa Vaddagiri
2012-03-26 18:06           ` Peter Zijlstra
2012-03-27 13:56             ` Mike Galbraith
