* [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine()
@ 2010-03-05 18:39 Suresh Siddha
  2010-03-05 18:39 ` [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair() Suresh Siddha
  2010-03-05 19:36 ` [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine() Mike Galbraith
  0 siblings, 2 replies; 12+ messages in thread
From: Suresh Siddha @ 2010-03-05 18:39 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mike Galbraith
  Cc: Arjan van de Ven, linux-kernel, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy, Suresh Siddha

[-- Attachment #1: fix_wake_affine.patch --]
[-- Type: text/plain, Size: 2083 bytes --]

On a single cpu system with SMT, in the scenario of one SMT thread being
idle while the other SMT thread runs a task that does a non-sync wakeup of
another task, we see (from the traces) that the woken-up task ends up running
on the busy thread instead of the idle thread. Idle balancing that comes in
a little bit later fixes the scenario.

But fixing this wake balance and running the woken-up task directly on the
idle SMT thread improves performance (phoronix 7zip compression workload)
by ~9% on an atom platform.

During a process wakeup, select_task_rq_fair() and wake_affine() decide
whether to wake the task on the previous cpu that the task ran on or on
the cpu where it is currently being woken up.

select_task_rq_fair() also checks whether there are any idle siblings of
the cpu that the task is woken up on, to ensure that we select an idle
sibling rather than a busy cpu.

In the above load scenario, it so happens that prev_cpu (where the task
ran before) and this_cpu (where it is currently being woken up) are the same.
In this case wake_affine() ends up returning 0, so the idle sibling chosen
by select_idle_sibling() in select_task_rq_fair() is discarded, and further
down the path select_task_rq_fair() ultimately selects the currently
running cpu (the busy SMT thread instead of the idle SMT thread).
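
To illustrate, the tail of select_task_rq_fair() currently looks roughly
like this (a simplified sketch, not the exact code):

	/*
	 * cpu may already point at the idle sibling found by
	 * select_idle_sibling(), but that choice is honored only
	 * when wake_affine() succeeds.
	 */
	if (affine_sd && wake_affine(affine_sd, p, sync))
		return cpu;

	/*
	 * Otherwise we fall through to the domain balancing path,
	 * which in this scenario picks the busy SMT thread.
	 */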

Check for prev_cpu == this_cpu in wake_affine(); there is no need to do
any fancy stuff (and ultimately take the wrong decision) in this case.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched_fair.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1252,6 +1252,10 @@ static int wake_affine(struct sched_doma
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
+
+	if (prev_cpu == this_cpu)
+		return 1;
+
 	load	  = source_load(prev_cpu, idx);
 	this_load = target_load(this_cpu, idx);
 




* [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-05 18:39 [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine() Suresh Siddha
@ 2010-03-05 18:39 ` Suresh Siddha
  2010-03-05 20:25   ` Mike Galbraith
  2010-03-06  8:36   ` Mike Galbraith
  2010-03-05 19:36 ` [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine() Mike Galbraith
  1 sibling, 2 replies; 12+ messages in thread
From: Suresh Siddha @ 2010-03-05 18:39 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Mike Galbraith
  Cc: Arjan van de Ven, linux-kernel, Vaidyanathan Srinivasan,
	Yanmin Zhang, Gautham R Shenoy, Suresh Siddha

[-- Attachment #1: fix_lat_ctx.patch --]
[-- Type: text/plain, Size: 4822 bytes --]

Performance improvements with this patch:
"lat_ctx -s 0 2" ~22usec (before-this-patch)	~5usec (after-this-patch)

There are a number of things wrong with the select_idle_sibling() logic:

a) Once we select the idle sibling, we use that domain (spanning the cpu that
   the task is currently being woken up on and the idle sibling that we found)
   in our wake_affine() comparisons. This domain is completely different from
   the domain we are supposed to use: the one that spans the cpu that the task
   is being woken up on and the cpu where the task previously ran.

b) We do the select_idle_sibling() check only for the cpu that the task is
   currently being woken up on. If wake_affine() decides to select the cpu
   where the task previously ran, a select_idle_sibling() check for that cpu
   would also help, and we don't do this currently.

c) select_idle_sibling() should also treat the current cpu as an idle
   cpu if it is a sync wakeup and we have only one task running.

Fixing all of this improves lat_ctx performance. There might also be other
workloads where the select_idle_sibling() check on the previously-ran cpu
helps.
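
In code terms, the decision path with this patch becomes roughly the
following (a simplified sketch of the last hunk below, not the exact code):

	if (affine_sd) {
		/* wake_affine() picks between this_cpu and prev_cpu... */
		target = wake_affine(affine_sd, p, sync) ? cpu : prev_cpu;

		/* ...then look for an idle sibling around whichever won. */
		return select_idle_sibling(p, target, sync);
	}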

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched_fair.c |   73 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 42 insertions(+), 31 deletions(-)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1411,28 +1411,49 @@ find_idlest_cpu(struct sched_group *grou
  * Try and locate an idle CPU in the sched_domain.
  */
 static int
-select_idle_sibling(struct task_struct *p, struct sched_domain *sd, int target)
+select_idle_sibling(struct task_struct *p, int target, int sync)
 {
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int i;
+	struct sched_domain *sd;
+
+	/*
+	 * If the task is going to be woken-up on this cpu and if it is
+	 * already idle or going to be idle, then it is the right target.
+	 */
+	if (target == cpu && (!cpu_rq(cpu)->cfs.nr_running ||
+			      (sync && cpu_rq(cpu)->cfs.nr_running == 1)))
+		return cpu;
 
 	/*
-	 * If this domain spans both cpu and prev_cpu (see the SD_WAKE_AFFINE
-	 * test in select_task_rq_fair) and the prev_cpu is idle then that's
-	 * always a better target than the current cpu.
+	 * If the task is going to be woken-up on the cpu where it previously
+	 * ran and if it is currently idle, then it is the right target.
 	 */
-	if (target == cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
+	if (target == prev_cpu && !cpu_rq(prev_cpu)->cfs.nr_running)
 		return prev_cpu;
 
 	/*
-	 * Otherwise, iterate the domain and find an elegible idle cpu.
+	 * Otherwise, iterate the domains and find an eligible idle cpu.
 	 */
-	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
-		if (!cpu_rq(i)->cfs.nr_running) {
-			target = i;
+	for_each_domain(target, sd) {
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
+
+		for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+			if (!cpu_rq(i)->cfs.nr_running) {
+				target = i;
+				break;
+			}
 		}
+
+		/*
+	 * Let's stop looking for an idle sibling once we reach
+		 * the domain that spans the current cpu and prev_cpu.
+		 */
+		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+			break;
 	}
 
 	return target;
@@ -1496,32 +1517,17 @@ static int select_task_rq_fair(struct ta
 
 		/*
 		 * While iterating the domains looking for a spanning
-		 * WAKE_AFFINE domain, adjust the affine target to any idle cpu
-		 * in cache sharing domains along the way.
+		 * WAKE_AFFINE domain.
 		 */
 		if (want_affine) {
-			int target = -1;
-
 			/*
 			 * If both cpu and prev_cpu are part of this domain,
 			 * cpu is a valid SD_WAKE_AFFINE target.
 			 */
-			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
-				target = cpu;
-
-			/*
-			 * If there's an idle sibling in this domain, make that
-			 * the wake_affine target instead of the current cpu.
-			 */
-			if (tmp->flags & SD_SHARE_PKG_RESOURCES)
-				target = select_idle_sibling(p, tmp, target);
-
-			if (target >= 0) {
-				if (tmp->flags & SD_WAKE_AFFINE) {
-					affine_sd = tmp;
-					want_affine = 0;
-				}
-				cpu = target;
+			if (cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
+			    && (tmp->flags & SD_WAKE_AFFINE)) {
+				affine_sd = tmp;
+				want_affine = 0;
 			}
 		}
 
@@ -1549,8 +1555,13 @@ static int select_task_rq_fair(struct ta
 			update_shares(tmp);
 	}
 
-	if (affine_sd && wake_affine(affine_sd, p, sync))
-		return cpu;
+	if (affine_sd) {
+		int target;
+
+		target = wake_affine(affine_sd, p, sync) ? cpu : prev_cpu;
+
+		return select_idle_sibling(p, target, sync);
+	}
 
 	while (sd) {
 		int load_idx = sd->forkexec_idx;




* Re: [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine()
  2010-03-05 18:39 [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine() Suresh Siddha
  2010-03-05 18:39 ` [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair() Suresh Siddha
@ 2010-03-05 19:36 ` Mike Galbraith
  2010-03-08 19:09   ` Suresh Siddha
  1 sibling, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2010-03-05 19:36 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Fri, 2010-03-05 at 10:39 -0800, Suresh Siddha wrote:
> plain text document attachment (fix_wake_affine.patch)
> On a single cpu system with SMT, in the scenario of one SMT thread being
> idle while the other SMT thread runs a task that does a non-sync wakeup of
> another task, we see (from the traces) that the woken-up task ends up running
> on the busy thread instead of the idle thread. Idle balancing that comes in
> a little bit later fixes the scenario.

Yup, wake_affine() fails for a non-sync wakeup when 1 task is running.
That's annoying, but making it succeed globally worries me.  We need a
high quality hint, and avg_overlap ain't it unfortunately, because to
get accurate overlap info cross cpu, you have to double the clock and
update_curr() overhead.  We need dirt cheap.

> But fixing this wake balance and running the woken-up task directly on the
> idle SMT thread improves performance (phoronix 7zip compression workload)
> by ~9% on an atom platform.

So there is profit to be had.
  
> During a process wakeup, select_task_rq_fair() and wake_affine() decide
> whether to wake the task on the previous cpu that the task ran on or on
> the cpu where it is currently being woken up.
> 
> select_task_rq_fair() also checks whether there are any idle siblings of
> the cpu that the task is woken up on, to ensure that we select an idle
> sibling rather than a busy cpu.

Yeah, but with the 1 task + non-sync wakeup scenario, we miss the boat
because select_idle_sibling() uses wake_affine() success as its
enabler.  I did that because I couldn't think up something else which
did not harm multiple buddy pairs.  You can globally say sibling is
idle, go for it, but that _does_ cause throughput loss during ramp up.

Best alternative I've found is to only check for an idle sibling/cache
when there is exactly one task on the current cpu (ie put some faith in
load balancing), then force idle sibling selection.  Also not optimal.
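
In rough pseudocode, that alternative looks something like this (an
untested sketch using the current select_idle_sibling() signature, not
what's in my tree):

	/*
	 * Put some faith in load balancing: only hunt for an idle
	 * sibling when this cpu is running exactly one task, and
	 * then use whatever select_idle_sibling() finds.
	 */
	if (cpu_rq(cpu)->cfs.nr_running == 1)
		return select_idle_sibling(p, tmp, cpu);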
 
> In the above load scenario, it so happens that prev_cpu (where the task
> ran before) and this_cpu (where it is currently being woken up) are the same.
> In this case wake_affine() ends up returning 0, so the idle sibling chosen
> by select_idle_sibling() in select_task_rq_fair() is discarded, and further
> down the path select_task_rq_fair() ultimately selects the currently
> running cpu (the busy SMT thread instead of the idle SMT thread).
> 
> Check for prev_cpu == this_cpu in wake_affine(); there is no need to do
> any fancy stuff (and ultimately take the wrong decision) in this case.

I have a slightly different patch for that in my tree.  There's no need
to even call wake_affine() since the result is meaningless.

---
 kernel/sched_fair.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Index: linux-2.6.34.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.34.git.orig/kernel/sched_fair.c
+++ linux-2.6.34.git/kernel/sched_fair.c
@@ -1547,8 +1547,14 @@ static int select_task_rq_fair(struct ta
 	}
 #endif
 
-	if (affine_sd && wake_affine(affine_sd, p, sync))
-		return cpu;
+	if (affine_sd) {
+		if (cpu == prev_cpu)
+			return cpu;
+		if (wake_affine(affine_sd, p, sync))
+			return cpu;
+		if (!(affine_sd->flags & SD_BALANCE_WAKE))
+			return prev_cpu;
+	}
 
 	while (sd) {
 		int load_idx = sd->forkexec_idx;




* Re: [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-05 18:39 ` [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair() Suresh Siddha
@ 2010-03-05 20:25   ` Mike Galbraith
  2010-03-06  8:21     ` Mike Galbraith
  2010-03-08 22:24     ` Suresh Siddha
  2010-03-06  8:36   ` Mike Galbraith
  1 sibling, 2 replies; 12+ messages in thread
From: Mike Galbraith @ 2010-03-05 20:25 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Fri, 2010-03-05 at 10:39 -0800, Suresh Siddha wrote:
> plain text document attachment (fix_lat_ctx.patch)
> Performance improvements with this patch:
> "lat_ctx -s 0 2" ~22usec (before-this-patch)	~5usec (after-this-patch)

Hm.  On my Q6600 box, it's nowhere near that.

> There are a number of things wrong with the select_idle_sibling() logic:
> 
> a) Once we select the idle sibling, we use that domain (spanning the cpu that
>    the task is currently being woken up on and the idle sibling that we found)
>    in our wake_affine() comparisons. This domain is completely different from
>    the domain we are supposed to use: the one that spans the cpu that the task
>    is being woken up on and the cpu where the task previously ran.
> 
> b) We do the select_idle_sibling() check only for the cpu that the task is
>    currently being woken up on. If wake_affine() decides to select the cpu
>    where the task previously ran, a select_idle_sibling() check for that cpu
>    would also help, and we don't do this currently.
> 
> c) select_idle_sibling() should also treat the current cpu as an idle
>    cpu if it is a sync wakeup and we have only one task running.

I'm going to have to crawl over and test the above, but this bit sounds
like a decidedly un-good thing to do.  Maybe I'm misunderstanding.

Check these lmbench3 numbers, ie the AF UNIX numbers in the last three
runs vs the three above that.  That's what I get with the load running
on one core, because I disabled select_idle_sibling() for these runs to
compare the cost/benefit of using an idle shared cache core.  The wakeup
in question is a sync wakeup; otherwise we'd be taking the same beating
TCP takes in stock 31.12 and stock 33 (first 2 sets of triple runs).

Calling the waking cpu idle in that case is a mistake.  Just because the
sync hint was used does not mean there is no gain to be had.  In the
case of this benchmark proggy, that gain is a _lot_, same for the TCP
proggy after I enabled sync hint in smpx tree.  We don't want high
frequency cache misses for sure, but we also don't want to assume
there's nothing to be had by using another core.  There's currently no
way to tell if you can gain by using another core or not, other than to
try it.

If you run tip, you can see a throughput gain even with the pipe test,
because there's a buffer increase patch there which, combined with
owner_spin, produces a gain even with the highly synchronous pipe test;
select_idle_sibling() is only the enabler (hard to spin if on the same
core as the mutex owner:).


*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
marge     2.6.31.12-smp 0.730 2.845 4.85 6.463  11.3  26.2  14.9  31.
marge     2.6.31.12-smp 0.750 2.864 4.78 6.460  11.2  22.9  14.6  31.
marge     2.6.31.12-smp 0.710 2.835 4.81 6.478  11.5  11.0  14.5  30.
marge        2.6.33-smp 1.320 4.552 5.02 9.169  12.5  26.5  15.4  18.
marge        2.6.33-smp 1.450 4.621 5.45 9.286  12.5  11.4  15.4  18.
marge        2.6.33-smp 1.450 4.589 5.53 9.168  12.6  27.5  15.4  18.
marge       2.6.33-smpx 1.160 3.565 5.97 7.513  11.3 9.776  13.9  18.
marge       2.6.33-smpx 1.140 3.569 6.02 7.479  11.2 9.849  14.0  18.
marge       2.6.33-smpx 1.090 3.563 6.39 7.450  11.2 9.785  14.0  18.
marge       2.6.33-smpx 0.730 2.665 4.85 6.565  11.9  10.3  15.2  31.
marge       2.6.33-smpx 0.740 2.701 4.03 6.573  11.7  10.3  15.4  31.
marge       2.6.33-smpx 0.710 2.753 4.86 6.533  11.7  10.3  15.3  31.


*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
marge     2.6.31.12-smp 2821 2971 762. 2829.2 4799.0 1243.0 1230.3 4469 1682.
marge     2.6.31.12-smp 2824 2931 760. 2833.3 4736.5 1239.5 1235.8 4462 1678.
marge     2.6.31.12-smp 2796 2936 1139 2843.3 4815.7 1242.8 1234.6 4471 1685.
marge        2.6.33-smp 2670 5151 739. 2816.6 4768.5 1243.7 1237.2 4389 1684.
marge        2.6.33-smp 2627 5126 1135 2806.9 4783.1 1245.1 1236.1 4413 1684.
marge        2.6.33-smp 2582 5037 1137 2799.6 4755.4 1242.0 1239.1 4471 1683.
marge       2.6.33-smpx 2848 5184 2972 2820.5 4804.8 1242.6 1236.9 4315 1686.
marge       2.6.33-smpx 2804 5183 2934 2822.8 4759.3 1245.0 1234.7 4462 1688.
marge       2.6.33-smpx 2729 5177 2920 2837.6 4820.0 1246.9 1238.5 4467 1684.
marge       2.6.33-smpx 2843 2896 1928 2786.5 4751.2 1242.2 1238.6 4493 1682.
marge       2.6.33-smpx 2869 2886 1936 2841.4 4748.9 1244.3 1237.7 4456 1683.
marge       2.6.33-smpx 2845 2895 1947 2836.0 4813.6 1242.7 1236.3 4473 1674.






* Re: [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-05 20:25   ` Mike Galbraith
@ 2010-03-06  8:21     ` Mike Galbraith
  2010-03-08 22:24     ` Suresh Siddha
  1 sibling, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2010-03-06  8:21 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Fri, 2010-03-05 at 21:25 +0100, Mike Galbraith wrote: 
> On Fri, 2010-03-05 at 10:39 -0800, Suresh Siddha wrote:
>  
> > c) select_idle_sibling() should also treat the current cpu as an idle
> >    cpu if it is a sync wakeup and we have only one task running.
> 
> I'm going to have to crawl over and test the above, but this bit sounds
> like a decidedly un-good thing to do.  Maybe I'm misunderstanding.

Nope, no misunderstanding.  This patch does kill throughput gains.  "Once
awakened affine, always awaken affine" is a bad idea.

I dug up my old P4 though.  With its wimpy siblings, the cost of
running two schedulers doesn't appear to be generally worth it at a
glance.  You need considerable overlap to break even.

	-Mike



* Re: [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-05 18:39 ` [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair() Suresh Siddha
  2010-03-05 20:25   ` Mike Galbraith
@ 2010-03-06  8:36   ` Mike Galbraith
  1 sibling, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2010-03-06  8:36 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Fri, 2010-03-05 at 10:39 -0800, Suresh Siddha wrote:
> plain text document attachment (fix_lat_ctx.patch)
> Performance improvements with this patch:
> "lat_ctx -s 0 2" ~22usec (before-this-patch)	~5usec (after-this-patch)

BTW, is your cpu governor kicking into high gear?  That unpatched number
looks like a throttled down cpu to me.  I know on my box, when there are
micro idles, sometimes cores do not throttle up, making for some fairly
crappy numbers.  Q6600 doesn't throttle down far enough to make such a
_huge_ difference though.

	-Mike



* Re: [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine()
  2010-03-05 19:36 ` [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine() Mike Galbraith
@ 2010-03-08 19:09   ` Suresh Siddha
  2010-03-08 22:25     ` Mike Galbraith
  0 siblings, 1 reply; 12+ messages in thread
From: Suresh Siddha @ 2010-03-08 19:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

hi Mike,

On Fri, 2010-03-05 at 11:36 -0800, Mike Galbraith wrote:
> Yeah, but with the 1 task + non-sync wakeup scenario, we miss the boat
> because select_idle_sibling() uses wake_affine() success as its
> enabler.

But the wake_affine() decision is broken when this_cpu == prev_cpu.  All
we need to do is fix that to recover the ~9% improvement.

> I have a slightly different patch for that in my tree.  There's no need
> to even call wake_affine() since the result is meaningless.

I don't think your below fix is correct because:


> -	if (affine_sd && wake_affine(affine_sd, p, sync))
> -		return cpu;
> +	if (affine_sd) {
> +		if (cpu == prev_cpu)
> +			return cpu;


by this time, we have overwritten cpu using the select_idle_sibling()
logic and cpu no longer points to this_cpu.

What we need is a comparison with this_cpu.
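
I.e., something along these lines (sketch):

	int this_cpu = smp_processor_id();
	...
	if (affine_sd) {
		if (this_cpu == prev_cpu)
			return this_cpu;
		...
	}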

thanks,
suresh



* Re: [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-05 20:25   ` Mike Galbraith
  2010-03-06  8:21     ` Mike Galbraith
@ 2010-03-08 22:24     ` Suresh Siddha
  2010-03-09  6:05       ` Mike Galbraith
  1 sibling, 1 reply; 12+ messages in thread
From: Suresh Siddha @ 2010-03-08 22:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Fri, 2010-03-05 at 12:25 -0800, Mike Galbraith wrote:
> On Fri, 2010-03-05 at 10:39 -0800, Suresh Siddha wrote:
> > plain text document attachment (fix_lat_ctx.patch)
> > Performance improvements with this patch:
> > "lat_ctx -s 0 2" ~22usec (before-this-patch)	~5usec (after-this-patch)
> 
> Hm.  On my Q6600 box, it's nowhere near that.

My numbers are based on an atom netbook.

> Calling the waking cpu idle in that case is a mistake.  Just because the
> sync hint was used does not mean there is no gain to be had. 

Ok. I dropped that part in v2 patches that I just posted.

thanks,
suresh



* Re: [patch 1/2] sched: check for prev_cpu == this_cpu in wake_affine()
  2010-03-08 19:09   ` Suresh Siddha
@ 2010-03-08 22:25     ` Mike Galbraith
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2010-03-08 22:25 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Mon, 2010-03-08 at 11:09 -0800, Suresh Siddha wrote:
> hi Mike,
> 
> On Fri, 2010-03-05 at 11:36 -0800, Mike Galbraith wrote:
> > Yeah, but with the 1 task + non-sync wakeup scenario, we miss the boat
> > because select_idle_sibling() uses wake_affine() success as its
> > enabler.
> 
> But the wake_affine() decision is broken when this_cpu == prev_cpu.  All
> we need to do is fix that to recover the ~9% improvement.

The wake_affine() decision isn't broken, it is simply meaningless in that
case.  The primary decision is this cpu or previous cpu.  That's all
wake_affine() does.  If we have a serious imbalance, it says no, which
means "better luck next time", nothing more.  Its partner in crime is
active load balancing.

WRT the 9% improvement... we're talking about a single core yes?  No
alien cache with massive misses possible, yes?  IFF that's true, the
worst that can happen is you eat the price of running two schedulers vs
one.  The closer you get to a pure scheduler load, the more that appears
to matter.  In real life, there's very very frequently much more going
on that just scheduling, so these "is it 700ns or one whole usec"
benchmarks can distort reality.  If you are very close to only
scheduling, yes, select_idle_sibling() is a loser.  The cost of the
second scheduler is nowhere near free.

You can't get cheaper than one scheduler and preemption.  However, even
with something like TCP_RR (highly synchronous), I get better throughput
than a single core/single scheduler.  That's for a pure latency benchmark,
communicating just as fast as the network stack can service it.

The reason is fairness.  We don't insta-preempt, we have a bar that must
be reached.  If I tweak to increase preemption, you'll _see_ the cost of
running that second scheduler.  In reality, we have an idle core, this
load is entirely latency dominated, so cost of tapping that core is
negated.  It's a win, even for this heavy switching latency measurement
benchmark.  It's only a win because it does do a bit more than merely
schedule.

(worst case is pipes, pure scheduler, as pure as it gets, but even with
that there are cases where throughput gains are dramatic, because when
using two cores, you don't have to care about fairness, which is well
known to not necessarily be throughput's best friend)

marge:/root/tmp # netperf.sh 10
Starting netserver at port 12865
Starting netserver at hostname 0.0.0.0 port 12865 and family AF_UNSPEC
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 (127.0.0.1) port 0 AF_INET
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    107340.56
16384  87380
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 (127.0.0.1) port 0 AF_INET : cpu bind
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    103564.71
16384  87380

The first instance is free floating, the second is pinned to one core.
If I twiddle preemption, pinned will out perform free floating.  With
the preemption bar in place (a must), it's a modest win.

> > I have a slightly different patch for that in my tree.  There's no need
> > to even call wake_affine() since the result is meaningless.
> 
> I don't think your below fix is correct because:
> 
> 
> > -	if (affine_sd && wake_affine(affine_sd, p, sync))
> > -		return cpu;
> > +	if (affine_sd) {
> > +		if (cpu == prev_cpu)
> > +			return cpu;
> 
> 
> by this time, we have overwritten cpu using the select_idle_sibling()
> logic and cpu no longer points to this_cpu.

Yes, maybe.  And wake_affine() will say yea or nay.  It only matters if
the decision _sticks_, ie we can't/don't adapt.  We only need
wake_affine() because of the "not now".  Set it up to always select an
idle core if available, and watch what happens to buddy loads.
 
> What we need is a comparison with this_cpu.

I disagree.  It's really cheap to say "if it was affine previously, wake
it affine again", but thereby you tie tasks to one core for no good
reason.  As tested, tasks which demonstrably _can_ effectively use two
cores were tied to one core with your patch, and suffered dramatic
throughput loss.

I really don't think a pure scheduler benchmark has any meaning beyond
overhead measurement.

	-Mike



* Re: [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-08 22:24     ` Suresh Siddha
@ 2010-03-09  6:05       ` Mike Galbraith
  2010-03-09 13:27         ` Mike Galbraith
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2010-03-09  6:05 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Mon, 2010-03-08 at 14:24 -0800, Suresh Siddha wrote:
> On Fri, 2010-03-05 at 12:25 -0800, Mike Galbraith wrote:
> > On Fri, 2010-03-05 at 10:39 -0800, Suresh Siddha wrote:
> > > plain text document attachment (fix_lat_ctx.patch)
> > > Performance improvements with this patch:
> > > "lat_ctx -s 0 2" ~22usec (before-this-patch)	~5usec (after-this-patch)
> > 
> > Hm.  On my Q6600 box, it's nowhere near that.
> 
> My numbers are based on an atom netbook.

Wish I had an atom to play with; it's hard to even imagine 22 usec
lat_ctx, that's some _serious_ pain.

> > Calling the waking cpu idle in that case is a mistake.  Just because the
> > sync hint was used does not mean there is no gain to be had. 
> 
> Ok. I dropped that part in v2 patches that I just posted.

Yeah, I see mails (ramble ramble) crossed in the night.  I'll take them
out for a spin, see if your box can get fixed up without busting mine :)

	-Mike



* Re: [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-09  6:05       ` Mike Galbraith
@ 2010-03-09 13:27         ` Mike Galbraith
  2010-03-10 21:14           ` Mike Galbraith
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Galbraith @ 2010-03-09 13:27 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Tue, 2010-03-09 at 07:05 +0100, Mike Galbraith wrote:
> On Mon, 2010-03-08 at 14:24 -0800, Suresh Siddha wrote:

> > Ok. I dropped that part in v2 patches that I just posted.
> 
> Yeah, I see mails (ramble ramble) crossed in the night.  I'll take them
> out for a spin, see if your box can get fixed up without busting mine :)

Ok, done, pretty darn boring testing.  My worries wrt ramp didn't
materialize, and without the sync+1=idle bit, box no longer hates your
patch ;)  It adds a tad of overhead though.

I tested it in my tinkered tree due to some things in virgin tip that I
think would have upset results.

tip = v2.6.33-6681-g91259c4
tip-x = tip + my patches
tip-xx = tip-x + your patches

netperf TCP_RR
unpinned
tip           64942.13    52203.28    92852.34   avg  69999.25   1.000
tip-x        107387.04   107334.34   107495.54   avg 107405.64   1.534   1.000
tip-xx       104602.54   104183.84   104095.35   avg 104293.91   1.489    .971

pinned
tip          101044.77    99478.75   100091.53   avg 100205.01   1.000
tip-x        103767.53   103138.71   103459.92   avg 103455.38   1.032   1.000
tip-xx       103465.68   103306.90   103189.46   avg 103320.68   1.031    .998

tbench 8
tip          1181.10   1173.81   1179.42   avg 1178.11   1.000
tip-x        1212.81   1214.73   1212.99   avg 1213.51   1.030   1.000
tip-xx       1212.56   1208.96   1207.57   avg 1209.69   1.026    .996

mysql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          10364.73   19210.43   37055.69   36724.26   35797.53   34806.00   32533.02   27463.10   21331.35
              8499.12   19438.84   36873.36   36574.78   35712.63   34763.84   32086.15   27325.47   20845.97
              9707.81   19431.68   36940.02   36495.55   35590.46   34612.24   32125.14   27124.94   21228.64
tip avg       9523.88   19360.31   36956.35   36598.19   35700.20   34727.36   32248.10   27304.50   21135.32

tip-x        11143.63   20398.50   36954.97   36515.57   35877.48   35101.78   32974.35   28018.32   22172.95
             11134.06   20615.39   36873.60   36541.21   35957.70   34954.68   32613.53   27737.31   21792.48
             11156.34   20481.79   36773.47   36473.38   35778.64   34867.88   32485.94   27705.44   22024.18
tip-x avg    11144.67   20498.56   36867.34   36510.05   35871.27   34974.78   32691.27   27820.35   21996.53
vs tip          1.170      1.058       .997       .997      1.004      1.007      1.013      1.018      1.040

tip-xx       11208.50   20952.28   36842.91   36495.95   35642.27   35049.26   32566.01   27096.20   22058.94
             11176.96   21123.91   37334.72   36849.41   36298.53   35269.21   32625.55   27248.00   21448.19
             11112.04   21042.49   37317.42   36835.94   36147.29   35144.11   32695.83   27393.21   21881.37
tip-xx avg   11165.83   21039.56   37165.01   36727.10   36029.36   35154.19   32629.13   27245.80   21796.16
vs tip          1.172      1.086      1.005      1.003      1.009      1.012      1.011       .997      1.031
vs tip-x        1.001      1.026      1.008      1.005      1.004      1.005       .998       .979       .990

pgsql+oltp
clients             1          2          4          8         16         32         64        128        256
tip          15408.13   30181.76   53244.49   52684.00   52306.08   51115.08   50156.59   48907.81   45657.03
             15658.69   30379.33   53308.72   52879.11   52169.98   51080.25   50023.30   48758.86   46009.55
             15534.79   30343.67   53030.92   52437.59   52063.99   50897.84   50073.97   48720.65   45990.12
tip avg      15533.87   30301.58   53194.71   52666.90   52180.01   51031.05   50084.62   48795.77   45885.56

tip-x        15897.83   31418.82   54116.88   53712.74   52628.43   51310.64   50624.15   49291.35   46430.55
             15848.06   30702.00   53166.82   52935.46   51859.48   51024.90   50123.57   48739.02   46033.40
             16043.14   30919.32   53576.71   53400.79   52525.99   51635.50   50515.64   49095.82   46374.42
tip-x avg    15929.67   31013.38   53620.13   53349.66   52337.96   51323.68   50421.12   49042.06   46279.45
vs tip          1.025      1.023      1.007      1.012      1.003      1.005      1.006      1.005      1.008 

tip-xx       16071.66   31145.83   53953.67   53280.52   52476.55   51472.32   50219.63   49016.45   46280.48
             15899.41   30769.93   53051.25   53006.46   52000.32   51151.68   50062.48   48539.12   45634.23
             16019.03   31035.88   53850.74   53086.54   52136.27   51246.17   50432.08   48544.87   45693.85
tip-xx avg   15996.70   30983.88   53618.55   53124.50   52204.38   51290.05   50238.06   48700.14   45869.52
vs tip          1.075      1.012       .879      1.001      1.006      1.022      1.010      1.013      1.074
vs tip-x        1.004       .999       .999       .995       .997       .999       .996       .993       .991


Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
marge    2.6.34-tip-smp 1.3400 1.4400 1.4000 2.1200 1.6400 2.26000 1.68000
marge    2.6.34-tip-smp 1.3600 1.3700 1.2700 2.1000 1.5600 2.26000 1.61000
marge    2.6.34-tip-smp 1.3800 1.3400 1.4000 2.1000 1.6400 2.29000 1.83000
marge   2.6.34-tip-smpx 1.0600 1.0900 1.0800 1.8200 1.3000 1.82000 1.33000
marge   2.6.34-tip-smpx 1.0700 1.1000 1.1200 1.8000 1.3100 1.83000 1.37000
marge   2.6.34-tip-smpx 1.0600 1.0900 1.1100 1.7800 1.2600 1.83000 1.38000
marge  2.6.34-tip-smpxx 1.1000 1.0900 1.1100 1.8000 1.3100 1.86000 1.37000
marge  2.6.34-tip-smpxx 1.1000 1.1000 1.1000 1.8000 1.3100 1.84000 1.35000
marge  2.6.34-tip-smpxx 1.1000 1.0800 1.1100 1.8100 1.3000 1.85000 1.36000

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
marge    2.6.34-tip-smp 1.340 4.179 4.42 8.959  12.1  27.4  14.9  18.
marge    2.6.34-tip-smp 1.360 4.369 4.61 8.868  12.0  26.9  14.9  18.
marge    2.6.34-tip-smp 1.380 4.364 4.70 8.931  11.9  25.6  14.9  18.
marge   2.6.34-tip-smpx 1.060 3.486 6.16 7.635  10.9 9.422  13.7  18.
marge   2.6.34-tip-smpx 1.070 3.456 5.93 7.677  11.0 9.412  13.7  18.
marge   2.6.34-tip-smpx 1.060 3.474 6.12 7.544  10.9 9.495  13.7  18.
marge  2.6.34-tip-smpxx 1.100 3.556 6.30 7.911  11.3 9.756  14.0  18.
marge  2.6.34-tip-smpxx 1.100 3.563 6.12 7.879  11.4 9.806  14.2  18.
marge  2.6.34-tip-smpxx 1.100 3.553 6.17 7.908  11.3 9.779  14.0  18.

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
marge    2.6.34-tip-smp 1756 5168 754. 2818.3 4815.3 1243.7 1233.1 4342 1686.
marge    2.6.34-tip-smp 1778 5193 1152 2809.7 4748.9 1242.3 1236.1 4454 1679.
marge    2.6.34-tip-smp 1763 5161 1141 2801.5 4731.5 1246.6 1233.7 4493 1683.
marge   2.6.34-tip-smpx 2790 5230 2979 2838.2 4833.4 1322.5 1310.3 4488 1749.
marge   2.6.34-tip-smpx 2726 5222 2980 2825.6 4770.3 1336.4 1320.7 4447 1741.
marge   2.6.34-tip-smpx 2800 5214 2973 2817.3 4768.1 1383.3 1359.4 4447 1746.
marge  2.6.34-tip-smpxx 2549 5228 2931 2830.9 4795.9 1236.3 1229.3 4384 1686.
marge  2.6.34-tip-smpxx 2568 5212 740. 2833.0 4782.2 1239.2 1235.3 4462 1683.
marge  2.6.34-tip-smpxx 2608 5219 2949 2845.9 4806.4 1243.5 1232.7 4474 1683.




* Re: [patch 2/2] sched: fix select_idle_sibling() logic in select_task_rq_fair()
  2010-03-09 13:27         ` Mike Galbraith
@ 2010-03-10 21:14           ` Mike Galbraith
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Galbraith @ 2010-03-10 21:14 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Ingo Molnar, Arjan van de Ven, linux-kernel,
	Vaidyanathan Srinivasan, Yanmin Zhang, Gautham R Shenoy

On Tue, 2010-03-09 at 14:27 +0100, Mike Galbraith wrote:
> On Tue, 2010-03-09 at 07:05 +0100, Mike Galbraith wrote:
> > On Mon, 2010-03-08 at 14:24 -0800, Suresh Siddha wrote:
> 
> > > Ok. I dropped that part in v2 patches that I just posted.
> > 
> > Yeah, I see mails (ramble ramble) crossed in the night.  I'll take them
> > out for a spin, see if your box can get fixed up without busting mine :)
> 
...

BTW, does the below also cure your netbook's woes?  (should)


sched: fix select_idle_sibling()

Don't bother with selection when the current cpu is idle.  Recent load
balancing changes also make it no longer necessary to check wake_affine()
success before returning the selected sibling, so we now always use it.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>

---
 kernel/sched_fair.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1438,7 +1438,7 @@ static int select_task_rq_fair(struct ta
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
-	int want_affine = 0;
+	int want_affine = 0, cpu_idle = !current->pid;
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
@@ -1496,13 +1496,15 @@ static int select_task_rq_fair(struct ta
 			 * If there's an idle sibling in this domain, make that
 			 * the wake_affine target instead of the current cpu.
 			 */
-			if (tmp->flags & SD_SHARE_PKG_RESOURCES)
+			if (!cpu_idle && tmp->flags & SD_SHARE_PKG_RESOURCES)
 				target = select_idle_sibling(p, tmp, target);
 
 			if (target >= 0) {
 				if (tmp->flags & SD_WAKE_AFFINE) {
 					affine_sd = tmp;
 					want_affine = 0;
+					if (target != cpu)
+						cpu_idle = 1;
 				}
 				cpu = target;
 			}
@@ -1518,6 +1520,7 @@ static int select_task_rq_fair(struct ta
 			sd = tmp;
 	}
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
 	if (sched_feat(LB_SHARES_UPDATE)) {
 		/*
 		 * Pick the largest domain to update shares over
@@ -1531,9 +1534,12 @@ static int select_task_rq_fair(struct ta
 		if (tmp)
 			update_shares(tmp);
 	}
+#endif
 
-	if (affine_sd && wake_affine(affine_sd, p, sync))
-		return cpu;
+	if (affine_sd) {
+		if (cpu_idle || cpu == prev_cpu || wake_affine(affine_sd, p, sync))
+			return cpu;
+	}
 
 	while (sd) {
 		int load_idx = sd->forkexec_idx;



