linux-kernel.vger.kernel.org archive mirror
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
       [not found] <1329764866.2293.376.camel@twins>
@ 2012-03-05 15:24 ` Srivatsa Vaddagiri
  2012-03-06  9:14   ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-05 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:

> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
> 
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

I did some experiments with volanomark and it does turn out to be
sensitive to SD_BALANCE_WAKE, while the other wake-heavy benchmark that I am
dealing with (Trade) benefits from it.

Normalized results for both benchmarks are provided below.

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at b86148a) 

	   	   Before patch	    After patch

Trade thr'put		1		2.17 (~117% improvement)
volanomark		1		0.8  (20% degradation)


Quick description of benchmarks 
===============================

Trade was run inside an 8-vcpu VM (cgroup). Four other 4-vcpu VMs running
cpu hogs were also present, leading to this cgroup setup:

	/cgroup/sys (1024 shares - hosts all system tasks)
	/cgroup/libvirt (20000 shares)
	/cgroup/libvirt/qemu/VM1 (8192 cpu shares)
	/cgroup/libvirt/qemu/VM2-5 (1024 shares)

Volanomark server/client programs were run in root cgroup.

The patch essentially does balance on wake: it looks for any idle cpu in
the same cache domain as the task's prev_cpu (or cur_cpu if wake_affine
obliges); failing that, it looks for the least loaded cpu. This helps
minimize latencies for the Trade workload (and thus boosts its score). For
volanomark, it seems to hurt because of waking on a colder L2 cache.  The
tradeoff seems to be between latency and cache misses. Short of adding
another tunable, are there better suggestions on how we can address this
sort of tradeoff?

Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |   26 +++++++++++++++++++++-----
 2 files changed, 23 insertions(+), 7 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2638,7 +2638,7 @@ static int select_idle_sibling(struct ta
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
+	int i, some_idle_cpu = -1;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2661,15 +2661,25 @@ static int select_idle_sibling(struct ta
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int skip = 0;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+							 tsk_cpus_allowed(p)) {
+				if (!idle_cpu(i)) {
+					if (some_idle_cpu >= 0)
+						goto next;
+					skip = 1;
+				} else
+					some_idle_cpu = i;
 			}
 
+			if (skip)
+				goto next;
+
 			target = cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
 			goto done;
@@ -2677,6 +2687,9 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	if (some_idle_cpu >= 0)
+		target = some_idle_cpu;
 done:
 	return target;
 }
@@ -2766,7 +2779,10 @@ select_task_rq_fair(struct task_struct *
 			prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-05 15:24 ` sched: Avoid SMT siblings in select_idle_sibling() if possible Srivatsa Vaddagiri
@ 2012-03-06  9:14   ` Ingo Molnar
  2012-03-06 10:03     ` Srivatsa Vaddagiri
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 49+ messages in thread
From: Ingo Molnar @ 2012-03-06  9:14 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:
> 
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:

> > > Enabling SD_BALANCE_WAKE used to be decidedly too 
> > > expensive to consider. Maybe that has changed, but I doubt 
> > > it.
> > 
> > Right, I thought I remembered some such, you could see it 
> > on wakeup heavy things like pipe-bench and that java msg 
> > passing thing, right?
> 
> I did some experiments with volanomark and it does turn out to 
> be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> benchmark that I am dealing with (Trade) benefits from it.

Does volanomark still do yield(), thereby invoking a random 
shuffle of thread scheduling and pretty much voluntarily 
ejecting itself from most scheduler performance considerations?

If it uses a real locking primitive such as futexes then its 
performance matters more.

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-06  9:14   ` Ingo Molnar
@ 2012-03-06 10:03     ` Srivatsa Vaddagiri
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-06 10:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> Does volanomark still do yield(),

I could count 3.5M yield() calls across 16 cpus over a period of 5.4 sec,
so yes, it does seem to make very good use of yield()!
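
For reference, one way to obtain such a count (not necessarily how it was
measured here) is the sched_yield syscall tracepoint, assuming a kernel
built with tracepoint support:

	perf stat -e syscalls:sys_enter_sched_yield -a sleep 5

which counts sched_yield() calls machine-wide for ~5 seconds.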

> thereby invoking a random 
> shuffle of thread scheduling and pretty much voluntarily 
> ejecting itself from most scheduler performance considerations?
> 
> If it uses a real locking primitive such as futexes then its 
> performance matters more.

Good point .. I will gather more data for additional benchmarks
(tbench, sysbench) and post shortly.

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-06  9:14   ` Ingo Molnar
  2012-03-06 10:03     ` Srivatsa Vaddagiri
@ 2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
                         ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-22 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> > I did some experiments with volanomark and it does turn out to 
> > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> > benchmark that I am dealing with (Trade) benefits from it.
> 
> Does volanomark still do yield(), thereby invoking a random 
> shuffle of thread scheduling and pretty much voluntarily 
> ejecting itself from most scheduler performance considerations?
> 
> If it uses a real locking primitive such as futexes then its 
> performance matters more.

Some more interesting results on a more recent tip kernel.

Machine : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
Kernel  : tip (HEAD at ee415e2)
guest VM : 2.6.18 linux kernel based enterprise guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
 	 various cgroups) are run on host. Cgroup setup is as below:

	/sys (cpu.shares = 1024, hosts all system tasks)
	/libvirt (cpu.shares = 20000)
 	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
	/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)
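
For reference, shares like these are set through the cgroup cpu
controller's cpu.shares file; e.g., assuming the hierarchy is mounted
at /cgroup:

	echo 20000 > /cgroup/libvirt/cpu.shares
	echo 8192 > /cgroup/libvirt/qemu/VM/cpu.shares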

First BM (bare metal) scenario:

		tip	tip + patch

volano 		1	0.955   (4.5% degradation)
sysbench [n1] 	1	0.9984  (0.16% degradation)
tbench 1 [n2]	1	0.9096  (9% degradation)

Now the more interesting VM scenario:

		tip	tip + patch

volano		1	1.29   (29% improvement)
sysbench [n3]	1	2      (100% improvement)
tbench 1 [n4] 	1       1.07   (7% improvement)
tbench 8 [n5] 	1       1.26   (26% improvement)
httperf  [n6]	1 	1.05   (5% improvement)
Trade		1	1.31   (31% improvement)

Notes:
 
n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client 
n3. sysbench was run with 8 threads
n4. tbench was run on localhost with 1 client
n5. tbench was run over network with 8 clients
n6. httperf was run with a burst-length of 100 and wsess of 100,500,0
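
For concreteness, an httperf invocation matching note n6 would look roughly
like this (the server name is a placeholder; the exact command line used
was not posted):

	httperf --server=<host> --wsess=100,500,0 --burst-length=100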

So the patch seems to be an across-the-board win when VCPU threads are waking
up (in a highly contended environment). One reason could be that the assumption 
of better cache hits from running (vcpu) threads on their prev_cpu may not
be fully correct, as vcpu threads could represent many different threads
internally?

Anyway, there are degradations as well; to address them I see several 
possibilities:

1. Do balance-on-wake for vcpu threads only.
2. Document tuning possibility to improve performance in virtualized
   environment:
	- Either via sched_domain flags (disable SD_WAKE_AFFINE 
 	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
	- Or via a new sched_feat(BALANCE_WAKE) tunable
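
As an illustration of the sched_domain flags route: with SCHED_DEBUG
enabled, per-domain flags are writable under /proc (the same mechanism
used elsewhere in this thread; the numeric values encode SD_* bits and are
specific to the kernel version and box, so they are indicative only):

	cd /proc/sys/kernel/sched_domain/
	for i in `find . -name flags | grep domain1`; do echo 4671 > $i; done
	for i in `find . -name flags | grep domain0`; do echo 703 > $i; done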

Any other thoughts or suggestions for more experiments?


--

Balance threads on wakeup to let them run on the least-loaded CPU in the same
cache domain as their prev_cpu (or cur_cpu if the wake_affine() test obliges).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |    5 ++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2766,7 +2766,10 @@ select_task_rq_fair(struct task_struct *
 			prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
@ 2012-03-23  6:38       ` Mike Galbraith
  2012-03-26  8:29       ` Peter Zijlstra
  2012-03-26  8:36       ` Peter Zijlstra
  2 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-03-23  6:38 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote: 
> * Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:
> 
> > > I did some experiments with volanomark and it does turn out to 
> > > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> > > benchmark that I am dealing with (Trade) benefits from it.
> > 
> > Does volanomark still do yield(), thereby invoking a random 
> > shuffle of thread scheduling and pretty much voluntarily 
> > ejecting itself from most scheduler performance considerations?
> > 
> > If it uses a real locking primitive such as futexes then its 
> > performance matters more.
> 
> Some more interesting results on more recent tip kernel.

Yeah, interesting.  I find myself ever returning to this message, as
gears grind away trying to imagine what's going on in those vcpus.

> Machine : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
> Kernel  : tip (HEAD at ee415e2)
> guest VM : 2.6.18 linux kernel based enterprise guest
> 
> Benchmarks are run in two scenarios:
> 
> 1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
> 2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
>  	 various cgroups) are run on host. Cgroup setup is as below:
> 
> 	/sys (cpu.shares = 1024, hosts all system tasks)
> 	/libvirt (cpu.shares = 20000)
>  	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
> 	/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)
> 
> First BM (bare metal) scenario:
> 
> 		tip	tip + patch
> 
> volano 		1	0.955   (4.5% degradation)
> sysbench [n1] 	1	0.9984  (0.16% degradation)
> tbench 1 [n2]	1	0.9096  (9% degradation)

Those make sense, fast path cycles added.

> Now the more interesting VM scenario:
> 
> 		tip	tip + patch
> 
> volano		1	1.29   (29% improvement)
> sysbench [n3]	1	2      (100% improvement)
> tbench 1 [n4] 	1       1.07   (7% improvement)
> tbench 8 [n5] 	1       1.26   (26% improvement)
> httperf  [n6]	1 	1.05   (5% improvement)
> Trade		1	1.31   (31% improvement)
> 
> Notes:
>  
> n1. sysbench was run with 16 threads.
> n2. tbench was run on localhost with 1 client 
> n3. sysbench was run with 8 threads
> n4. tbench was run on localhost with 1 client
> n5. tbench was run over network with 8 clients
> n6. httperf was run with a burst-length of 100 and wsess of 100,500,0
> 
> So the patch seems to be an across-the-board win when VCPU threads are waking
> up (in a highly contended environment). One reason could be that the assumption 
> of better cache hits from running (vcpu) threads on their prev_cpu may not
> be fully correct, as vcpu threads could represent many different threads
> internally?
> 
> Anyway, there are degradations as well; to address them I see several 
> possibilities:
> 
> 1. Do balance-on-wake for vcpu threads only.

That's what your numbers say to me with this patch.  I'm not getting the
why, but your patch appears to reduce vcpu internal latencies hugely.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
> 	- Either via sched_domain flags (disable SD_WAKE_AFFINE 
>  	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
> 	- Or via a new sched_feat(BALANCE_WAKE) tunable
> 
> Any other thoughts or suggestions for more experiments?

Other than nuke select_idle_sibling() entirely instead, none here.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
@ 2012-03-26  8:29       ` Peter Zijlstra
  2012-03-26  8:36       ` Peter Zijlstra
  2 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-03-26  8:29 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> Anyway, there are degradations as well; to address them I see
> several 
> possibilities:
> 
> 1. Do balance-on-wake for vcpu threads only.

Hell no ;-) we're not going to special case some threads over others.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
>         - Either via sched_domain flags (disable SD_WAKE_AFFINE 
>           at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)

But domain flags are not exported -- except under SCHED_DEBUG and that
sysctl mess.. also SD_flags are not stable.

>         - Or via a new sched_feat(BALANCE_WAKE) tunable 

sched_feat() is not a stable ABI and shouldn't ever be used for anything
but debugging (hence it lives in debugfs and goes away if you disable
SCHED_DEBUG).


I would very much like more information on why things are a loss. Is it
really related to what cpu you pick, or is it the cost of doing the
balance thing?


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
  2012-03-26  8:29       ` Peter Zijlstra
@ 2012-03-26  8:36       ` Peter Zijlstra
  2012-03-26 17:35         ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-03-26  8:36 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> 
> Now the more interesting VM scenario:
> 
More interesting for whom? I still think virt is an utter waste of
time ;-)

>                 tip     tip + patch
> 
> volano          1       1.29   (29% improvement)
> sysbench [n3]   1       2      (100% improvement)
> tbench 1 [n4]   1       1.07   (7% improvement)
> tbench 8 [n5]   1       1.26   (26% improvement)
> httperf  [n6]   1       1.05   (5% improvement)
> Trade           1       1.31   (31% improvement) 

That smells like there's more to the story, a 100% improvement is too
much..


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26  8:36       ` Peter Zijlstra
@ 2012-03-26 17:35         ` Srivatsa Vaddagiri
  2012-03-26 18:06           ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-26 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-03-26 10:36:00]:

> >                 tip     tip + patch 
> > 
> > volano          1       1.29   (29% improvement)
> > sysbench [n3]   1       2      (100% improvement)
> > tbench 1 [n4]   1       1.07   (7% improvement)
> > tbench 8 [n5]   1       1.26   (26% improvement)
> > httperf  [n6]   1       1.05   (5% improvement)
> > Trade           1       1.31   (31% improvement) 
> 
> That smells like there's more to the story, a 100% improvement is too
> much..

Yeah, I have rubbed my eyes several times to make sure it's true, and ran
the same benchmark (sysbench) again just now! I can recreate that ~100%
improvement with the patch even now.

To quickly recap my environment, I have a 16-cpu machine w/ 5 cgroups.
One cgroup (8192 shares) hosts sysbench inside an 8-vcpu VM, while the
remaining 4 cgroups (1024 shares each) each host 4 cpu hogs running on bare
metal. Given this overcommitment, select_idle_sibling() should mostly be a 
no-op (i.e. it won't find any idle cores and thus defaults to prev_cpu).
Also, the only tasks that will (sleep and) wake up are the VM tasks.

Since the patch potentially affects (improves) scheduling latencies, I measured 
Sum(se.statistics.wait_sum) for the VM tasks over the benchmark run (5
iterations of sysbench).

tip	    : 987240 ms
tip + patch : 280275 ms 
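
wait_sum here comes from the per-task schedstats; with CONFIG_SCHEDSTATS
enabled it can be read per thread and summed, roughly as below, where
<vm-thread-pids> stands in for the pids of the VM's threads:

	for pid in <vm-thread-pids>; do
		grep wait_sum /proc/$pid/sched
	done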

I will get more comprehensive perf data shortly and post it. 

From what I can tell, the huge improvement in the benchmark score is coming from 
reduced latencies for the VM tasks. 

The hard part to figure out (when we are inside select_task_rq_fair()) is 
whether any potential improvement in latencies (because of waking up on a
less loaded cpu) will outweigh the cost of potentially more L2-cache misses, 
for which IMHO we don't have enough data to make a good decision.

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26 17:35         ` Srivatsa Vaddagiri
@ 2012-03-26 18:06           ` Peter Zijlstra
  2012-03-27 13:56             ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-03-26 18:06 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Mon, 2012-03-26 at 23:05 +0530, Srivatsa Vaddagiri wrote:
> 
> From what I can tell, the huge improvement in benchmark score is coming from 
> reduced latencies for its VM tasks. 

But if the machine is pegged, latency should not impact throughput (since
there's always work to do), so are you creating extra idle time some
place?

Are you running against lock-inversion in the vcpus? Or that tlb
shootdown issue we had in the gang scheduling thread? Both are typically
busy-wait time, which is of course harder to spot than actual idle
time :/

Then again, reducing latency is good, so I don't object to that per-se,
but that flips the question, why does it regress those other loads?

The biggest regression came from tbench, wasn't that mostly a random
number generator anyway? How stable are those results, do you have a
variance measure on the results?


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26 18:06           ` Peter Zijlstra
@ 2012-03-27 13:56             ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-03-27 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Suresh Siddha, linux-kernel,
	Paul Turner

On Mon, 2012-03-26 at 20:06 +0200, Peter Zijlstra wrote:

> The biggest regression came from tbench, wasn't that mostly a random
> number generator anyway?

That's dbench ;-)  Tbench does wiggle around more than you'd like
though.  I've never tracked it, but from lots of running the thing, 2-3% you
can forget about chasing, and I once bisected a 5% regression to an unrelated
change, reverted and re-applied the suspect a few times in disbelief,
and it tracked.  You can't trust it _too_ much.  I consider it a
reliable indicator that I'd better do more testing.  If others agree,
it's a real regression; if not, it's those damn Elves 'n Gremlins ;-)

-Mike




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-27 22:11                               ` Suresh Siddha
@ 2012-02-28  5:05                                 ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-02-28  5:05 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Paul Turner

On Mon, 2012-02-27 at 14:11 -0800, Suresh Siddha wrote: 
> On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> > My less rotund config shows the L2 penalty decidedly more prominently.
> > We used to have avg_overlap as a synchronous wakeup hint, but it was
> > broken by preemption and whatnot, got the axe to recover some cycles.  A
> > reliable and dirt cheap replacement would be a good thing to have.
> > 
> > TCP_RR and tbench are far away from the overlap breakeven point on
> > E5620, whereas with Q6600's shared L2, you can start converting overlap
> > into throughput almost immediately. 
> > 
> > 2.4 GHz E5620
> > Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> > Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> > 
> > 2.4 GHz Q6600
> > Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> > Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> > 
> 
> Also it is not always just about the L2 cache being shared or not, or
> warm vs. cold, etc. It also depends on the core C-states/P-states etc.
> Waking up an idle core has a cost, and that cost will depend on what
> core C-state it is in. And also, if we ping-pong between cores often, the
> cpufreq governor will come in and request a lower core P-state even
> though the load was keeping one core or the other in the socket always
> busy at any given point of time.

Yeah, pinning yields a couple percent on the Q6600 box, more on the E5620
despite its spiffier gearbox.. likely turbo-boost doing its thing.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-25  8:30                             ` Mike Galbraith
@ 2012-02-27 22:11                               ` Suresh Siddha
  2012-02-28  5:05                                 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2012-02-27 22:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Paul Turner

On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> My less rotund config shows the L2 penalty decidedly more prominently.
> We used to have avg_overlap as a synchronous wakeup hint, but it was
> broken by preemption and whatnot, got the axe to recover some cycles.  A
> reliable and dirt cheap replacement would be a good thing to have.
> 
> TCP_RR and tbench are far away from the overlap breakeven point on
> E5620, whereas with Q6600's shared L2, you can start converting overlap
> into throughput almost immediately. 
> 
> 2.4 GHz E5620
> Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> 
> 2.4 GHz Q6600
> Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> 

Also it is not always just about the L2 cache being shared or not, or
warm vs. cold, etc. It also depends on the core C-states/P-states etc.
Waking up an idle core has a cost, and that cost will depend on what
core C-state it is in. And also, if we ping-pong between cores often, the
cpufreq governor will come in and request a lower core P-state even
though the load was keeping one core or the other in the socket always
busy at any given point of time.

thanks,
suresh



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-25  6:54                           ` Srivatsa Vaddagiri
@ 2012-02-25  8:30                             ` Mike Galbraith
  2012-02-27 22:11                               ` Suresh Siddha
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-25  8:30 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Sat, 2012-02-25 at 12:24 +0530, Srivatsa Vaddagiri wrote: 
> * Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:
> 
> > Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
> > heavy should tell the tail.
> 
> Here are some tbench numbers:
> 
> Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
> Kernel  : tip (HEAD at ebe97fa)
> dbench  : v4.0
> 
> One tbench server/client pair was run on the same host 5 times (with
> the fs-cache being purged each time) and the average of the 5 runs for the
> various cases is noted below:
> 
> Case A : HT enabled (24 logical CPUs)
> 
> Thr'put	 : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put	 : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
> Thr'put	 : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> 
> Case B : HT disabled (12 logical CPUs)
> 
> Thr'put	 : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put	 : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
> Thr'put	 : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> 
> Observations:
> 
> a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess 
>   reflects the cost of waking to a cold L2 cache.
> 
> b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

I haven't done a lot of testing, but yeah, the little I have doesn't
show SD_BALANCE_WAKE making much difference on single socket boxen.

> IMO we need to detect tbench-type paired wakeups as the synchronous case,
> in which case blindly wake the task up on cur_cpu (as the cost of an L2
> cache miss could outweigh the benefit of any reduced scheduling latencies).
> 
> IOW, select_task_rq_fair() needs to be given a better hint as to whether the
> L2 cache has been made warm by someone (an interrupt handler or a producer
> task), in which case the (consumer) task needs to be woken in the same L2
> cache domain (i.e. on cur_cpu itself)?

My less rotund config shows the L2 penalty decidedly more prominently.
We used to have avg_overlap as a synchronous wakeup hint, but it was
broken by preemption and whatnot, got the axe to recover some cycles.  A
reliable and dirt cheap replacement would be a good thing to have.

TCP_RR and tbench are far away from the overlap breakeven point on
E5620, whereas with Q6600's shared L2, you can start converting overlap
into throughput almost immediately. 

2.4 GHz E5620
Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

2.4 GHz Q6600
Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:21                         ` Mike Galbraith
@ 2012-02-25  6:54                           ` Srivatsa Vaddagiri
  2012-02-25  8:30                             ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-25  6:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:

> Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
> heavy should tell the tail.

Here are some tbench numbers:

Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
Kernel  : tip (HEAD at ebe97fa)
dbench  : v4.0

One tbench server/client pair was run on the same host 5 times (with
the fs-cache being purged each time) and the average of the 5 runs for the
various cases is noted below:

Case A : HT enabled (24 logical CPUs)

Thr'put	 : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put	 : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
Thr'put	 : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Case B : HT disabled (12 logical CPUs)

Thr'put	 : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put	 : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
Thr'put	 : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
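
For reference, each such loopback pair would be started roughly as below
(the exact options used were not posted, so this is indicative only):

	tbench_srv &
	tbench -t 60 1 localhost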

Observations:

a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess 
  reflects the cost of waking to a cold L2 cache.

b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

IMO we need to detect tbench-type paired wakeups as the synchronous case,
in which case blindly wake the task up on cur_cpu (as the cost of an L2
cache miss could outweigh the benefit of any reduced scheduling latencies).

IOW, select_task_rq_fair() needs to be given a better hint as to whether the
L2 cache has been made warm by someone (an interrupt handler or a producer
task), in which case the (consumer) task needs to be woken in the same L2
cache domain (i.e. on cur_cpu itself)?

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:32                             ` Srivatsa Vaddagiri
@ 2012-02-23 16:17                               ` Ingo Molnar
  0 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2012-02-23 16:17 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]:
> 
> > pro perf tip of the day: did you know that it's possible to run:
> > 
> >   perf stat --repeat 10 --null perf bench sched pipe
> 
> Ah yes .. had forgotten that! What is the recommended # of 
> iterations to run?

Well, I start at 3 or 5 and sometimes increase it to get a 
meaningful stddev - i.e. one that is smaller than half of the 
measured effect. (for cases where I expect an improvement and 
want to measure it.)

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:19                         ` Ingo Molnar
@ 2012-02-23 12:18                           ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 12:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-02-23 12:19:29]:

> sysbench is one of the best ones punishing bad scheduler 
> balancing mistakes.

Here are the sysbench oltp scores on the same machine, i.e.:

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at 6241cc8)
sysbench : 0.4.12

sysbench was run with 16 threads as:

./sysbench --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=500000 --mysql-socket=/var/lib/mysql/mysql.sock --oltp-read-only --mysql-user=root --mysql-password=blah run

sysbench was run 5 times with fs-cache being purged before each run
(echo 3 > /proc/sys/vm/drop_caches).

The average of 5 runs along with % std. dev. is noted for the various OLTP stats:

				SD_BALANCE_WAKE		SD_BALANCE_WAKE	
				disabled		enabled

transactions (per sec)		4833.826 (+- 0.75%)	4837.354 (+- 1%)
read/write requests (per sec)	67673.580 (+- 0.75%)	67722.960 (+- 1%)
other ops (per sec)		9667.654 (+- 0.75%)	9674.710 (+- 1%)

There is a minor improvement seen when SD_BALANCE_WAKE is enabled at the SMT/MC
domains, and no degradation observed with it enabled.

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:26                           ` Ingo Molnar
@ 2012-02-23 11:32                             ` Srivatsa Vaddagiri
  2012-02-23 16:17                               ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 11:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]:

> pro perf tip of the day: did you know that it's possible to run:
> 
>   perf stat --repeat 10 --null perf bench sched pipe

Ah yes .. had forgotten that! What is the recommended # of iterations 
to run?

> and get meaningful, normalized stddev calculations for free:
> 
>        5.486491487 seconds time elapsed                                          ( +-  5.50% )
> 
> the percentage at the end shows the level of standard deviation.
> 
> You can add "--sync" as well, which will cause perf stat to call 
> sync() - this gives extra stability of individual iterations and 
> makes sure all IO cost is accounted to the right run.

Ok ..thanks for the 'pro' tip!! Will use it in my subsequent runs!

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
@ 2012-02-23 11:26                           ` Ingo Molnar
  2012-02-23 11:32                             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2012-02-23 11:26 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> "perf bench sched pipe" was run 10 times and average ops/sec 
> score alongwith std. dev is noted as below.
> 
> 
> 		SD_BALANCE_WAKE		SD_BALANCE_WAKE
> 		 disabled		enabled
> 
> Avg. score	108984.900 		111844.300 (+2.6%)
> std dev		20383.457		21668.604

pro perf tip of the day: did you know that it's possible to run:

  perf stat --repeat 10 --null perf bench sched pipe

and get meaningful, normalized stddev calculations for free:

       5.486491487 seconds time elapsed                                          ( +-  5.50% )

the percentage at the end shows the level of standard deviation.

You can add "--sync" as well, which will cause perf stat to call 
sync() - this gives extra stability of individual iterations and 
makes sure all IO cost is accounted to the right run.

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
@ 2012-02-23 11:21                         ` Mike Galbraith
  2012-02-25  6:54                           ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-23 11:21 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2012-02-23 at 16:19 +0530, Srivatsa Vaddagiri wrote:

> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
> at SMT/MC domains.

That's an unexpected (but welcome) result.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)
> Java	: OpenJDK 1.6.0_20
> Volano  : 2_9_0
> 
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
> 
> 				Average score	std. dev
> 
> SD_BALANCE_WAKE disabled	369459.750	4825.046
> SD_BALANCE_WAKE enabled		379070.500	379070.5
> 
> I am going to try the pipe benchmark next. Do you have suggestions for any other 
> benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in 
> SMT/MC domains?

Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
heavy should tell the tail.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
@ 2012-02-23 11:20                         ` Srivatsa Vaddagiri
  2012-02-23 11:26                           ` Ingo Molnar
  2012-02-23 11:21                         ` Mike Galbraith
  2 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 11:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-02-23 16:19:59]:

> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.  (general aside: testing with a
> > bloated distro config is a big mistake)
> 
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE
> at SMT/MC domains.

Ditto with the pipe benchmark.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)

"perf bench sched pipe" was run 10 times and average ops/sec score
alongwith std. dev is noted as below.


		SD_BALANCE_WAKE		SD_BALANCE_WAKE
		 disabled		enabled

Avg. score	108984.900 		111844.300 (+2.6%)
std dev		20383.457		21668.604

Note:

SD_BALANCE_WAKE for the SMT/MC domains was enabled by writing into the
appropriate flags files under /proc:

	# cd /proc/sys/kernel/sched_domain/
	# for i in `find . -name flags | grep domain1`; do echo 4671 > $i; done
	# for i in `find . -name flags | grep domain0`; do echo 703 > $i; done

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
@ 2012-02-23 11:19                         ` Ingo Molnar
  2012-02-23 12:18                           ` Srivatsa Vaddagiri
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
  2012-02-23 11:21                         ` Mike Galbraith
  2 siblings, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2012-02-23 11:19 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:
> 
> > > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> > 
> > I really really need to find time to do systematic mainline testing.
> > 
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.  (general aside: testing with a
> > bloated distro config is a big mistake)
> 
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
> at SMT/MC domains.
> 
> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)
> Java	: OpenJDK 1.6.0_20
> Volano  : 2_9_0
> 
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
> 
> 				Average score	std. dev
> 
> SD_BALANCE_WAKE disabled	369459.750	4825.046
> SD_BALANCE_WAKE enabled		379070.500	379070.5
> 
> I am going to try the pipe benchmark next. Do you have suggestions 
> for any other benchmarks I should try to see the effect of 
> SD_BALANCE_WAKE turned on in SMT/MC domains?

sysbench is one of the best ones at punishing bad scheduler 
balancing mistakes.

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
  2012-02-20 19:07                       ` Peter Zijlstra
@ 2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
                                           ` (2 more replies)
  2 siblings, 3 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 10:49 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:

> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> 
> I really really need to find time to do systematic mainline testing.
> 
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.  (general aside: testing with a
> bloated distro config is a big mistake)

I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
at SMT/MC domains.

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at 6241cc8)
Java	: OpenJDK 1.6.0_20
Volano  : 2_9_0

Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
SMT/MC domains.

				Average score	std. dev

SD_BALANCE_WAKE disabled	369459.750	4825.046
SD_BALANCE_WAKE enabled		379070.500	379070.5

I am going to try the pipe benchmark next. Do you have suggestions for any other 
benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in 
SMT/MC domains?

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21 10:37                               ` Peter Zijlstra
@ 2012-02-21 14:58                                 ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-21 11:37:23]:

> http://www.volano.com/benchmarks.html

For some reason, volanomark stops working (the client hangs waiting for data
to arrive on a socket) in the 3rd iteration when run on the latest tip. I don't
see that with a 2.6.32-based kernel though. Checking which commit may have
caused this ..

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  9:21                             ` Mike Galbraith
@ 2012-02-21 10:37                               ` Peter Zijlstra
  2012-02-21 14:58                                 ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-21 10:37 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Srivatsa Vaddagiri, Suresh Siddha, linux-kernel, Ingo Molnar,
	Paul Turner

On Tue, 2012-02-21 at 10:21 +0100, Mike Galbraith wrote:

> I use vmark, find it _somewhat_ useful.  Not a lovely benchmark, but it
> is highly affinity sensitive, and switches heftily.  I don't put much
> value on it though, too extreme for me, but it is a ~decent indicator.
> 
> There are no doubt _lots_ better than vmark for java stuff.
> 
> I toss a variety pack at the box in obsessive-compulsive man mode when
> testing.  Which benchmarks doesn't matter much, just need to be wide
> spectrum and consistent.

http://www.volano.com/benchmarks.html

> No, I use Ingo's pipe-test, but that to measure fastpath overhead.

http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test.c

Also I think it lives as: perf bench sched pipe


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  8:32                           ` Srivatsa Vaddagiri
@ 2012-02-21  9:21                             ` Mike Galbraith
  2012-02-21 10:37                               ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-21  9:21 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2012-02-21 at 14:02 +0530, Srivatsa Vaddagiri wrote: 
> * Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]:
> 
> > On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> > > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > > Maybe that has changed, but I doubt it.
> > > 
> > > Right, I thought I remembered some such, you could see it on wakeup
> > > heavy things like pipe-bench and that java msg passing thing, right?
> > 
> > Yeah, it beat up switch heavy buddies pretty bad.
> 
> Do you have a pointer to the java benchmark? Also, is pipe-bench the same
> as the one described below?

I use vmark, find it _somewhat_ useful.  Not a lovely benchmark, but it
is highly affinity sensitive, and switches heftily.  I don't put much
value on it though, too extreme for me, but it is a ~decent indicator.

There are no doubt _lots_ better than vmark for java stuff.

I toss a variety pack at the box in obsessive-compulsive man mode when
testing.  Which benchmarks doesn't matter much, just need to be wide
spectrum and consistent.

> http://freecode.com/projects/pipebench

No, I use Ingo's pipe-test, but that to measure fastpath overhead.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  5:43                         ` Mike Galbraith
@ 2012-02-21  8:32                           ` Srivatsa Vaddagiri
  2012-02-21  9:21                             ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  8:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]:

> On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > Maybe that has changed, but I doubt it.
> > 
> > Right, I thought I remembered some such, you could see it on wakeup
> > heavy things like pipe-bench and that java msg passing thing, right?
> 
> Yeah, it beat up switch heavy buddies pretty bad.

Do you have a pointer to the java benchmark? Also, is pipe-bench the same
as the one described below?

http://freecode.com/projects/pipebench

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  6:37                           ` Mike Galbraith
@ 2012-02-21  8:09                             ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  8:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-21 07:37:14]:

> On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:
> 
> > fwiw the patch I had sent does a wakeup balance within prev_cpu's
> > cache_domain (and not outside).
> 
> Yeah, the testing I did was to turn on the flag and measure.  With single
> domain scans, maybe select_idle_sibling() could just go away.

Yeah that's what I was thinking. Essentially find_idlest_group/cpu
should accomplish that job pretty much for us ...

> It ain't free either.

Will find out how much difference it makes ..

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  0:06                         ` Srivatsa Vaddagiri
@ 2012-02-21  6:37                           ` Mike Galbraith
  2012-02-21  8:09                             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-21  6:37 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:

> fwiw the patch I had sent does a wakeup balance within prev_cpu's
> cache_domain (and not outside).

Yeah, the testing I did was to turn on the flag and measure.  With single
domain scans, maybe select_idle_sibling() could just go away.  It ain't
free either.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 19:07                       ` Peter Zijlstra
@ 2012-02-21  5:43                         ` Mike Galbraith
  2012-02-21  8:32                           ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-21  5:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
> 
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

Yeah, it beat up switch heavy buddies pretty bad.

-Mike




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:25                       ` Mike Galbraith
@ 2012-02-21  0:06                         ` Srivatsa Vaddagiri
  2012-02-21  6:37                           ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  0:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-20 19:25:43]:

> > I can give that a try for my benchmark and see how much it helps. My
> > suspicion is it will not fully solve the problem I have on hand.
> 
> I doubt it will either.  Your problem is when it doesn't succeed, but
> you have an idle core available in another domain.

fwiw the patch I had sent does a wakeup balance within prev_cpu's
cache_domain (and not outside). It handles the case where we don't have
any idle cpu/core within prev_cpu's cache domain, in which case we look
for the next best thing (the least loaded cpu). I did see good numbers with
that (for both my benchmark and sysbench).

More on this later in the day ..

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
@ 2012-02-20 19:07                       ` Peter Zijlstra
  2012-02-21  5:43                         ` Mike Galbraith
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 19:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.

Right, I thought I remembered some such, you could see it on wakeup
heavy things like pipe-bench and that java msg passing thing, right?




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
@ 2012-02-20 18:25                       ` Mike Galbraith
  2012-02-21  0:06                         ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-20 18:25 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Mon, 2012-02-20 at 20:33 +0530, Srivatsa Vaddagiri wrote: 
> * Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]:
> 
> > On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> > 
> > > ---
> > >  kernel/sched_fair.c |   10 ++--------
> > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > > 
> > > Index: linux-3.0-tip/kernel/sched_fair.c
> > > ===================================================================
> > > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > > +++ linux-3.0-tip/kernel/sched_fair.c
> > > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> > >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> > >  			if (idle_cpu(i)) {
> > >  				target = i;
> > > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > > +					continue;
> > >  				break;
> > >  			}
> > >  		}
> > > -
> > > -		/*
> > > -		 * Lets stop looking for an idle sibling when we reached
> > > -		 * the domain that spans the current cpu and prev_cpu.
> > > -		 */
> > > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > -			break;
> > >  	}
> > >  	rcu_read_unlock();
> > 
> > Mike, Suresh, did we ever get this sorted? I was looking at
> > select_idle_sibling() and it looks like we dropped this.
> > 
> > Also, did anybody ever get an answer from a HW guy on why sharing stuff
> > over SMT threads is so much worse than sharing it over proper cores? Its
> > not like this workload actually does anything concurrently.
> > 
> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> 
> From a quick scan of that code, it seems to prefer selecting an idle cpu
> in the same cache domain (vs selecting prev_cpu in absence of a core
> that is fully idle).

Yes, that was the sole purpose of select_idle_sibling() from square one.
If you can mobilize a CPU without eating cache penalty, this is most
excellent for load ramp-up.  The gain is huge over affine wakeup if
there is any overlap to regain, i.e. it's not a 100% synchronous load.

> I can give that a try for my benchmark and see how much it helps. My
> suspicion is it will not fully solve the problem I have on hand.

I doubt it will either.  Your problem is when it doesn't succeed, but
you have an idle core available in another domain.  That's a whole
different ball game.  Yeah, you can reap benefit by doing wakeup
balancing, but you'd better look very closely at the cost.  I haven't
been able to do that lately, so dunno what cost is in the here and now,
but it used to be _way_ too expensive to consider, just as unrestricted
idle balancing is, or high frequency load balancing in general is.

-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
@ 2012-02-20 18:15                       ` Peter Zijlstra
  2012-02-20 19:07                       ` Peter Zijlstra
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 18:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> I thought this was pretty much sorted.  We want to prefer core over
> sibling, because on at least some modern CPUs with L3, siblings suck
> rocks. 

Yeah, I've since figured out how it's supposed to (and supposedly does)
work. Suresh was a bit too clever and forgot to put a comment in, since
clearly it was obvious at the time ;-)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 14:41                   ` Peter Zijlstra
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
@ 2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
                                         ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-02-20 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 15:41 +0100, Peter Zijlstra wrote: 
> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> 
> > ---
> >  kernel/sched_fair.c |   10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> >  			if (idle_cpu(i)) {
> >  				target = i;
> > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > +					continue;
> >  				break;
> >  			}
> >  		}
> > -
> > -		/*
> > -		 * Lets stop looking for an idle sibling when we reached
> > -		 * the domain that spans the current cpu and prev_cpu.
> > -		 */
> > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > -			break;
> >  	}
> >  	rcu_read_unlock();
> 
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.

I thought this was pretty much sorted.  We want to prefer core over
sibling, because on at least some modern CPUs with L3, siblings suck
rocks.

> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores?

No.  My numbers on Westmere indicated to me that siblings do not share
L2, making them fairly worthless.  Hard facts we never got.

> Its
> not like this workload actually does anything concurrently.
> 
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

I really really need to find time to do systematic mainline testing.

Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
Maybe that has changed, but I doubt it.  (general aside: testing with a
bloated distro config is a big mistake)

-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 14:41                   ` Peter Zijlstra
@ 2012-02-20 15:03                     ` Srivatsa Vaddagiri
  2012-02-20 18:25                       ` Mike Galbraith
  2012-02-20 18:14                     ` Mike Galbraith
  1 sibling, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-20 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]:

> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> 
> > ---
> >  kernel/sched_fair.c |   10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> >  			if (idle_cpu(i)) {
> >  				target = i;
> > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > +					continue;
> >  				break;
> >  			}
> >  		}
> > -
> > -		/*
> > -		 * Lets stop looking for an idle sibling when we reached
> > -		 * the domain that spans the current cpu and prev_cpu.
> > -		 */
> > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > -			break;
> >  	}
> >  	rcu_read_unlock();
> 
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.
> 
> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores? Its
> not like this workload actually does anything concurrently.
> 
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

From a quick scan of that code, it seems to prefer selecting an idle cpu
in the same cache domain (vs selecting prev_cpu in absence of a core
that is fully idle). 
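
i.e., roughly this preference order on wakeup (a sketch of the quoted
code's behavior, not a quote of it):

	1. target itself, if it is already idle
	2. an idle cpu in the same cache domain, preferring a fully idle
	   core over an idle sibling thread of a busy core
	3. otherwise, fall back to target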

I can give that a try for my benchmark and see how much it helps. My
suspicion is it will not fully solve the problem I have on hand.

- vatsa


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-18 15:14                 ` Mike Galbraith
@ 2012-02-20 14:41                   ` Peter Zijlstra
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
  2012-02-20 18:14                     ` Mike Galbraith
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 14:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:

> ---
>  kernel/sched_fair.c |   10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> Index: linux-3.0-tip/kernel/sched_fair.c
> ===================================================================
> --- linux-3.0-tip.orig/kernel/sched_fair.c
> +++ linux-3.0-tip/kernel/sched_fair.c
> @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
>  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
>  			if (idle_cpu(i)) {
>  				target = i;
> +				if (sd->flags & SD_SHARE_CPUPOWER)
> +					continue;
>  				break;
>  			}
>  		}
> -
> -		/*
> -		 * Lets stop looking for an idle sibling when we reached
> -		 * the domain that spans the current cpu and prev_cpu.
> -		 */
> -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> -			break;
>  	}
>  	rcu_read_unlock();

Mike, Suresh, did we ever get this sorted? I was looking at
select_idle_sibling() and it looks like we dropped this.

Also, did anybody ever get an answer from a HW guy on why sharing stuff
over SMT threads is so much worse than sharing it over proper cores? Its
not like this workload actually does anything concurrently.

I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-18 15:12               ` Peter Zijlstra
@ 2011-11-18 15:26                 ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Fri, 2011-11-18 at 16:12 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote:
> > 
> > From: Suresh Siddha <suresh.b.siddha@intel.com>
> > Subject: sched: cleanup domain traversal in select_idle_sibling
> > 
> > Instead of going through the scheduler domain hierarchy multiple times
> > (for giving priority to an idle core over an idle SMT sibling in a busy
> > core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
> > flag and traverse the domain hierarchy down till we find an idle group.
> > 
> > This cleanup also addresses an issue reported by Mike where the recent
> > changes returned the busy thread even in the presence of an idle SMT
> > sibling in single socket platforms.
> > 
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > Cc: Mike Galbraith <efault@gmx.de> 
> 
> Thanks Suresh!

Tested-by: Mike Galbraith <efault@gmx.de>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 17:36               ` Suresh Siddha
@ 2011-11-18 15:14                 ` Mike Galbraith
  2012-02-20 14:41                   ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:14 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 09:36 -0800, Suresh Siddha wrote:
> On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> > On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > > Something like the below maybe, although I'm certain it all can be
> > > written much nicer indeed.
> > 
> > I'll give it a go.
> > 
> > Squabbling with bouncing buddies in an isolated and otherwise idle
> > cpuset ate my day.
> >  
> 
> Well, looks like I managed to have a similar issue in my patch too.
> Anyway, here is the updated, cleaned-up version of the patch ;)

Works fine.  However, unpinned buddies bounce more than with virgin
mainline.  I tried doing it differently (mikie in numbers below), and it
worked for a single unbound pair, but wrecked multiple unbound pairs.

---
 kernel/sched_fair.c |   10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

Index: linux-3.0-tip/kernel/sched_fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched_fair.c
+++ linux-3.0-tip/kernel/sched_fair.c
@@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
 		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
 			if (idle_cpu(i)) {
 				target = i;
+				if (sd->flags & SD_SHARE_CPUPOWER)
+					continue;
 				break;
 			}
 		}
-
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
 	}
 	rcu_read_unlock();
 
 

mikie2 is your patch + twiddles I'll post as a reply to this post.
  
kernel v3.2-rc1-306-g7f80850

TTWU_QUEUE off (skews results), test in cpuset 1-3,5-7

Test1: one unbound TCP_RR pair, three runs

virgin      66611.73      71376.00      61297.09                       avg 66428.27   1.000
suresh      68488.88      68412.48      68149.73 (bounce)                  68350.36   1.028
mikie       75925.91      75851.63      74617.29 (bounce--)                75464.94   1.136
mikie2      71403.39      71396.73      72258.91 NO_SIBLING_LIMIT_SYNC     71686.34   1.079
mikie2     139210.06     140485.95     140189.95 SIBLING_LIMIT_SYNC       139961.98   2.106


Test2: one unbound TCP_RR pair plus 2 unbound hogs, three runs

virgin      87108.59      88737.30      87383.98                       avg 87743.29  1.000
suresh      84281.24      84725.07      84823.57                           84931.93   .967
mikie       87850.37      86081.73      85789.49                           86573.86   .986
mikie2      92613.79      92022.95      92014.26 NO_SIBLING_LIMIT_SYNC     92217.00  1.050
mikie2     134682.16     133497.30     133584.48 SIBLING_LIMIT_SYNC


Test3: three unbound TCP_RR pairs, single run

virgin      55246.99      55138.67      55248.95                       avg 55211.53  1.000
suresh      53141.24      53165.45      53224.71                           53177.13   .963
mikie       47627.14      47361.68      47389.41                           47459.41   .859
mikie2      57969.49      57704.79      58218.14 NO_SIBLING_LIMIT_SYNC     57964.14  1.049
mikie2     132205.11     133726.94     133706.09 SIBLING_LIMIT_SYNC       133212.71  2.412


Test4: three bound TCP_RR pairs, single run

virgin     130073.67     130202.02     131666.48                      avg 130647.39  1.000
suresh     129805.98     128058.25     128709.77                          128858.00   .986
mikie      125597.11     127260.39     127208.73                          126688.74   .969
mikie2     135441.58     134961.89     137162.00                          135855.15  1.039


Test5: drop shield, tbench 8

virgin     2118.26 MB/sec  1.000
suresh     2036.32 MB/sec   .961
mikie      2051.18 MB/sec   .968
mikie2     2125.21 MB/sec  1.003  (hohum, all within tbench jitter)

Problem reference: select_idle_sibling() = painful L2 misses with Westmere.

Identical configs, nohz=off NO_TTWU_QUEUE,
processor.max_cstate=0 intel_idle.max_cstate=0
turbo-boost off (so both are now plain 2.4GHz boxen)

single bound TCP_RR pair
              E5620      Q6600  bound
           90196.84   42517.96  3->0
           92654.92   43946.50  3->1
           91735.26   95274.10  3->2
          129394.55   95266.83  3->3
           89127.98             3->4
           91303.15             3->5
           91345.85             3->6
           74141.88             3->7  huh?.. load is synchronous!



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 19:08             ` Suresh Siddha
@ 2011-11-18 15:12               ` Peter Zijlstra
  2011-11-18 15:26                 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2011-11-18 15:12 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote:
> 
> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Subject: sched: cleanup domain traversal in select_idle_sibling
> 
> Instead of going through the scheduler domain hierarchy multiple times
> (for giving priority to an idle core over an idle SMT sibling in a busy
> core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
> flag and traverse the domain hierarchy down till we find an idle group.
> 
> This cleanup also addresses an issue reported by Mike where the recent
> changes returned the busy thread even in the presence of an idle SMT
> sibling in single socket platforms.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Cc: Mike Galbraith <efault@gmx.de> 

Thanks Suresh!

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:56           ` Peter Zijlstra
  2011-11-17 16:38             ` Mike Galbraith
@ 2011-11-17 19:08             ` Suresh Siddha
  2011-11-18 15:12               ` Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-17 19:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 07:56 -0800, Peter Zijlstra wrote:
> D'0h, indeed.. 
> 
> Something like the below maybe, although I'm certain it all can be
> written much nicer indeed.
> 

Peter, I just noticed that the -tip tree has the original proposed patch
and the new sched/ directory. So updated my cleanup patch accordingly.
Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched: cleanup domain traversal in select_idle_sibling

Instead of going through the scheduler domain hierarchy multiple times
(for giving priority to an idle core over an idle SMT sibling in a busy
core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
flag and traverse the domain hierarchy down till we find an idle group.

This cleanup also addresses an issue reported by Mike where the recent
changes returned the busy thread even in the presence of an idle SMT
sibling in single socket platforms.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c  |   38 +++++++++++++++++++++++++-------------
 kernel/sched/sched.h |    2 ++
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..537e16a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,6 +2642,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2651,7 +2673,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i, smt = 0;
+	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2671,19 +2693,9 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-again:
-	for_each_domain(target, sd) {
-		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
-			continue;
-
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
-			break;
-		}
 
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
 			if (!cpumask_intersects(sched_group_cpus(sg),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c2e7802..8715055 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -501,6 +501,8 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
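
Taken together, the two helpers give the traversal shape used in
select_idle_sibling() above. A minimal usage sketch (assuming a valid
cpu and the RCU read lock held, as in the patch):

	struct sched_domain *sd;

	/* climb to the widest domain that still shares package resources */
	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	/* then walk back down it, widest groups (whole cores) first */
	for_each_lower_domain(sd) {
		/* scan sd->groups for a fully idle group here */
	}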



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 16:38             ` Mike Galbraith
@ 2011-11-17 17:36               ` Suresh Siddha
  2011-11-18 15:14                 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-17 17:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > Something like the below maybe, although I'm certain it all can be
> > written much nicer indeed.
> 
> I'll give it a go.
> 
> Squabbling with bouncing buddies in an isolated and otherwise idle
> cpuset ate my day.
>  

Well, looks like I managed to have a similar issue in my patch too.
Anyway, here is the updated, cleaned-up version of the patch ;)
---

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   57 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..020483a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2271,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2292,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:56           ` Peter Zijlstra
@ 2011-11-17 16:38             ` Mike Galbraith
  2011-11-17 17:36               ` Suresh Siddha
  2011-11-17 19:08             ` Suresh Siddha
  1 sibling, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17 16:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote:
> > On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> > > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> > 
> > > > It should be ok. Isn't it?
> > > 
> > > Nope, wasn't ok.
> > 
> > Because you can't get to again: with a single E5620 package, it having
> > only SIBLING and MC domains.
> >  
> > again:
> >         for_each_domain(target, sd) {
> >                 if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> >                         continue;
> > 
> >                 if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> >                         if (!smt) {
> >                                 smt = 1;
> >                                 goto again;
> > 
> 
> D'0h, indeed.. 
> 
> Something like the below maybe, although I'm certain it all can be
> written much nicer indeed.

I'll give it a go.

Squabbling with bouncing buddies in an isolated and otherwise idle
cpuset ate my day.
 
> ---
>  kernel/sched/fair.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cd3b642..340e62e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
>  			continue;
>  
> -		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> -			if (!smt) {
> -				smt = 1;
> -				goto again;
> -			}
> +		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
> +			break;
> +
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
>  			break;
> -		}
>  
>  		sg = sd->groups;
>  		do {
> @@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  			sg = sg->next;
>  		} while (sg != sd->groups);
>  	}
> +	if (!smt) {
> +		smt = 1;
> +		goto again;
> +	}
>  done:
>  	rcu_read_unlock();
>  
> 
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:38         ` Mike Galbraith
@ 2011-11-17 15:56           ` Peter Zijlstra
  2011-11-17 16:38             ` Mike Galbraith
  2011-11-17 19:08             ` Suresh Siddha
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2011-11-17 15:56 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> 
> > > It should be ok. Isn't it?
> > 
> > Nope, wasn't ok.
> 
> Because you can't get to again: with a single E5620 package, it having
> only SIBLING and MC domains.
>  
> again:
>         for_each_domain(target, sd) {
>                 if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
>                         continue;
> 
>                 if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
>                         if (!smt) {
>                                 smt = 1;
>                                 goto again;
> 

D'0h, indeed.. 

Something like the below maybe, although I'm certain it all can be
written much nicer indeed.

---
 kernel/sched/fair.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..340e62e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
 		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
 			continue;
 
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
+		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
+			break;
+
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
-		}
 
 		sg = sd->groups;
 		do {
@@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+	if (!smt) {
+		smt = 1;
+		goto again;
+	}
 done:
 	rcu_read_unlock();
 



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17  1:59       ` Mike Galbraith
@ 2011-11-17 15:38         ` Mike Galbraith
  2011-11-17 15:56           ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17 15:38 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:

> > It should be ok. Isn't it?
> 
> Nope, wasn't ok.

Because you can't get to again: with a single E5620 package, it having
only SIBLING and MC domains.
 
again:
        for_each_domain(target, sd) {
                if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
                        continue;

                if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
                        if (!smt) {
                                smt = 1;
                                goto again;


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16 18:37     ` Suresh Siddha
@ 2011-11-17  1:59       ` Mike Galbraith
  2011-11-17 15:38         ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17  1:59 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:

> > At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.
> 
> Mike, At the sibling level, domain span will be 0,4 which is 0x5. But
> there are two individual groups. First group just contains cpu0 and the
> second group contains cpu4.
> 
> So if cpu0 is busy, we will check the next group to see if it is idle
> (which is cpu4 in your example). So we will return cpu-4.

Oh duh, right.

> It should be ok. Isn't it?

Nope, wasn't ok.  I'll double check today though, and bend, spindle
and/or mutilate as required.

	-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16  9:24   ` Mike Galbraith
@ 2011-11-16 18:37     ` Suresh Siddha
  2011-11-17  1:59       ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-16 18:37 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:
> On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:
> 
> > How about this patch which is more self explanatory?
> 
> Actually, after further testing/reading, it looks to me like both of
> these patches have a problem.  They'll never select SMT siblings (so a
> 'skip SIBLING' check should accomplish the same).
> 
> This patch didn't select an idle core either though, where Peter's did.
> 
> Tested by pinning a hog to cores 1-3, then starting up an unpinned
> tbench pair.  Peter's patch didn't do the BadThing (as in bad for
> TCP_RR/tbench) in that case, but should have.
> 
> > +		sg = sd->groups;
> > +		do {
> > +			if (!cpumask_intersects(sched_group_cpus(sg),
> > +						tsk_cpus_allowed(p)))
> > +				goto next;
> >  
> > +			for_each_cpu(i, sched_group_cpus(sg)) {
> > +				if (!idle_cpu(i))
> > +					goto next;
> 
> Say target is CPU0.  Groups are 0,4 1,5 2,6 3,7.  0-3 are first CPUs
> encountered in MC groups, all were busy.  At SIBLING level, the only
> group is 0,4.  First encountered CPU of sole group is busy target, so
> we're done.. so we return busy target.
> 
> > +			target = cpumask_first_and(sched_group_cpus(sg),
> > +					tsk_cpus_allowed(p));
> 
> At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

Mike, At the sibling level, domain span will be 0,4 which is 0x5. But
there are two individual groups. First group just contains cpu0 and the
second group contains cpu4.

So if cpu0 is busy, we will check the next group to see if it is idle
(which is cpu4 in your example). So we will return cpu-4.
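
To make the traversal concrete, the topology in your example looks like
this (one quad-core HT package, viewed from cpu0):

	SIBLING domain: span {0,4}   groups: {0} {4}
	MC domain:      span {0-7}   groups: {0,4} {1,5} {2,6} {3,7}

so a busy cpu0 at the SIBLING level still leaves the {4} group to be
checked, and an idle cpu4 gets returned.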

It should be ok. Isn't it?

thanks,
suresh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16  1:14 ` Suresh Siddha
@ 2011-11-16  9:24   ` Mike Galbraith
  2011-11-16 18:37     ` Suresh Siddha
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-16  9:24 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:

> How about this patch which is more self explanatory?

Actually, after further testing/reading, it looks to me like both of
these patches have a problem.  They'll never select SMT siblings (so a
'skip SIBLING' check should accomplish the same).

This patch didn't select an idle core either though, where Peter's did.

Tested by pinning a hog to cores 1-3, then starting up an unpinned
tbench pair.  Peter's patch didn't do the BadThing (as in bad for
TCP_RR/tbench) in that case, but should have.

> +		sg = sd->groups;
> +		do {
> +			if (!cpumask_intersects(sched_group_cpus(sg),
> +						tsk_cpus_allowed(p)))
> +				goto next;
>  
> +			for_each_cpu(i, sched_group_cpus(sg)) {
> +				if (!idle_cpu(i))
> +					goto next;

Say target is CPU0.  Groups are 0,4 1,5 2,6 3,7.  0-3 are first CPUs
encountered in MC groups, all were busy.  At SIBLING level, the only
group is 0,4.  First encountered CPU of sole group is busy target, so
we're done.. so we return busy target.

> +			target = cpumask_first_and(sched_group_cpus(sg),
> +					tsk_cpus_allowed(p));

At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

	-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-15  9:46 Peter Zijlstra
@ 2011-11-16  1:14 ` Suresh Siddha
  2011-11-16  9:24   ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-16  1:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Paul Turner, Mike Galbraith

On Tue, 2011-11-15 at 01:46 -0800, Peter Zijlstra wrote:
> @@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
>  	 * Otherwise, iterate the domains and find an elegible idle cpu.
>  	 */
>  	rcu_read_lock();
> +again:
>  	for_each_domain(target, sd) {
> -		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
> -			break;
> +		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> +			continue;
>  
> -		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> -			if (idle_cpu(i)) {
> -				target = i;
> -				break;
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> +			if (!smt) {
> +				smt = 1;
> +				goto again;
>  			}
> +			break;
>  		}

It looks like you will be checking the core domain twice (with smt == 0
and smt == 1) if there are no idle siblings.
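
Tracing the quoted loop makes that visible (assuming a multi-socket box,
so there is a domain above MC without SD_SHARE_PKG_RESOURCES, and no
idle cpu anywhere):

	pass 1 (smt == 0): SIBLING skipped via the SD_SHARE_CPUPOWER test,
	                   MC (core) groups scanned, then smt = 1, goto again;
	pass 2 (smt == 1): SIBLING groups scanned, then the MC groups
	                   scanned a second time before the break.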

How about this patch which is more self explanatory?
---

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   54 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..cb7a5ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,25 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd)
+		if (!(sd->flags & flag))
+			return sd->child;
+	return NULL;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2268,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2289,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* sched: Avoid SMT siblings in select_idle_sibling() if possible
@ 2011-11-15  9:46 Peter Zijlstra
  2011-11-16  1:14 ` Suresh Siddha
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2011-11-15  9:46 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar; +Cc: Paul Turner, Suresh Siddha, Mike Galbraith

Subject: sched: Avoid SMT siblings in select_idle_sibling() if possible
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 10 Nov 2011 13:01:10 +0100

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-smlz1ah4jwood10f7eml9gzk@git.kernel.org
---
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2326,7 +2326,8 @@ static int select_idle_sibling(struct ta
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
-	int i;
+	struct sched_group *sg;
+	int i, smt = 0;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
+again:
 	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
+			continue;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+			if (!smt) {
+				smt = 1;
+				goto again;
 			}
+			break;
 		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2012-03-27 13:56 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1329764866.2293.376.camhel@twins>
2012-03-05 15:24 ` sched: Avoid SMT siblings in select_idle_sibling() if possible Srivatsa Vaddagiri
2012-03-06  9:14   ` Ingo Molnar
2012-03-06 10:03     ` Srivatsa Vaddagiri
2012-03-22 15:32     ` Srivatsa Vaddagiri
2012-03-23  6:38       ` Mike Galbraith
2012-03-26  8:29       ` Peter Zijlstra
2012-03-26  8:36       ` Peter Zijlstra
2012-03-26 17:35         ` Srivatsa Vaddagiri
2012-03-26 18:06           ` Peter Zijlstra
2012-03-27 13:56             ` Mike Galbraith
2011-11-15  9:46 Peter Zijlstra
2011-11-16  1:14 ` Suresh Siddha
2011-11-16  9:24   ` Mike Galbraith
2011-11-16 18:37     ` Suresh Siddha
2011-11-17  1:59       ` Mike Galbraith
2011-11-17 15:38         ` Mike Galbraith
2011-11-17 15:56           ` Peter Zijlstra
2011-11-17 16:38             ` Mike Galbraith
2011-11-17 17:36               ` Suresh Siddha
2011-11-18 15:14                 ` Mike Galbraith
2012-02-20 14:41                   ` Peter Zijlstra
2012-02-20 15:03                     ` Srivatsa Vaddagiri
2012-02-20 18:25                       ` Mike Galbraith
2012-02-21  0:06                         ` Srivatsa Vaddagiri
2012-02-21  6:37                           ` Mike Galbraith
2012-02-21  8:09                             ` Srivatsa Vaddagiri
2012-02-20 18:14                     ` Mike Galbraith
2012-02-20 18:15                       ` Peter Zijlstra
2012-02-20 19:07                       ` Peter Zijlstra
2012-02-21  5:43                         ` Mike Galbraith
2012-02-21  8:32                           ` Srivatsa Vaddagiri
2012-02-21  9:21                             ` Mike Galbraith
2012-02-21 10:37                               ` Peter Zijlstra
2012-02-21 14:58                                 ` Srivatsa Vaddagiri
2012-02-23 10:49                       ` Srivatsa Vaddagiri
2012-02-23 11:19                         ` Ingo Molnar
2012-02-23 12:18                           ` Srivatsa Vaddagiri
2012-02-23 11:20                         ` Srivatsa Vaddagiri
2012-02-23 11:26                           ` Ingo Molnar
2012-02-23 11:32                             ` Srivatsa Vaddagiri
2012-02-23 16:17                               ` Ingo Molnar
2012-02-23 11:21                         ` Mike Galbraith
2012-02-25  6:54                           ` Srivatsa Vaddagiri
2012-02-25  8:30                             ` Mike Galbraith
2012-02-27 22:11                               ` Suresh Siddha
2012-02-28  5:05                                 ` Mike Galbraith
2011-11-17 19:08             ` Suresh Siddha
2011-11-18 15:12               ` Peter Zijlstra
2011-11-18 15:26                 ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).