linux-kernel.vger.kernel.org archive mirror
* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
       [not found] <1329764866.2293.376.camel@twins>
@ 2012-03-05 15:24 ` Srivatsa Vaddagiri
  2012-03-06  9:14   ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-05 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:

> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
> 
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

I did some experiments with volanomark and it does turn out to be
sensitive to SD_BALANCE_WAKE, while the other wake-heavy benchmark that I am
dealing with (Trade) benefits from it.

Normalized results for both benchmarks are provided below.

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at b86148a) 

	   	   Before patch	    After patch

Trade thr'put		1		2.17 (~117% improvement)
volanomark		1		0.8  (20% degradation)


Quick description of benchmarks 
===============================

Trade was run inside an 8-vcpu VM (cgroup). Four other 4-vcpu VMs running
cpu hogs were also present, leading to this cgroup setup:

	/cgroup/sys (1024 shares - hosts all system tasks)
	/cgroup/libvirt (20000 shares)
	/cgroup/libvirt/qemu/VM1 (8192 cpu shares)
	/cgroup/libvirt/qemu/VM2-5 (1024 shares)

Volanomark server/client programs were run in root cgroup.

The patch essentially does balance on wake: it looks for any idle cpu in
the same cache domain as the task's prev_cpu (or cur_cpu if wake_affine
obliges); failing that, it looks for the least loaded cpu. This helps
minimize latencies for the Trade workload (and thus boosts its score). For
volanomark, it seems to hurt because of waking on a colder L2 cache.  The
tradeoff seems to be between latency and cache misses. Short of adding
another tunable, are there better suggestions on how we can address this
sort of tradeoff?

Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |   26 +++++++++++++++++++++-----
 2 files changed, 23 insertions(+), 7 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2638,7 +2638,7 @@ static int select_idle_sibling(struct ta
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i;
+	int i, some_idle_cpu = -1;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2661,15 +2661,25 @@ static int select_idle_sibling(struct ta
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int skip = 0;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
+			for_each_cpu_and(i, sched_group_cpus(sg),
+							 tsk_cpus_allowed(p)) {
+				if (!idle_cpu(i)) {
+					if (some_idle_cpu >= 0)
+						goto next;
+					skip = 1;
+				} else
+					some_idle_cpu = i;
 			}
 
+			if (skip)
+				goto next;
+
 			target = cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
 			goto done;
@@ -2677,6 +2687,9 @@ next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+
+	if (some_idle_cpu >= 0)
+		target = some_idle_cpu;
 done:
 	return target;
 }
@@ -2766,7 +2779,10 @@ select_task_rq_fair(struct task_struct *
 			prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-05 15:24 ` sched: Avoid SMT siblings in select_idle_sibling() if possible Srivatsa Vaddagiri
@ 2012-03-06  9:14   ` Ingo Molnar
  2012-03-06 10:03     ` Srivatsa Vaddagiri
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 49+ messages in thread
From: Ingo Molnar @ 2012-03-06  9:14 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Peter Zijlstra <peterz@infradead.org> [2012-02-20 20:07:46]:
> 
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:

> > > Enabling SD_BALANCE_WAKE used to be decidedly too 
> > > expensive to consider. Maybe that has changed, but I doubt 
> > > it.
> > 
> > Right, I thought I remembered some such, you could see it 
> > on wakeup heavy things like pipe-bench and that java msg 
> > passing thing, right?
> 
> I did some experiments with volanomark and it does turn out to 
> be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> benchmark that I am dealing with (Trade) benefits from it.

Does volanomark still do yield(), thereby invoking a random 
shuffle of thread scheduling and pretty much voluntarily 
ejecting itself from most scheduler performance considerations?

If it uses a real locking primitive such as futexes then its 
performance matters more.

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-06  9:14   ` Ingo Molnar
@ 2012-03-06 10:03     ` Srivatsa Vaddagiri
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  1 sibling, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-06 10:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> Does volanomark still do yield(),

I could count 3.5M yield() calls across 16 cpus over a period of 5.4 sec,
so yes, it does seem to make very good use of yield()!
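
For reference, one way to obtain such a count (not necessarily how it was
measured here) is the sched_yield syscall tracepoint, assuming a kernel
built with tracepoint support:

	perf stat -e syscalls:sys_enter_sched_yield -a sleep 5

which counts sched_yield() calls machine-wide for ~5 seconds.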

> thereby invoking a random 
> shuffle of thread scheduling and pretty much voluntarily 
> ejecting itself from most scheduler performance considerations?
> 
> If it uses a real locking primitive such as futexes then its 
> performance matters more.

Good point .. I will gather more data for additional benchmarks
(tbench, sysbench) and post shortly.

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-06  9:14   ` Ingo Molnar
  2012-03-06 10:03     ` Srivatsa Vaddagiri
@ 2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
                         ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-22 15:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:

> > I did some experiments with volanomark and it does turn out to 
> > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> > benchmark that I am dealing with (Trade) benefits from it.
> 
> Does volanomark still do yield(), thereby invoking a random 
> shuffle of thread scheduling and pretty much voluntarily 
> ejecting itself from most scheduler performance considerations?
> 
> If it uses a real locking primitive such as futexes then its 
> performance matters more.

Some more interesting results on a more recent tip kernel.

Machine : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
Kernel  : tip (HEAD at ee415e2)
guest VM : 2.6.18 linux kernel based enterprise guest

Benchmarks are run in two scenarios:

1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
 	 various cgroups) are run on host. Cgroup setup is as below:

	/sys (cpu.shares = 1024, hosts all system tasks)
	/libvirt (cpu.shares = 20000)
 	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
	/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
	/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)
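
For reference, shares like these are set through the cgroup cpu
controller's cpu.shares file; e.g., assuming the hierarchy is mounted
at /cgroup:

	echo 20000 > /cgroup/libvirt/cpu.shares
	echo 8192 > /cgroup/libvirt/qemu/VM/cpu.shares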

First BM (bare metal) scenario:

		tip	tip + patch

volano 		1	0.955   (4.5% degradation)
sysbench [n1] 	1	0.9984  (0.16% degradation)
tbench 1 [n2]	1	0.9096  (9% degradation)

Now the more interesting VM scenario:

		tip	tip + patch

volano		1	1.29   (29% improvement)
sysbench [n3]	1	2      (100% improvement)
tbench 1 [n4] 	1       1.07   (7% improvement)
tbench 8 [n5] 	1       1.26   (26% improvement)
httperf  [n6]	1 	1.05   (5% improvement)
Trade		1	1.31   (31% improvement)

Notes:
 
n1. sysbench was run with 16 threads.
n2. tbench was run on localhost with 1 client 
n3. sysbench was run with 8 threads
n4. tbench was run on localhost with 1 client
n5. tbench was run over network with 8 clients
n6. httperf was run with a burst-length of 100 and wsess of 100,500,0
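
For concreteness, an httperf invocation matching note n6 would look roughly
like this (the server name is a placeholder; the exact command line used
was not posted):

	httperf --server=<host> --wsess=100,500,0 --burst-length=100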

So the patch seems to be an across-the-board win when VCPU threads are waking
up (in a highly contended environment). One reason could be that the assumption 
of better cache hits from running (vcpu) threads on their prev_cpu may not
be fully correct, as vcpu threads could represent many different threads
internally?

Anyway, there are degradations as well; to address them I see several 
possibilities:

1. Do balance-on-wake for vcpu threads only.
2. Document tuning possibility to improve performance in virtualized
   environment:
	- Either via sched_domain flags (disable SD_WAKE_AFFINE 
 	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
	- Or via a new sched_feat(BALANCE_WAKE) tunable
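
As an illustration of the sched_domain flags route: with SCHED_DEBUG
enabled, per-domain flags are writable under /proc (the same mechanism
used elsewhere in this thread; the numeric values encode SD_* bits and are
specific to the kernel version and box, so they are indicative only):

	cd /proc/sys/kernel/sched_domain/
	for i in `find . -name flags | grep domain1`; do echo 4671 > $i; done
	for i in `find . -name flags | grep domain0`; do echo 703 > $i; done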

Any other thoughts or suggestions for more experiments?


--

Balance threads on wakeup to let them run on the least-loaded CPU in the same
cache domain as their prev_cpu (or cur_cpu if the wake_affine() test obliges).

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>


---
 include/linux/topology.h |    4 ++--
 kernel/sched/fair.c      |    5 ++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

Index: current/include/linux/topology.h
===================================================================
--- current.orig/include/linux/topology.h
+++ current/include/linux/topology.h
@@ -96,7 +96,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 1*SD_SHARE_CPUPOWER			\
 				| 0*SD_POWERSAVINGS_BALANCE		\
@@ -129,7 +129,7 @@ int arch_update_cpu_topology(void);
 				| 1*SD_BALANCE_NEWIDLE			\
 				| 1*SD_BALANCE_EXEC			\
 				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
+				| 1*SD_BALANCE_WAKE			\
 				| 1*SD_WAKE_AFFINE			\
 				| 0*SD_PREFER_LOCAL			\
 				| 0*SD_SHARE_CPUPOWER			\
Index: current/kernel/sched/fair.c
===================================================================
--- current.orig/kernel/sched/fair.c
+++ current/kernel/sched/fair.c
@@ -2766,7 +2766,10 @@ select_task_rq_fair(struct task_struct *
 			prev_cpu = cpu;
 
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
+		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+		cpu = prev_cpu;
 	}
 
 	while (sd) {



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
@ 2012-03-23  6:38       ` Mike Galbraith
  2012-03-26  8:29       ` Peter Zijlstra
  2012-03-26  8:36       ` Peter Zijlstra
  2 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-03-23  6:38 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote: 
> * Ingo Molnar <mingo@elte.hu> [2012-03-06 10:14:11]:
> 
> > > I did some experiments with volanomark and it does turn out to 
> > > be sensitive to SD_BALANCE_WAKE, while the other wake-heavy 
> > > benchmark that I am dealing with (Trade) benefits from it.
> > 
> > Does volanomark still do yield(), thereby invoking a random 
> > shuffle of thread scheduling and pretty much voluntarily 
> > ejecting itself from most scheduler performance considerations?
> > 
> > If it uses a real locking primitive such as futexes then its 
> > performance matters more.
> 
> Some more interesting results on more recent tip kernel.

Yeah, interesting.  I find myself ever returning to this message, as
gears grind away trying to imagine what's going on in those vcpus.

> Machine : 2 Quad-core Intel X5570 CPU w/ H/T enabled (16 cpus)
> Kernel  : tip (HEAD at ee415e2)
> guest VM : 2.6.18 linux kernel based enterprise guest
> 
> Benchmarks are run in two scenarios:
> 
> 1. BM -> Bare Metal. Benchmark is run on bare metal in root cgroup
> 2. VM -> Benchmark is run inside a guest VM. Several cpu hogs (in
>  	 various cgroups) are run on host. Cgroup setup is as below:
> 
> 	/sys (cpu.shares = 1024, hosts all system tasks)
> 	/libvirt (cpu.shares = 20000)
>  	/libvirt/qemu/VM (cpu.shares = 8192. guest VM w/ 8 vcpus)
> 	/libvirt/qemu/hoga (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogb (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogc (cpu.shares = 1024. hosts 4 cpu hogs)
> 	/libvirt/qemu/hogd (cpu.shares = 1024. hosts 4 cpu hogs)
> 
> First BM (bare metal) scenario:
> 
> 		tip	tip + patch
> 
> volano 		1	0.955   (4.5% degradation)
> sysbench [n1] 	1	0.9984  (0.16% degradation)
> tbench 1 [n2]	1	0.9096  (9% degradation)

Those make sense, fast path cycles added.

> Now the more interesting VM scenario:
> 
> 		tip	tip + patch
> 
> volano		1	1.29   (29% improvement)
> sysbench [n3]	1	2      (100% improvement)
> tbench 1 [n4] 	1       1.07   (7% improvement)
> tbench 8 [n5] 	1       1.26   (26% improvement)
> httperf  [n6]	1 	1.05   (5% improvement)
> Trade		1	1.31   (31% improvement)
> 
> Notes:
>  
> n1. sysbench was run with 16 threads.
> n2. tbench was run on localhost with 1 client 
> n3. sysbench was run with 8 threads
> n4. tbench was run on localhost with 1 client
> n5. tbench was run over network with 8 clients
> n6. httperf was run with a burst-length of 100 and wsess of 100,500,0
> 
> So the patch seems to be an across-the-board win when VCPU threads are waking
> up (in a highly contended environment). One reason could be that the assumption 
> of better cache hits from running (vcpu) threads on their prev_cpu may not
> be fully correct, as vcpu threads could represent many different threads
> internally?
> 
> Anyway, there are degradations as well; to address them I see several 
> possibilities:
> 
> 1. Do balance-on-wake for vcpu threads only.

That's what your numbers say to me with this patch.  I'm not getting the
why, but your patch appears to reduce vcpu internal latencies hugely.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
> 	- Either via sched_domain flags (disable SD_WAKE_AFFINE 
>  	  at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)
> 	- Or via a new sched_feat(BALANCE_WAKE) tunable
> 
> Any other thoughts or suggestions for more experiments?

Other than nuke select_idle_sibling() entirely instead, none here.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
@ 2012-03-26  8:29       ` Peter Zijlstra
  2012-03-26  8:36       ` Peter Zijlstra
  2 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-03-26  8:29 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> Anyway, there are degradations as well; to address them I see
> several 
> possibilities:
> 
> 1. Do balance-on-wake for vcpu threads only.

Hell no ;-) we're not going to special case some threads over others.

> 2. Document tuning possibility to improve performance in virtualized
>    environment:
>         - Either via sched_domain flags (disable SD_WAKE_AFFINE 
>           at all levels and enable SD_BALANCE_WAKE at SMT/MC levels)

But domain flags are not exported -- except under SCHED_DEBUG and that
sysctl mess.. also SD_flags are not stable.

>         - Or via a new sched_feat(BALANCE_WAKE) tunable 

sched_feat() is not a stable ABI and shouldn't ever be used for anything
but debugging (hence it lives in debugfs and goes away if you disable
SCHED_DEBUG).


I would very much like more information on why things are a loss. Is it
really related to what cpu you pick, or is it the cost of doing the
balance thing?


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-22 15:32     ` Srivatsa Vaddagiri
  2012-03-23  6:38       ` Mike Galbraith
  2012-03-26  8:29       ` Peter Zijlstra
@ 2012-03-26  8:36       ` Peter Zijlstra
  2012-03-26 17:35         ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-03-26  8:36 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Thu, 2012-03-22 at 21:02 +0530, Srivatsa Vaddagiri wrote:
> 
> Now the more interesting VM scenario:
> 
More interesting for whom? I still think virt is an utter waste of
time ;-)

>                 tip     tip + patch
> 
> volano          1       1.29   (29% improvement)
> sysbench [n3]   1       2      (100% improvement)
> tbench 1 [n4]   1       1.07   (7% improvement)
> tbench 8 [n5]   1       1.26   (26% improvement)
> httperf  [n6]   1       1.05   (5% improvement)
> Trade           1       1.31   (31% improvement) 

That smells like there's more to the story, a 100% improvement is too
much..


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26  8:36       ` Peter Zijlstra
@ 2012-03-26 17:35         ` Srivatsa Vaddagiri
  2012-03-26 18:06           ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-03-26 17:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-03-26 10:36:00]:

> >                 tip     tip + patch 
> > 
> > volano          1       1.29   (29% improvement)
> > sysbench [n3]   1       2      (100% improvement)
> > tbench 1 [n4]   1       1.07   (7% improvement)
> > tbench 8 [n5]   1       1.26   (26% improvement)
> > httperf  [n6]   1       1.05   (5% improvement)
> > Trade           1       1.31   (31% improvement) 
> 
> That smells like there's more to the story, a 100% improvement is too
> much..

Yeah, I have rubbed my eyes several times to make sure it's true, and ran
the same benchmark (sysbench) again just now! I can recreate that ~100%
improvement with the patch even now.

To quickly recap my environment, I have a 16-cpu machine w/ 5 cgroups.
One cgroup (8192 shares) hosts sysbench inside an 8-vcpu VM, while the
remaining 4 cgroups (1024 shares each) each host 4 cpu hogs running on bare
metal. Given this overcommitment, select_idle_sibling() should mostly be a 
no-op (i.e. it won't find any idle cores and thus defaults to prev_cpu).
Also, the only tasks that will (sleep and) wake up are the VM tasks.

Since the patch potentially affects (improves) scheduling latencies, I measured 
Sum(se.statistics.wait_sum) for the VM tasks over the benchmark run (5
iterations of sysbench).

tip	    : 987240 ms
tip + patch : 280275 ms 
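
wait_sum here comes from the per-task schedstats; with CONFIG_SCHEDSTATS
enabled it can be read per thread and summed, roughly as below, where
<vm-thread-pids> stands in for the pids of the VM's threads:

	for pid in <vm-thread-pids>; do
		grep wait_sum /proc/$pid/sched
	done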

I will get more comprehensive perf data shortly and post it. 

From what I can tell, the huge improvement in the benchmark score is coming from 
reduced latencies for the VM tasks. 

The hard part to figure out (when we are inside select_task_rq_fair()) is 
whether any potential improvement in latencies (because of waking up on a
less loaded cpu) will outweigh the cost of potentially more L2-cache misses, 
for which IMHO we don't have enough data to make a good decision.

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26 17:35         ` Srivatsa Vaddagiri
@ 2012-03-26 18:06           ` Peter Zijlstra
  2012-03-27 13:56             ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-03-26 18:06 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, Suresh Siddha, linux-kernel, Paul Turner

On Mon, 2012-03-26 at 23:05 +0530, Srivatsa Vaddagiri wrote:
> 
> From what I can tell, the huge improvement in benchmark score is coming from 
> reduced latencies for its VM tasks. 

But if the machine is pegged, latency should not impact throughput (since
there's always work to do), so are you creating extra idle time some
place?

Are you running against lock-inversion in the vcpus? Or that tlb
shootdown issue we had in the gang scheduling thread? Both are typically
busy-wait time, which is of course harder to spot than actual idle
time :/

Then again, reducing latency is good, so I don't object to that per-se,
but that flips the question, why does it regress those other loads?

The biggest regression came from tbench, wasn't that mostly a random
number generator anyway? How stable are those results, do you have a
variance measure on the results?


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-03-26 18:06           ` Peter Zijlstra
@ 2012-03-27 13:56             ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-03-27 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Suresh Siddha, linux-kernel,
	Paul Turner

On Mon, 2012-03-26 at 20:06 +0200, Peter Zijlstra wrote:

> The biggest regression came from tbench, wasn't that mostly a random
> number generator anyway?

That's dbench ;-)  Tbench does wiggle around more than you'd like
though.  I've never tracked it, but from lots of running the thing, 2-3% you
can forget about chasing, and I once bisected a 5% regression to an unrelated
change, reverted and re-applied the suspect a few times in disbelief,
and it tracked.  You can't trust it _too_ much.  I consider it a
reliable indicator that I'd better do more testing.  If others agree,
it's a real regression; if not, it's those damn Elves 'n Gremlins ;-)

-Mike




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-27 22:11                               ` Suresh Siddha
@ 2012-02-28  5:05                                 ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-02-28  5:05 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Paul Turner

On Mon, 2012-02-27 at 14:11 -0800, Suresh Siddha wrote: 
> On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> > My less rotund config shows the L2 penalty decidedly more prominently.
> > We used to have avg_overlap as a synchronous wakeup hint, but it was
> > broken by preemption and whatnot, got the axe to recover some cycles.  A
> > reliable and dirt cheap replacement would be a good thing to have.
> > 
> > TCP_RR and tbench are far away from the overlap breakeven point on
> > E5620, whereas with Q6600's shared L2, you can start converting overlap
> > into throughput almost immediately. 
> > 
> > 2.4 GHz E5620
> > Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> > Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> > 
> > 2.4 GHz Q6600
> > Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> > Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> > 
> 
> Also it is not always just about the L2 cache being shared or not, or
> warm vs. cold, etc. It also depends on the core C-states/P-states etc.
> Waking up an idle core has a cost, and that cost will depend on what
> core C-state it is in. And also, if we ping-pong between cores often, the
> cpufreq governor will come in and request a lower core P-state even
> though the load was keeping one core or the other in the socket always
> busy at any given point of time.

Yeah, pinning yields a couple percent on the Q6600 box, more on the E5620
despite its spiffier gearbox.. likely turbo-boost doing its thing.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-25  8:30                             ` Mike Galbraith
@ 2012-02-27 22:11                               ` Suresh Siddha
  2012-02-28  5:05                                 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2012-02-27 22:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Srivatsa Vaddagiri, Peter Zijlstra, linux-kernel, Ingo Molnar,
	Paul Turner

On Sat, 2012-02-25 at 09:30 +0100, Mike Galbraith wrote:
> My less rotund config shows the L2 penalty decidedly more prominently.
> We used to have avg_overlap as a synchronous wakeup hint, but it was
> broken by preemption and whatnot, got the axe to recover some cycles.  A
> reliable and dirt cheap replacement would be a good thing to have.
> 
> TCP_RR and tbench are far away from the overlap breakeven point on
> E5620, whereas with Q6600's shared L2, you can start converting overlap
> into throughput almost immediately. 
> 
> 2.4 GHz E5620
> Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> 
> 2.4 GHz Q6600
> Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
> Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES
> 

Also it is not always just about the L2 cache being shared or not, or
warm vs. cold, etc. It also depends on the core C-states/P-states etc.
Waking up an idle core has a cost, and that cost will depend on what
core C-state it is in. And also, if we ping-pong between cores often, the
cpufreq governor will come in and request a lower core P-state even
though the load was keeping one core or the other in the socket always
busy at any given point of time.

thanks,
suresh



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-25  6:54                           ` Srivatsa Vaddagiri
@ 2012-02-25  8:30                             ` Mike Galbraith
  2012-02-27 22:11                               ` Suresh Siddha
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-25  8:30 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Sat, 2012-02-25 at 12:24 +0530, Srivatsa Vaddagiri wrote: 
> * Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:
> 
> > Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
> > heavy should tell the tail.
> 
> Here are some tbench numbers:
> 
> Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
> Kernel  : tip (HEAD at ebe97fa)
> dbench  : v4.0
> 
> One tbench server/client pair was run on the same host 5 times (with
> the fs-cache being purged each time) and the average of the 5 runs for the
> various cases is noted below:
> 
> Case A : HT enabled (24 logical CPUs)
> 
> Thr'put	 : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put	 : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
> Thr'put	 : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> 
> Case B : HT disabled (12 logical CPUs)
> 
> Thr'put	 : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> Thr'put	 : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
> Thr'put	 : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
> 
> Observations:
> 
> a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess 
>   reflects the cost of waking to a cold L2 cache.
> 
> b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

I haven't done a lot of testing, but yeah, the little I have doesn't
show SD_BALANCE_WAKE making much difference on single socket boxen.

> IMO we need to detect tbench-type paired wakeups as the synchronous case,
> in which case blindly wake the task up on cur_cpu (as the cost of an L2
> cache miss could outweigh the benefit of any reduced scheduling latencies).
> 
> IOW, select_task_rq_fair() needs to be given a better hint as to whether the
> L2 cache has been made warm by someone (an interrupt handler or a producer
> task), in which case the (consumer) task needs to be woken in the same L2
> cache domain (i.e. on cur_cpu itself)?

My less rotund config shows the L2 penalty decidedly more prominently.
We used to have avg_overlap as a synchronous wakeup hint, but it was
broken by preemption and whatnot, got the axe to recover some cycles.  A
reliable and dirt cheap replacement would be a good thing to have.

TCP_RR and tbench are far away from the overlap breakeven point on
E5620, whereas with Q6600's shared L2, you can start converting overlap
into throughput almost immediately. 

2.4 GHz E5620
Throughput 248.994 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 379.488 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

2.4 GHz Q6600
Throughput 299.049 MB/sec 1 procs  SD_SHARE_PKG_RESOURCES
Throughput 300.018 MB/sec 1 procs  !SD_SHARE_PKG_RESOURCES

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:21                         ` Mike Galbraith
@ 2012-02-25  6:54                           ` Srivatsa Vaddagiri
  2012-02-25  8:30                             ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-25  6:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-23 12:21:04]:

> Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
> heavy should tell the tail.

Here are some tbench numbers:

Machine : 2 Intel Xeon X5650 (Westmere) CPUs (6 core/package)
Kernel  : tip (HEAD at ebe97fa)
dbench  : v4.0

One tbench server/client pair was run on the same host 5 times (with
the fs-cache being purged each time) and the average of the 5 runs for the
various cases is noted below:

Case A : HT enabled (24 logical CPUs)

Thr'put	 : 168.166 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put	 : 169.564 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc/smt)
Thr'put	 : 173.151 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)

Case B : HT disabled (12 logical CPUs)

Thr'put	 : 167.977 MB/s (SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
Thr'put	 : 167.891 MB/s (SD_SHARE_PKG_RESOURCES + SD_BALANCE_WAKE at mc)
Thr'put	 : 173.801 MB/s (!SD_SHARE_PKG_RESOURCES + !SD_BALANCE_WAKE)
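
For reference, each such loopback pair would be started roughly as below
(the exact options used were not posted, so this is indicative only):

	tbench_srv &
	tbench -t 60 1 localhost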

Observations:

a. ~3% improvement seen with SD_SHARE_PKG_RESOURCES disabled, which I guess 
  reflects the cost of waking to a cold L2 cache.

b. No degradation seen with SD_BALANCE_WAKE enabled at mc/smt domains

IMO we need to detect tbench-type paired wakeups as the synchronous case,
in which case blindly wake the task up on cur_cpu (as the cost of an L2
cache miss could outweigh the benefit of any reduced scheduling latencies).

IOW, select_task_rq_fair() needs to be given a better hint as to whether the
L2 cache has been made warm by someone (an interrupt handler or a producer
task), in which case the (consumer) task needs to be woken in the same L2
cache domain (i.e. on cur_cpu itself)?

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:32                             ` Srivatsa Vaddagiri
@ 2012-02-23 16:17                               ` Ingo Molnar
  0 siblings, 0 replies; 49+ messages in thread
From: Ingo Molnar @ 2012-02-23 16:17 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]:
> 
> > pro perf tip of the day: did you know that it's possible to run:
> > 
> >   perf stat --repeat 10 --null perf bench sched pipe
> 
> Ah yes .. had forgotten that! What is the recommended # of 
> iterations to run?

Well, I start at 3 or 5 and sometimes increase it to get a 
meaningful stddev - i.e. one that is smaller than half of the 
measured effect. (for cases where I expect an improvement and 
want to measure it.)

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:19                         ` Ingo Molnar
@ 2012-02-23 12:18                           ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 12:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-02-23 12:19:29]:

> sysbench is one of the best ones punishing bad scheduler 
> balancing mistakes.

Here are the sysbench oltp scores on the same machine, i.e.:

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at 6241cc8)
sysbench : 0.4.12

sysbench was run with 16 threads as:

./sysbench --num-threads=16 --max-requests=100000 --test=oltp --oltp-table-size=500000 --mysql-socket=/var/lib/mysql/mysql.sock --oltp-read-only --mysql-user=root --mysql-password=blah run

sysbench was run 5 times with fs-cache being purged before each run
(echo 3 > /proc/sys/vm/drop_caches).

The average of 5 runs along with % std. dev. is noted for the various OLTP stats:

				SD_BALANCE_WAKE		SD_BALANCE_WAKE	
				disabled		enabled

transactions (per sec)		4833.826 (+- 0.75%)	4837.354 (+- 1%)
read/write requests (per sec)	67673.580 (+- 0.75%)	67722.960 (+- 1%)
other ops (per sec)		9667.654 (+- 0.75%)	9674.710 (+- 1%)

There is a minor improvement seen when SD_BALANCE_WAKE is enabled at the SMT/MC
domains, and no degradation observed with it enabled.

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:26                           ` Ingo Molnar
@ 2012-02-23 11:32                             ` Srivatsa Vaddagiri
  2012-02-23 16:17                               ` Ingo Molnar
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 11:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner

* Ingo Molnar <mingo@elte.hu> [2012-02-23 12:26:50]:

> pro perf tip of the day: did you know that it's possible to run:
> 
>   perf stat --repeat 10 --null perf bench sched pipe

Ah yes .. had forgotten that! What is the recommended # of iterations 
to run?

> and get meaningful, normalized stddev calculations for free:
> 
>        5.486491487 seconds time elapsed                                          ( +-  5.50% )
> 
> the percentage at the end shows the level of standard deviation.
> 
> You can add "--sync" as well, which will cause perf stat to call 
> sync() - this gives extra stability of individual iterations and 
> makes sure all IO cost is accounted to the right run.

Ok ..thanks for the 'pro' tip!! Will use it in my subsequent runs!

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
@ 2012-02-23 11:26                           ` Ingo Molnar
  2012-02-23 11:32                             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2012-02-23 11:26 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> "perf bench sched pipe" was run 10 times and average ops/sec 
> score alongwith std. dev is noted as below.
> 
> 
> 		SD_BALANCE_WAKE		SD_BALANCE_WAKE
> 		 disabled		enabled
> 
> Avg. score	108984.900 		111844.300 (+2.6%)
> std dev		20383.457		21668.604

pro perf tip of the day: did you know that it's possible to run:

  perf stat --repeat 10 --null perf bench sched pipe

and get meaningful, normalized stddev calculations for free:

       5.486491487 seconds time elapsed                                          ( +-  5.50% )

the percentage at the end shows the level of standard deviation.

You can add "--sync" as well, which will cause perf stat to call 
sync() - this gives extra stability of individual iterations and 
makes sure all IO cost is accounted to the right run.

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
@ 2012-02-23 11:21                         ` Mike Galbraith
  2012-02-25  6:54                           ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-23 11:21 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2012-02-23 at 16:19 +0530, Srivatsa Vaddagiri wrote:

> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
> at SMT/MC domains.

That's an unexpected (but welcome) result.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)
> Java	: OpenJDK 1.6.0_20
> Volano  : 2_9_0
> 
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
> 
> 				Average score	std. dev
> 
> SD_BALANCE_WAKE disabled	369459.750	4825.046
> SD_BALANCE_WAKE enabled		379070.500	379070.5
> 
> I am going to try the pipe benchmark next. Do you have suggestions for any other 
> benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in 
> SMT/MC domains?

Unpinned netperf TCP_RR and/or tbench pairs?  Anything that's wakeup
heavy should tell the tail.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
@ 2012-02-23 11:20                         ` Srivatsa Vaddagiri
  2012-02-23 11:26                           ` Ingo Molnar
  2012-02-23 11:21                         ` Mike Galbraith
  2 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 11:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-02-23 16:19:59]:

> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.  (general aside: testing with a
> > bloated distro config is a big mistake)
> 
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE
> at SMT/MC domains.

Ditto with the pipe benchmark.

> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)

"perf bench sched pipe" was run 10 times and average ops/sec score
alongwith std. dev is noted as below.


		SD_BALANCE_WAKE		SD_BALANCE_WAKE
		 disabled		enabled

Avg. score	108984.900 		111844.300 (+2.6%)
std dev		20383.457		21668.604

Note:

SD_BALANCE_WAKE for the SMT/MC domains was enabled by writing into the
appropriate flags files under /proc:

	# cd /proc/sys/kernel/sched_domain/
	# for i in `find . -name flags | grep domain1`; do echo 4671 > $i; done
	# for i in `find . -name flags | grep domain0`; do echo 703 > $i; done

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
@ 2012-02-23 11:19                         ` Ingo Molnar
  2012-02-23 12:18                           ` Srivatsa Vaddagiri
  2012-02-23 11:20                         ` Srivatsa Vaddagiri
  2012-02-23 11:21                         ` Mike Galbraith
  2 siblings, 1 reply; 49+ messages in thread
From: Ingo Molnar @ 2012-02-23 11:19 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Mike Galbraith, Peter Zijlstra, Suresh Siddha, linux-kernel, Paul Turner


* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> wrote:

> * Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:
> 
> > > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> > 
> > I really really need to find time to do systematic mainline testing.
> > 
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.  (general aside: testing with a
> > bloated distro config is a big mistake)
> 
> I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
> at SMT/MC domains.
> 
> Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
> Kernel  : tip (HEAD at 6241cc8)
> Java	: OpenJDK 1.6.0_20
> Volano  : 2_9_0
> 
> Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
> SMT/MC domains.
> 
> 				Average score	std. dev
> 
> SD_BALANCE_WAKE disabled	369459.750	4825.046
> SD_BALANCE_WAKE enabled		379070.500	379070.5
> 
> I am going to try the pipe benchmark next. Do you have suggestions 
> for any other benchmarks I should try to see the effect of 
> SD_BALANCE_WAKE turned on in SMT/MC domains?

sysbench is one of the best ones at punishing bad scheduler 
balancing mistakes.

Thanks,

	Ingo


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
  2012-02-20 19:07                       ` Peter Zijlstra
@ 2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2012-02-23 11:19                         ` Ingo Molnar
                                           ` (2 more replies)
  2 siblings, 3 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-23 10:49 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-20 19:14:21]:

> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> 
> I really really need to find time to do systematic mainline testing.
> 
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.  (general aside: testing with a
> bloated distro config is a big mistake)

I am seeing 2.6% _improvement_ in volanomark score by enabling SD_BALANCE_WAKE 
at SMT/MC domains.

Machine : 2 Quad-core Intel X5570 CPU (H/T enabled)
Kernel  : tip (HEAD at 6241cc8)
Java	: OpenJDK 1.6.0_20
Volano  : 2_9_0

Volano benchmark was run 4 times with and w/o SD_BALANCE_WAKE enabled in
SMT/MC domains.

				Average score	std. dev

SD_BALANCE_WAKE disabled	369459.750	4825.046
SD_BALANCE_WAKE enabled		379070.500	379070.5

I am going to try the pipe benchmark next. Do you have suggestions for any other 
benchmarks I should try to see the effect of SD_BALANCE_WAKE turned on in 
SMT/MC domains?

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21 10:37                               ` Peter Zijlstra
@ 2012-02-21 14:58                                 ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21 14:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-21 11:37:23]:

> http://www.volano.com/benchmarks.html

For some reason, volanomark stops working (the client hangs waiting for data
to arrive on a socket) in the 3rd iteration when run on the latest tip. I don't
see that with a 2.6.32-based kernel though. Checking which commit may have
caused this ..

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  9:21                             ` Mike Galbraith
@ 2012-02-21 10:37                               ` Peter Zijlstra
  2012-02-21 14:58                                 ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-21 10:37 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Srivatsa Vaddagiri, Suresh Siddha, linux-kernel, Ingo Molnar,
	Paul Turner

On Tue, 2012-02-21 at 10:21 +0100, Mike Galbraith wrote:

> I use vmark, find it _somewhat_ useful.  Not a lovely benchmark, but it
> is highly affinity sensitive, and switches heftily.  I don't put much
> value on it though, too extreme for me, but it is a ~decent indicator.
> 
> There are no doubt _lots_ better than vmark for java stuff.
> 
> I toss a variety pack at the box in obsessive-compulsive man mode when
> testing.  Which benchmarks doesn't matter much, just need to be wide
> spectrum and consistent.

http://www.volano.com/benchmarks.html

> No, I use Ingo's pipe-test, but that to measure fastpath overhead.

http://people.redhat.com/mingo/cfs-scheduler/tools/pipe-test.c

Also I think it lives as: perf bench sched pipe


* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  8:32                           ` Srivatsa Vaddagiri
@ 2012-02-21  9:21                             ` Mike Galbraith
  2012-02-21 10:37                               ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-21  9:21 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2012-02-21 at 14:02 +0530, Srivatsa Vaddagiri wrote: 
> * Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]:
> 
> > On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> > > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > > Maybe that has changed, but I doubt it.
> > > 
> > > Right, I thought I remembered some such, you could see it on wakeup
> > > heavy things like pipe-bench and that java msg passing thing, right?
> > 
> > Yeah, it beat up switch heavy buddies pretty bad.
> 
> Do you have a pointer to the java benchmark? Also, is pipe-bench the same
> as the one described below?

I use vmark, find it _somewhat_ useful.  Not a lovely benchmark, but it
is highly affinity sensitive, and switches heftily.  I don't put much
value on it though, too extreme for me, but it is a ~decent indicator.

There are no doubt _lots_ better than vmark for java stuff.

I toss a variety pack at the box in obsessive-compulsive man mode when
testing.  Which benchmarks doesn't matter much, just need to be wide
spectrum and consistent.

> http://freecode.com/projects/pipebench

No, I use Ingo's pipe-test, but that to measure fastpath overhead.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  5:43                         ` Mike Galbraith
@ 2012-02-21  8:32                           ` Srivatsa Vaddagiri
  2012-02-21  9:21                             ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  8:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-21 06:43:18]:

> On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> > On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > > Maybe that has changed, but I doubt it.
> > 
> > Right, I thought I remembered some such, you could see it on wakeup
> > heavy things like pipe-bench and that java msg passing thing, right?
> 
> Yeah, it beat up switch heavy buddies pretty bad.

Do you have a pointer to the java benchmark? Also, is pipe-bench the same
as the one described below?

http://freecode.com/projects/pipebench

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  6:37                           ` Mike Galbraith
@ 2012-02-21  8:09                             ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  8:09 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-21 07:37:14]:

> On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:
> 
> > fwiw the patch I had sent does a wakeup balance within prev_cpu's
> > cache_domain (and not outside).
> 
> Yeah, the testing I did was to turn on the flag and measure.  With single
> domain scans, maybe select_idle_sibling() could just go away.

Yeah that's what I was thinking. Essentially find_idlest_group/cpu
should accomplish that job pretty much for us ...

> It ain't free either.

Will find out how much difference it makes ..

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-21  0:06                         ` Srivatsa Vaddagiri
@ 2012-02-21  6:37                           ` Mike Galbraith
  2012-02-21  8:09                             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-21  6:37 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2012-02-21 at 05:36 +0530, Srivatsa Vaddagiri wrote:

> fwiw the patch I had sent does a wakeup balance within prev_cpu's
> cache_domain (and not outside).

Yeah, the testing I did was to turn on the flag and measure.  With single
domain scans, maybe select_idle_sibling() could just go away.  It ain't
free either.

-Mike



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 19:07                       ` Peter Zijlstra
@ 2012-02-21  5:43                         ` Mike Galbraith
  2012-02-21  8:32                           ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-21  5:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 20:07 +0100, Peter Zijlstra wrote: 
> On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> > Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> > Maybe that has changed, but I doubt it.
> 
> Right, I thought I remembered some such, you could see it on wakeup
> heavy things like pipe-bench and that java msg passing thing, right?

Yeah, it beat up switch heavy buddies pretty bad.

-Mike




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:25                       ` Mike Galbraith
@ 2012-02-21  0:06                         ` Srivatsa Vaddagiri
  2012-02-21  6:37                           ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-21  0:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Mike Galbraith <efault@gmx.de> [2012-02-20 19:25:43]:

> > I can give that a try for my benchmark and see how much it helps. My
> > suspicion is it will not fully solve the problem I have on hand.
> 
> I doubt it will either.  Your problem is when it doesn't succeed, but
> you have an idle core available in another domain.

fwiw the patch I had sent does a wakeup balance within prev_cpu's
cache_domain (and not outside). It handles the case where we don't have
any idle cpu/core within prev_cpu's cache domain, in which case we look
for the next best thing (the least loaded cpu). I did see good numbers with
that (for both my benchmark and sysbench).

More on this later in the day ..

- vatsa



* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
@ 2012-02-20 19:07                       ` Peter Zijlstra
  2012-02-21  5:43                         ` Mike Galbraith
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 19:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
> Maybe that has changed, but I doubt it.

Right, I thought I remembered some such, you could see it on wakeup
heavy things like pipe-bench and that java msg passing thing, right?




* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
@ 2012-02-20 18:25                       ` Mike Galbraith
  2012-02-21  0:06                         ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2012-02-20 18:25 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Mon, 2012-02-20 at 20:33 +0530, Srivatsa Vaddagiri wrote: 
> * Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]:
> 
> > On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> > 
> > > ---
> > >  kernel/sched_fair.c |   10 ++--------
> > >  1 file changed, 2 insertions(+), 8 deletions(-)
> > > 
> > > Index: linux-3.0-tip/kernel/sched_fair.c
> > > ===================================================================
> > > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > > +++ linux-3.0-tip/kernel/sched_fair.c
> > > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> > >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> > >  			if (idle_cpu(i)) {
> > >  				target = i;
> > > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > > +					continue;
> > >  				break;
> > >  			}
> > >  		}
> > > -
> > > -		/*
> > > -		 * Lets stop looking for an idle sibling when we reached
> > > -		 * the domain that spans the current cpu and prev_cpu.
> > > -		 */
> > > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > -			break;
> > >  	}
> > >  	rcu_read_unlock();
> > 
> > Mike, Suresh, did we ever get this sorted? I was looking at
> > select_idle_sibling() and it looks like we dropped this.
> > 
> > Also, did anybody ever get an answer from a HW guy on why sharing stuff
> > over SMT threads is so much worse than sharing it over proper cores? Its
> > not like this workload actually does anything concurrently.
> > 
> > I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.
> 
> From a quick scan of that code, it seems to prefer selecting an idle cpu
> in the same cache domain (vs selecting prev_cpu in absence of a core
> that is fully idle).

Yes, that was the sole purpose of select_idle_sibling() from square one.
If you can mobilize a CPU without eating cache penalty, this is most
excellent for load ramp-up.  The gain is huge over affine wakeup if
there is any overlap to regain, i.e. it's not a 100% synchronous load.

> I can give that a try for my benchmark and see how much it helps. My
> suspicion is it will not fully solve the problem I have on hand.

I doubt it will either.  Your problem is when it doesn't succeed, but
you have an idle core available in another domain.  That's a whole
different ball game.  Yeah, you can reap benefit by doing wakeup
balancing, but you'd better look very closely at the cost.  I haven't
been able to do that lately, so dunno what cost is in the here and now,
but it used to be _way_ too expensive to consider, just as unrestricted
idle balancing is, or high frequency load balancing in general is.

-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 18:14                     ` Mike Galbraith
@ 2012-02-20 18:15                       ` Peter Zijlstra
  2012-02-20 19:07                       ` Peter Zijlstra
  2012-02-23 10:49                       ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 18:15 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 19:14 +0100, Mike Galbraith wrote:
> I thought this was pretty much sorted.  We want to prefer core over
> sibling, because on at least some modern CPUs with L3, siblings suck
> rocks. 

Yeah, I've since figured out how it's supposed to (and supposedly does)
work. Suresh was a bit too clever and forgot to put a comment in, since
clearly it was obvious at the time ;-)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 14:41                   ` Peter Zijlstra
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
@ 2012-02-20 18:14                     ` Mike Galbraith
  2012-02-20 18:15                       ` Peter Zijlstra
                                         ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Mike Galbraith @ 2012-02-20 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Mon, 2012-02-20 at 15:41 +0100, Peter Zijlstra wrote: 
> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> 
> > ---
> >  kernel/sched_fair.c |   10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> >  			if (idle_cpu(i)) {
> >  				target = i;
> > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > +					continue;
> >  				break;
> >  			}
> >  		}
> > -
> > -		/*
> > -		 * Lets stop looking for an idle sibling when we reached
> > -		 * the domain that spans the current cpu and prev_cpu.
> > -		 */
> > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > -			break;
> >  	}
> >  	rcu_read_unlock();
> 
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.

I thought this was pretty much sorted.  We want to prefer core over
sibling, because on at least some modern CPUs with L3, siblings suck
rocks.

> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores?

No.  My numbers on Westmere indicated to me that siblings do not share
L2, making them fairly worthless.  Hard facts we never got.

> Its
> not like this workload actually does anything concurrently.
> 
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

I really really need to find time to do systematic mainline testing.

Enabling SD_BALANCE_WAKE used to be decidedly too expensive to consider.
Maybe that has changed, but I doubt it.  (general aside: testing with a
bloated distro config is a big mistake)

-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2012-02-20 14:41                   ` Peter Zijlstra
@ 2012-02-20 15:03                     ` Srivatsa Vaddagiri
  2012-02-20 18:25                       ` Mike Galbraith
  2012-02-20 18:14                     ` Mike Galbraith
  1 sibling, 1 reply; 49+ messages in thread
From: Srivatsa Vaddagiri @ 2012-02-20 15:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

* Peter Zijlstra <peterz@infradead.org> [2012-02-20 15:41:01]:

> On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:
> 
> > ---
> >  kernel/sched_fair.c |   10 ++--------
> >  1 file changed, 2 insertions(+), 8 deletions(-)
> > 
> > Index: linux-3.0-tip/kernel/sched_fair.c
> > ===================================================================
> > --- linux-3.0-tip.orig/kernel/sched_fair.c
> > +++ linux-3.0-tip/kernel/sched_fair.c
> > @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
> >  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> >  			if (idle_cpu(i)) {
> >  				target = i;
> > +				if (sd->flags & SD_SHARE_CPUPOWER)
> > +					continue;
> >  				break;
> >  			}
> >  		}
> > -
> > -		/*
> > -		 * Lets stop looking for an idle sibling when we reached
> > -		 * the domain that spans the current cpu and prev_cpu.
> > -		 */
> > -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> > -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > -			break;
> >  	}
> >  	rcu_read_unlock();
> 
> Mike, Suresh, did we ever get this sorted? I was looking at
> select_idle_sibling() and it looks like we dropped this.
> 
> Also, did anybody ever get an answer from a HW guy on why sharing stuff
> over SMT threads is so much worse than sharing it over proper cores? Its
> not like this workload actually does anything concurrently.
> 
> I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.

From a quick scan of that code, it seems to prefer selecting an idle cpu
in the same cache domain (vs selecting prev_cpu in absence of a core
that is fully idle). 
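
i.e., roughly this preference order on wakeup (a sketch of the quoted
code's behavior, not a quote of it):

	1. target itself, if it is already idle
	2. an idle cpu in the same cache domain, preferring a fully idle
	   core over an idle sibling thread of a busy core
	3. otherwise, fall back to target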

I can give that a try for my benchmark and see how much it helps. My
suspicion is it will not fully solve the problem I have on hand.

- vatsa


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-18 15:14                 ` Mike Galbraith
@ 2012-02-20 14:41                   ` Peter Zijlstra
  2012-02-20 15:03                     ` Srivatsa Vaddagiri
  2012-02-20 18:14                     ` Mike Galbraith
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2012-02-20 14:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner,
	Srivatsa Vaddagiri

On Fri, 2011-11-18 at 16:14 +0100, Mike Galbraith wrote:

> ---
>  kernel/sched_fair.c |   10 ++--------
>  1 file changed, 2 insertions(+), 8 deletions(-)
> 
> Index: linux-3.0-tip/kernel/sched_fair.c
> ===================================================================
> --- linux-3.0-tip.orig/kernel/sched_fair.c
> +++ linux-3.0-tip/kernel/sched_fair.c
> @@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
>  		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
>  			if (idle_cpu(i)) {
>  				target = i;
> +				if (sd->flags & SD_SHARE_CPUPOWER)
> +					continue;
>  				break;
>  			}
>  		}
> -
> -		/*
> -		 * Lets stop looking for an idle sibling when we reached
> -		 * the domain that spans the current cpu and prev_cpu.
> -		 */
> -		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
> -		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> -			break;
>  	}
>  	rcu_read_unlock();

Mike, Suresh, did we ever get this sorted? I was looking at
select_idle_sibling() and it looks like we dropped this.

Also, did anybody ever get an answer from a HW guy on why sharing stuff
over SMT threads is so much worse than sharing it over proper cores? Its
not like this workload actually does anything concurrently.

I was looking at this code due to vatsa wanting to do SD_BALANCE_WAKE.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-18 15:12               ` Peter Zijlstra
@ 2011-11-18 15:26                 ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Fri, 2011-11-18 at 16:12 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote:
> > 
> > From: Suresh Siddha <suresh.b.siddha@intel.com>
> > Subject: sched: cleanup domain traversal in select_idle_sibling
> > 
> > Instead of going through the scheduler domain hierarchy multiple times
> > (for giving priority to an idle core over an idle SMT sibling in a busy
> > core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
> > flag and traverse the domain hierarchy down till we find an idle group.
> > 
> > This cleanup also addresses an issue reported by Mike where the recent
> > changes returned the busy thread even in the presence of an idle SMT
> > sibling in single socket platforms.
> > 
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > Cc: Mike Galbraith <efault@gmx.de> 
> 
> Thanks Suresh!

Tested-by: Mike Galbraith <efault@gmx.de>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 17:36               ` Suresh Siddha
@ 2011-11-18 15:14                 ` Mike Galbraith
  2012-02-20 14:41                   ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-18 15:14 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 09:36 -0800, Suresh Siddha wrote:
> On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> > On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > > Something like the below maybe, although I'm certain it all can be
> > > written much nicer indeed.
> > 
> > I'll give it a go.
> > 
> > Squabbling with bouncing buddies in an isolated and otherwise idle
> > cpuset ate my day.
> >  
> 
> Well, looks like I managed to have a similar issue in my patch too.
> Anyway, here is the updated, cleaned-up version of the patch ;)

Works fine.  However, unpinned buddies bounce more than with virgin
mainline.  I tried doing it differently (mikie in numbers below), and it
worked for a single unbound pair, but wrecked multiple unbound pairs.

---
 kernel/sched_fair.c |   10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

Index: linux-3.0-tip/kernel/sched_fair.c
===================================================================
--- linux-3.0-tip.orig/kernel/sched_fair.c
+++ linux-3.0-tip/kernel/sched_fair.c
@@ -2276,17 +2276,11 @@ static int select_idle_sibling(struct ta
 		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
 			if (idle_cpu(i)) {
 				target = i;
+				if (sd->flags & SD_SHARE_CPUPOWER)
+					continue;
 				break;
 			}
 		}
-
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
 	}
 	rcu_read_unlock();
 
 

mikie2 is your patch + twiddles I'll post as a reply to this post.
  
kernel v3.2-rc1-306-g7f80850

TTWU_QUEUE off (skews results), test in cpuset 1-3,5-7

Test1: one unbound TCP_RR pair, three runs

virgin      66611.73      71376.00      61297.09                       avg 66428.27   1.000
suresh      68488.88      68412.48      68149.73 (bounce)                  68350.36   1.028
mikie       75925.91      75851.63      74617.29 (bounce--)                75464.94   1.136
mikie2      71403.39      71396.73      72258.91 NO_SIBLING_LIMIT_SYNC     71686.34   1.079
mikie2     139210.06     140485.95     140189.95 SIBLING_LIMIT_SYNC       139961.98   2.106


Test2: one unbound TCP_RR pair plus 2 unbound hogs, three runs

virgin      87108.59      88737.30      87383.98                       avg 87743.29  1.000
suresh      84281.24      84725.07      84823.57                           84931.93   .967
mikie       87850.37      86081.73      85789.49                           86573.86   .986
mikie2      92613.79      92022.95      92014.26 NO_SIBLING_LIMIT_SYNC     92217.00  1.050
mikie2     134682.16     133497.30     133584.48 SIBLING_LIMIT_SYNC


Test3: three unbound TCP_RR pairs, single run

virgin      55246.99      55138.67      55248.95                       avg 55211.53  1.000
suresh      53141.24      53165.45      53224.71                           53177.13   .963
mikie       47627.14      47361.68      47389.41                           47459.41   .859
mikie2      57969.49      57704.79      58218.14 NO_SIBLING_LIMIT_SYNC     57964.14  1.049
mikie2     132205.11     133726.94     133706.09 SIBLING_LIMIT_SYNC       133212.71  2.412


Test4: three bound TCP_RR pairs, single run

virgin     130073.67     130202.02     131666.48                      avg 130647.39  1.000
suresh     129805.98     128058.25     128709.77                          128858.00   .986
mikie      125597.11     127260.39     127208.73                          126688.74   .969
mikie2     135441.58     134961.89     137162.00                          135855.15  1.039


Test5: drop shield, tbench 8

virgin     2118.26 MB/sec  1.000
suresh     2036.32 MB/sec   .961
mikie      2051.18 MB/sec   .968
mikie2     2125.21 MB/sec  1.003  (hohum, all within tbench jitter)

Problem reference: select_idle_sibling() = painful L2 misses with Westmere.

Identical configs, nohz=off NO_TTWU_QUEUE,
processor.max_cstate=0 intel_idle.max_cstate=0
turbo-boost off (so both are now plain 2.4GHz boxen)

single bound TCP_RR pair
              E5620      Q6600  bound
           90196.84   42517.96  3->0
           92654.92   43946.50  3->1
           91735.26   95274.10  3->2
          129394.55   95266.83  3->3
           89127.98             3->4
           91303.15             3->5
           91345.85             3->6
           74141.88             3->7  huh?.. load is synchronous!



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 19:08             ` Suresh Siddha
@ 2011-11-18 15:12               ` Peter Zijlstra
  2011-11-18 15:26                 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2011-11-18 15:12 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 11:08 -0800, Suresh Siddha wrote:
> 
> From: Suresh Siddha <suresh.b.siddha@intel.com>
> Subject: sched: cleanup domain traversal in select_idle_sibling
> 
> Instead of going through the scheduler domain hierarchy multiple times
> (for giving priority to an idle core over an idle SMT sibling in a busy
> core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
> flag and traverse the domain hierarchy down till we find an idle group.
> 
> This cleanup also addresses an issue reported by Mike where the recent
> changes returned the busy thread even in the presence of an idle SMT
> sibling in single socket platforms.
> 
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> Cc: Mike Galbraith <efault@gmx.de> 

Thanks Suresh!

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:56           ` Peter Zijlstra
  2011-11-17 16:38             ` Mike Galbraith
@ 2011-11-17 19:08             ` Suresh Siddha
  2011-11-18 15:12               ` Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-17 19:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 07:56 -0800, Peter Zijlstra wrote:
> D'0h, indeed.. 
> 
> Something like the below maybe, although I'm certain it all can be
> written much nicer indeed.
> 

Peter, I just noticed that the -tip tree has the original proposed patch
and the new sched/ directory. So updated my cleanup patch accordingly.
Thanks.
---

From: Suresh Siddha <suresh.b.siddha@intel.com>
Subject: sched: cleanup domain traversal in select_idle_sibling

Instead of going through the scheduler domain hierarchy multiple times
(for giving priority to an idle core over an idle SMT sibling in a busy
core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES
flag and traverse the domain hierarchy down till we find an idle group.

This cleanup also addresses an issue reported by Mike where the recent
changes returned the busy thread even in the presence of an idle SMT
sibling in single socket platforms.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/fair.c  |   38 +++++++++++++++++++++++++-------------
 kernel/sched/sched.h |    2 ++
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..537e16a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2642,6 +2642,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2651,7 +2673,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i, smt = 0;
+	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2671,19 +2693,9 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-again:
-	for_each_domain(target, sd) {
-		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
-			continue;
-
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
-			break;
-		}
 
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
 			if (!cpumask_intersects(sched_group_cpus(sg),
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c2e7802..8715055 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -501,6 +501,8 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
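
Taken together, the two helpers give the traversal shape used in
select_idle_sibling() above. A minimal usage sketch (assuming a valid
cpu and the RCU read lock held, as in the patch):

	struct sched_domain *sd;

	/* climb to the widest domain that still shares package resources */
	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	/* then walk back down it, widest groups (whole cores) first */
	for_each_lower_domain(sd) {
		/* scan sd->groups for a fully idle group here */
	}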



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 16:38             ` Mike Galbraith
@ 2011-11-17 17:36               ` Suresh Siddha
  2011-11-18 15:14                 ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-17 17:36 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 08:38 -0800, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> > Something like the below maybe, although I'm certain it all can be
> > written much nicer indeed.
> 
> I'll give it a go.
> 
> Squabbling with bouncing buddies in an isolated and otherwise idle
> cpuset ate my day.
>  

Well, looks like I managed to have a similar issue in my patch too.
Anyway, here is the updated, cleaned-up version of the patch ;)
---

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   57 ++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..020483a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,28 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd, *hsd = NULL;
+
+	for_each_domain(cpu, sd) {
+		if (!(sd->flags & flag))
+			break;
+		hsd = sd;
+	}
+
+	return hsd;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2271,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2292,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:56           ` Peter Zijlstra
@ 2011-11-17 16:38             ` Mike Galbraith
  2011-11-17 17:36               ` Suresh Siddha
  2011-11-17 19:08             ` Suresh Siddha
  1 sibling, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17 16:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 16:56 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote:
> > On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> > > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> > 
> > > > It should be ok. Isn't it?
> > > 
> > > Nope, wasn't ok.
> > 
> > Because you can't get to again: with a single E5620 package, it having
> > only SIBLING and MC domains.
> >  
> > again:
> >         for_each_domain(target, sd) {
> >                 if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> >                         continue;
> > 
> >                 if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> >                         if (!smt) {
> >                                 smt = 1;
> >                                 goto again;
> > 
> 
> D'0h, indeed.. 
> 
> Something like the below maybe, although I'm certain it all can be
> written much nicer indeed.

I'll give it a go.

Squabbling with bouncing buddies in an isolated and otherwise idle
cpuset ate my day.
 
> ---
>  kernel/sched/fair.c |   14 ++++++++------
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cd3b642..340e62e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
>  			continue;
>  
> -		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> -			if (!smt) {
> -				smt = 1;
> -				goto again;
> -			}
> +		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
> +			break;
> +
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
>  			break;
> -		}
>  
>  		sg = sd->groups;
>  		do {
> @@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  			sg = sg->next;
>  		} while (sg != sd->groups);
>  	}
> +	if (!smt) {
> +		smt = 1;
> +		goto again;
> +	}
>  done:
>  	rcu_read_unlock();
>  
> 
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17 15:38         ` Mike Galbraith
@ 2011-11-17 15:56           ` Peter Zijlstra
  2011-11-17 16:38             ` Mike Galbraith
  2011-11-17 19:08             ` Suresh Siddha
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2011-11-17 15:56 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Suresh Siddha, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 16:38 +0100, Mike Galbraith wrote:
> On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> > On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> 
> > > It should be ok. Isn't it?
> > 
> > Nope, wasn't ok.
> 
> Because you can't get to again: with a single E5620 package, it having
> only SIBLING and MC domains.
>  
> again:
>         for_each_domain(target, sd) {
>                 if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
>                         continue;
> 
>                 if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
>                         if (!smt) {
>                                 smt = 1;
>                                 goto again;
> 

D'0h, indeed.. 

Something like the below maybe, although I'm certain it all can be
written much nicer indeed.

---
 kernel/sched/fair.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd3b642..340e62e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2676,13 +2676,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
 		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
 			continue;
 
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
-			if (!smt) {
-				smt = 1;
-				goto again;
-			}
+		if (smt && !(sd->flags & SD_SHARE_CPUPOWER))
+			break;
+
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
 			break;
-		}
 
 		sg = sd->groups;
 		do {
@@ -2702,6 +2700,10 @@ static int select_idle_sibling(struct task_struct *p, int target)
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
+	if (!smt) {
+		smt = 1;
+		goto again;
+	}
 done:
 	rcu_read_unlock();
 



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-17  1:59       ` Mike Galbraith
@ 2011-11-17 15:38         ` Mike Galbraith
  2011-11-17 15:56           ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17 15:38 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Thu, 2011-11-17 at 02:59 +0100, Mike Galbraith wrote:
> On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:

> > It should be ok. Isn't it?
> 
> Nope, wasn't ok.

Because you can't get to again: with a single E5620 package, it having
only SIBLING and MC domains.
 
again:
        for_each_domain(target, sd) {
                if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
                        continue;

                if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
                        if (!smt) {
                                smt = 1;
                                goto again;


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16 18:37     ` Suresh Siddha
@ 2011-11-17  1:59       ` Mike Galbraith
  2011-11-17 15:38         ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-17  1:59 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 10:37 -0800, Suresh Siddha wrote:
> On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:

> > At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.
> 
> Mike, At the sibling level, domain span will be 0,4 which is 0x5. But
> there are two individual groups. First group just contains cpu0 and the
> second group contains cpu4.
> 
> So if cpu0 is busy, we will check the next group to see if it is idle
> (which is cpu4 in your example). So we will return cpu-4.

Oh duh, right.

> It should be ok. Isn't it?

Nope, wasn't ok.  I'll double check today though, and bend, spindle
and/or mutilate as required.

	-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16  9:24   ` Mike Galbraith
@ 2011-11-16 18:37     ` Suresh Siddha
  2011-11-17  1:59       ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-16 18:37 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Wed, 2011-11-16 at 01:24 -0800, Mike Galbraith wrote:
> On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:
> 
> > How about this patch which is more self explanatory?
> 
> Actually, after further testing/reading, it looks to me like both of
> these patches have a problem.  They'll never select SMT siblings (so a
> 'skip SIBLING' check should accomplish the same).
> 
> This patch didn't select an idle core either though, where Peter's did.
> 
> Tested by pinning a hog to cores 1-3, then starting up an unpinned
> tbench pair.  Peter's patch didn't do the BadThing (as in bad for
> TCP_RR/tbench) in that case, but should have.
> 
> > +		sg = sd->groups;
> > +		do {
> > +			if (!cpumask_intersects(sched_group_cpus(sg),
> > +						tsk_cpus_allowed(p)))
> > +				goto next;
> >  
> > +			for_each_cpu(i, sched_group_cpus(sg)) {
> > +				if (!idle_cpu(i))
> > +					goto next;
> 
> Say target is CPU0.  Groups are 0,4 1,5 2,6 3,7.  0-3 are first CPUs
> encountered in MC groups, all were busy.  At SIBLING level, the only
> group is 0,4.  First encountered CPU of sole group is busy target, so
> we're done.. so we return busy target.
> 
> > +			target = cpumask_first_and(sched_group_cpus(sg),
> > +					tsk_cpus_allowed(p));
> 
> At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

Mike, At the sibling level, domain span will be 0,4 which is 0x5. But
there are two individual groups. First group just contains cpu0 and the
second group contains cpu4.

So if cpu0 is busy, we will check the next group to see if it is idle
(which is cpu4 in your example). So we will return cpu-4.
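
To make the traversal concrete, the topology in your example looks like
this (one quad-core HT package, viewed from cpu0):

	SIBLING domain: span {0,4}   groups: {0} {4}
	MC domain:      span {0-7}   groups: {0,4} {1,5} {2,6} {3,7}

so a busy cpu0 at the SIBLING level still leaves the {4} group to be
checked, and an idle cpu4 gets returned.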

It should be ok. Isn't it?

thanks,
suresh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-16  1:14 ` Suresh Siddha
@ 2011-11-16  9:24   ` Mike Galbraith
  2011-11-16 18:37     ` Suresh Siddha
  0 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2011-11-16  9:24 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Paul Turner

On Tue, 2011-11-15 at 17:14 -0800, Suresh Siddha wrote:

> How about this patch which is more self explanatory?

Actually, after further testing/reading, it looks to me like both of
these patches have a problem.  They'll never select SMT siblings (so a
'skip SIBLING' check should accomplish the same).

This patch didn't select an idle core either though, where Peter's did.

Tested by pinning a hog to cores 1-3, then starting up an unpinned
tbench pair.  Peter's patch didn't do the BadThing (as in bad for
TCP_RR/tbench) in that case, but should have.

> +		sg = sd->groups;
> +		do {
> +			if (!cpumask_intersects(sched_group_cpus(sg),
> +						tsk_cpus_allowed(p)))
> +				goto next;
>  
> +			for_each_cpu(i, sched_group_cpus(sg)) {
> +				if (!idle_cpu(i))
> +					goto next;

Say target is CPU0.  Groups are 0,4 1,5 2,6 3,7.  0-3 are first CPUs
encountered in MC groups, all were busy.  At SIBLING level, the only
group is 0,4.  First encountered CPU of sole group is busy target, so
we're done.. so we return busy target.

> +			target = cpumask_first_and(sched_group_cpus(sg),
> +					tsk_cpus_allowed(p));

At SIBLING, group = 0,4 = 0x5, 0x5 & 0xff = 1 = target.

	-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: sched: Avoid SMT siblings in select_idle_sibling() if possible
  2011-11-15  9:46 Peter Zijlstra
@ 2011-11-16  1:14 ` Suresh Siddha
  2011-11-16  9:24   ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Suresh Siddha @ 2011-11-16  1:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, Paul Turner, Mike Galbraith

On Tue, 2011-11-15 at 01:46 -0800, Peter Zijlstra wrote:
> @@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
>  	 * Otherwise, iterate the domains and find an elegible idle cpu.
>  	 */
>  	rcu_read_lock();
> +again:
>  	for_each_domain(target, sd) {
> -		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
> -			break;
> +		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
> +			continue;
>  
> -		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
> -			if (idle_cpu(i)) {
> -				target = i;
> -				break;
> +		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
> +			if (!smt) {
> +				smt = 1;
> +				goto again;
>  			}
> +			break;
>  		}

It looks like you will be checking the core domain twice (with smt == 0
and smt == 1) if there are no idle siblings.
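
Tracing the quoted loop makes that visible (assuming a multi-socket box,
so there is a domain above MC without SD_SHARE_PKG_RESOURCES, and no
idle cpu anywhere):

	pass 1 (smt == 0): SIBLING skipped via the SD_SHARE_CPUPOWER test,
	                   MC (core) groups scanned, then smt = 1, goto again;
	pass 2 (smt == 1): SIBLING groups scanned, then the MC groups
	                   scanned a second time before the break.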

How about this patch which is more self explanatory?
---

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    2 +
 kernel/sched_fair.c |   54 +++++++++++++++++++++++++++++++++++---------------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0e9344a..4b0bc6a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -734,6 +734,8 @@ static inline int cpu_of(struct rq *rq)
 #define for_each_domain(cpu, __sd) \
 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); __sd; __sd = __sd->parent)
 
+#define for_each_lower_domain(sd) for (; sd; sd = sd->child)
+
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		(&__get_cpu_var(runqueues))
 #define task_rq(p)		cpu_rq(task_cpu(p))
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..cb7a5ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2241,6 +2241,25 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/**
+ * highest_flag_domain - Return highest sched_domain containing flag.
+ * @cpu:	The cpu whose highest level of sched domain is to
+ *		be returned.
+ * @flag:	The flag to check for the highest sched_domain
+ *		for the given cpu.
+ *
+ * Returns the highest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
+{
+	struct sched_domain *sd;
+
+	for_each_domain(cpu, sd)
+		if (!(sd->flags & flag))
+			return sd->child;
+	return NULL;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -2249,6 +2268,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
+	struct sched_group *sg;
 	int i;
 
 	/*
@@ -2269,25 +2289,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
-	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+	sd = highest_flag_domain(target, SD_SHARE_PKG_RESOURCES);
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
 			}
-		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* sched: Avoid SMT siblings in select_idle_sibling() if possible
@ 2011-11-15  9:46 Peter Zijlstra
  2011-11-16  1:14 ` Suresh Siddha
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2011-11-15  9:46 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar; +Cc: Paul Turner, Suresh Siddha, Mike Galbraith

Subject: sched: Avoid SMT siblings in select_idle_sibling() if possible
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 10 Nov 2011 13:01:10 +0100

Avoid select_idle_sibling() from picking a sibling thread if there's
an idle core that shares cache.

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-smlz1ah4jwood10f7eml9gzk@git.kernel.org
---
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2326,7 +2326,8 @@ static int select_idle_sibling(struct ta
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
-	int i;
+	struct sched_group *sg;
+	int i, smt = 0;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
@@ -2346,25 +2347,38 @@ static int select_idle_sibling(struct ta
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
 	rcu_read_lock();
+again:
 	for_each_domain(target, sd) {
-		if (!(sd->flags & SD_SHARE_PKG_RESOURCES))
-			break;
+		if (!smt && (sd->flags & SD_SHARE_CPUPOWER))
+			continue;
 
-		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
-			if (idle_cpu(i)) {
-				target = i;
-				break;
+		if (!(sd->flags & SD_SHARE_PKG_RESOURCES)) {
+			if (!smt) {
+				smt = 1;
+				goto again;
 			}
+			break;
 		}
 
-		/*
-		 * Lets stop looking for an idle sibling when we reached
-		 * the domain that spans the current cpu and prev_cpu.
-		 */
-		if (cpumask_test_cpu(cpu, sched_domain_span(sd)) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
-			break;
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!idle_cpu(i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
 	}
+done:
 	rcu_read_unlock();
 
 	return target;


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2012-03-27 13:56 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1329764866.2293.376.camhel@twins>
2012-03-05 15:24 ` sched: Avoid SMT siblings in select_idle_sibling() if possible Srivatsa Vaddagiri
2012-03-06  9:14   ` Ingo Molnar
2012-03-06 10:03     ` Srivatsa Vaddagiri
2012-03-22 15:32     ` Srivatsa Vaddagiri
2012-03-23  6:38       ` Mike Galbraith
2012-03-26  8:29       ` Peter Zijlstra
2012-03-26  8:36       ` Peter Zijlstra
2012-03-26 17:35         ` Srivatsa Vaddagiri
2012-03-26 18:06           ` Peter Zijlstra
2012-03-27 13:56             ` Mike Galbraith
2011-11-15  9:46 Peter Zijlstra
2011-11-16  1:14 ` Suresh Siddha
2011-11-16  9:24   ` Mike Galbraith
2011-11-16 18:37     ` Suresh Siddha
2011-11-17  1:59       ` Mike Galbraith
2011-11-17 15:38         ` Mike Galbraith
2011-11-17 15:56           ` Peter Zijlstra
2011-11-17 16:38             ` Mike Galbraith
2011-11-17 17:36               ` Suresh Siddha
2011-11-18 15:14                 ` Mike Galbraith
2012-02-20 14:41                   ` Peter Zijlstra
2012-02-20 15:03                     ` Srivatsa Vaddagiri
2012-02-20 18:25                       ` Mike Galbraith
2012-02-21  0:06                         ` Srivatsa Vaddagiri
2012-02-21  6:37                           ` Mike Galbraith
2012-02-21  8:09                             ` Srivatsa Vaddagiri
2012-02-20 18:14                     ` Mike Galbraith
2012-02-20 18:15                       ` Peter Zijlstra
2012-02-20 19:07                       ` Peter Zijlstra
2012-02-21  5:43                         ` Mike Galbraith
2012-02-21  8:32                           ` Srivatsa Vaddagiri
2012-02-21  9:21                             ` Mike Galbraith
2012-02-21 10:37                               ` Peter Zijlstra
2012-02-21 14:58                                 ` Srivatsa Vaddagiri
2012-02-23 10:49                       ` Srivatsa Vaddagiri
2012-02-23 11:19                         ` Ingo Molnar
2012-02-23 12:18                           ` Srivatsa Vaddagiri
2012-02-23 11:20                         ` Srivatsa Vaddagiri
2012-02-23 11:26                           ` Ingo Molnar
2012-02-23 11:32                             ` Srivatsa Vaddagiri
2012-02-23 16:17                               ` Ingo Molnar
2012-02-23 11:21                         ` Mike Galbraith
2012-02-25  6:54                           ` Srivatsa Vaddagiri
2012-02-25  8:30                             ` Mike Galbraith
2012-02-27 22:11                               ` Suresh Siddha
2012-02-28  5:05                                 ` Mike Galbraith
2011-11-17 19:08             ` Suresh Siddha
2011-11-18 15:12               ` Peter Zijlstra
2011-11-18 15:26                 ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).