* [RFC][PATCH 0/5] various sched and numa bits
@ 2012-05-01 18:14 Peter Zijlstra
  2012-05-01 18:14 ` [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Peter Zijlstra @ 2012-05-01 18:14 UTC (permalink / raw)
  To: mingo, pjt, vatsa, suresh.b.siddha, efault; +Cc: linux-kernel

Hi,

The first two patches change how the load-balancer traverses the sched_domain
tree. Currently the leftmost (or first idle) cpu of a group goes one level up;
the first patch changes that to be the least loaded cpu. The second adds a
little serialization to the sched_domain traversal, so that no two cpus of the
same group go up at the same time.

Paul, can you run these through linsched to see if they make anything worse?
They make conceptual sense, but that never says much these days :/

The following two patches extend NUMA emulation and were used to test the last
patch.

The last patch does a complete re-implementation of CONFIG_NUMA support for
the scheduler and should get us a topology that matches the NUMA interconnects,
as opposed to the semi-random stuff we have now. The code assumes a number of
things which I hope are true, but lacking any interesting hardware what do I
know... It's tested using the node_distance() table from a quad-socket AMD
Magny-Cours, which is a non-fully-connected system -- see 3/5.
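
For illustration only, the rough shape of the idea (this is not the actual
patch; the names and storage are made up for the sketch, and sorting/sizing
of the levels is hand-waved):

/*
 * Collect the unique node_distance() values, then for each such distance
 * give every node the mask of nodes at most that far away.  Each distance
 * becomes one scheduler topology level, so the domain spans follow the
 * real interconnect instead of one flat "all remote nodes" level.
 */
static int numa_level[MAX_NUMNODES];			/* unique distances */
static int nr_numa_levels;
static nodemask_t numa_mask[MAX_NUMNODES][MAX_NUMNODES];/* [level][node] */

static void __init sketch_numa_levels(void)
{
	int i, j, k;

	for_each_online_node(i) {
		for_each_online_node(j) {
			int d = node_distance(i, j);

			for (k = 0; k < nr_numa_levels; k++)
				if (numa_level[k] == d)
					break;
			if (k == nr_numa_levels)
				numa_level[nr_numa_levels++] = d;
		}
	}

	for (k = 0; k < nr_numa_levels; k++)
		for_each_online_node(i)
			for_each_online_node(j)
				if (node_distance(i, j) <= numa_level[k])
					node_set(j, numa_mask[k][i]);
}

Each numa_mask[k] would then seed one sched_domain level; with the
Magny-Cours table from 3/5 that gives, besides the node-local level, a
distance-16 level whose span differs per node and a distance-22 level
spanning the whole machine.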





* [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group
  2012-05-01 18:14 [RFC][PATCH 0/5] various sched and numa bits Peter Zijlstra
@ 2012-05-01 18:14 ` Peter Zijlstra
  2012-05-02 10:25   ` Srivatsa Vaddagiri
  2012-05-01 18:14 ` [RFC][PATCH 2/5] sched, fair: Add some serialization to the sched_domain load-balance walk Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2012-05-01 18:14 UTC (permalink / raw)
  To: mingo, pjt, vatsa, suresh.b.siddha, efault; +Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-balance-min-cpu.patch --]
[-- Type: text/plain, Size: 1476 bytes --]

Currently we let the leftmost (or first idle) cpu ascend the
sched_domain tree and perform load-balancing. The result is that the
busiest cpu in the group might be performing this function and pull
more load to itself. The next load-balance pass will then try to
equalize this again.

Change this to pick the least loaded cpu to perform the higher-domain
balancing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/fair.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -3779,7 +3779,8 @@ static inline void update_sg_lb_stats(st
 {
 	unsigned long load, max_cpu_load, min_cpu_load, max_nr_running;
 	int i;
-	unsigned int balance_cpu = -1, first_idle_cpu = 0;
+	unsigned int balance_cpu = -1;
+	unsigned long balance_load = ~0UL;
 	unsigned long avg_load_per_task = 0;
 
 	if (local_group)
@@ -3795,12 +3796,11 @@ static inline void update_sg_lb_stats(st
 
 		/* Bias balancing toward cpus of our domain */
 		if (local_group) {
-			if (idle_cpu(i) && !first_idle_cpu) {
-				first_idle_cpu = 1;
+			load = target_load(i, load_idx);
+			if (load < balance_load || idle_cpu(i)) {
+				balance_load = load;
 				balance_cpu = i;
 			}
-
-			load = target_load(i, load_idx);
 		} else {
 			load = source_load(i, load_idx);
 			if (load > max_cpu_load) {




* [RFC][PATCH 2/5] sched, fair: Add some serialization to the sched_domain load-balance walk
  2012-05-01 18:14 [RFC][PATCH 0/5] various sched and numa bits Peter Zijlstra
  2012-05-01 18:14 ` [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group Peter Zijlstra
@ 2012-05-01 18:14 ` Peter Zijlstra
  2012-05-01 18:14 ` [RFC][PATCH 3/5] x86: Allow specifying node_distance() for numa=fake Peter Zijlstra
  2012-05-01 18:14 ` [RFC][PATCH 4/5] x86: Hard partition cpu topology masks on node boundaries Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2012-05-01 18:14 UTC (permalink / raw)
  To: mingo, pjt, vatsa, suresh.b.siddha, efault; +Cc: linux-kernel, Peter Zijlstra

[-- Attachment #1: sched-balance-serialize.patch --]
[-- Type: text/plain, Size: 2296 bytes --]

Since the sched_domain walk is completely unserialized (!SD_SERIALIZE),
it is possible that multiple cpus of the same group get elected to
balance the next level up. Avoid this by adding some serialization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 ++
 kernel/sched/fair.c   |    9 +++++++--
 3 files changed, 10 insertions(+), 2 deletions(-)
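
For clarity, the claim/release protocol in isolation (a simplified sketch of
the hunks below, not extra code in the patch): the cpu that
update_sg_lb_stats() designates for its local group must also win a cmpxchg
on group->balance_cpu before it may balance the parent level, and it drops
that claim again on its way out of rebalance_domains():

	/* in update_sg_lb_stats(), local group, !CPU_NEWLY_IDLE: */
	if (balance_cpu != this_cpu ||
	    cmpxchg(&group->balance_cpu, -1, balance_cpu) != -1) {
		*balance = 0;	/* some other cpu of this group went up */
		return;
	}

	/* in rebalance_domains(), after the domain walk: */
	for (sd = last; sd; sd = sd->child)
		(void)cmpxchg(&sd->groups->balance_cpu, cpu, -1);	/* release */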

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -927,6 +927,7 @@ struct sched_group_power {
 struct sched_group {
 	struct sched_group *next;	/* Must be a circular list */
 	atomic_t ref;
+	int balance_cpu;
 
 	unsigned int group_weight;
 	struct sched_group_power *sgp;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6057,6 +6057,7 @@ build_overlap_sched_groups(struct sched_
 
 		sg->sgp = *per_cpu_ptr(sdd->sgp, cpumask_first(sg_span));
 		atomic_inc(&sg->sgp->ref);
+		sg->balance_cpu = -1;
 
 		if (cpumask_test_cpu(cpu, sg_span))
 			groups = sg;
@@ -6132,6 +6133,7 @@ build_sched_groups(struct sched_domain *
 
 		cpumask_clear(sched_group_cpus(sg));
 		sg->sgp->power = 0;
+		sg->balance_cpu = -1;
 
 		for_each_cpu(j, span) {
 			if (get_group(j, sdd, NULL) != group)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3831,7 +3831,8 @@ static inline void update_sg_lb_stats(st
 	 */
 	if (local_group) {
 		if (idle != CPU_NEWLY_IDLE) {
-			if (balance_cpu != this_cpu) {
+			if (balance_cpu != this_cpu ||
+			    cmpxchg(&group->balance_cpu, -1, balance_cpu) != -1) {
 				*balance = 0;
 				return;
 			}
@@ -4933,7 +4934,7 @@ static void rebalance_domains(int cpu, e
 	int balance = 1;
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long interval;
-	struct sched_domain *sd;
+	struct sched_domain *sd, *last = NULL;
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 	int update_next_balance = 0;
@@ -4943,6 +4944,7 @@ static void rebalance_domains(int cpu, e
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		last = sd;
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
@@ -4987,6 +4989,9 @@ static void rebalance_domains(int cpu, e
 		if (!balance)
 			break;
 	}
+	for (sd = last; sd; sd = sd->child)
+		(void)cmpxchg(&sd->groups->balance_cpu, cpu, -1);
+
 	rcu_read_unlock();
 
 	/*




* [RFC][PATCH 3/5] x86: Allow specifying node_distance() for numa=fake
  2012-05-01 18:14 [RFC][PATCH 0/5] various sched and numa bits Peter Zijlstra
  2012-05-01 18:14 ` [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group Peter Zijlstra
  2012-05-01 18:14 ` [RFC][PATCH 2/5] sched, fair: Add some serialization to the sched_domain load-balance walk Peter Zijlstra
@ 2012-05-01 18:14 ` Peter Zijlstra
  2012-05-01 18:14 ` [RFC][PATCH 4/5] x86: Hard partition cpu topology masks on node boundaries Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2012-05-01 18:14 UTC (permalink / raw)
  To: mingo, pjt, vatsa, suresh.b.siddha, efault
  Cc: linux-kernel, Tejun Heo, Yinghai Lu, x86, Peter Zijlstra

[-- Attachment #1: x86-numa-emulation.patch --]
[-- Type: text/plain, Size: 1592 bytes --]

Allows emulating more interesting NUMA configurations, like a
quad-socket AMD Magny-Cours:

 "numa=fake=8:10,16,16,22,16,22,16,22,
              16,10,22,16,22,16,22,16,
              16,22,10,16,16,22,16,22,
              22,16,16,10,22,16,22,16,
              16,22,16,22,10,16,16,22,
              22,16,22,16,16,10,22,16,
              16,22,16,22,16,22,10,16,
              22,16,22,16,22,16,16,10"

This is a non-fully-connected topology.

Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: x86@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/mm/numa_emulation.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
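
As a side note, here is a standalone (userspace) sketch -- not part of the
patch -- that reads the distance table above and prints, for each unique
distance, the set of nodes every node would group with. It shows why the
topology is not fully connected: at distance <= 16 each node sees a
different, overlapping set of five nodes.

#include <stdio.h>

#define N 8

/* the emulated Magny-Cours node_distance() table from the changelog */
static const int dist[N][N] = {
	{ 10, 16, 16, 22, 16, 22, 16, 22 },
	{ 16, 10, 22, 16, 22, 16, 22, 16 },
	{ 16, 22, 10, 16, 16, 22, 16, 22 },
	{ 22, 16, 16, 10, 22, 16, 22, 16 },
	{ 16, 22, 16, 22, 10, 16, 16, 22 },
	{ 22, 16, 22, 16, 16, 10, 22, 16 },
	{ 16, 22, 16, 22, 16, 22, 10, 16 },
	{ 22, 16, 22, 16, 22, 16, 16, 10 },
};

int main(void)
{
	static const int level[] = { 10, 16, 22 };	/* unique distances */
	int l, i, j;

	for (l = 0; l < 3; l++) {
		printf("distance <= %d:\n", level[l]);
		for (i = 0; i < N; i++) {
			printf("  node %d: {", i);
			for (j = 0; j < N; j++)
				if (dist[i][j] <= level[l])
					printf(" %d", j);
			printf(" }\n");
		}
	}
	return 0;
}

E.g. at distance <= 16, node 0 groups with { 0 1 2 4 6 } while node 1 groups
with { 0 1 3 5 7 }, so no single partition of the nodes covers all groups.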

Index: linux-2.6/arch/x86/mm/numa_emulation.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/numa_emulation.c
+++ linux-2.6/arch/x86/mm/numa_emulation.c
@@ -339,9 +339,11 @@ void __init numa_emulation(struct numa_m
 	} else {
 		unsigned long n;
 
-		n = simple_strtoul(emu_cmdline, NULL, 0);
+		n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
 		ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
 	}
+	if (*emu_cmdline == ':')
+		emu_cmdline++;
 
 	if (ret < 0)
 		goto no_emu;
@@ -418,7 +420,9 @@ void __init numa_emulation(struct numa_m
 			int physj = emu_nid_to_phys[j];
 			int dist;
 
-			if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
+			if (get_option(&emu_cmdline, &dist) == 2)
+				;
+			else if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
 				dist = physi == physj ?
 					LOCAL_DISTANCE : REMOTE_DISTANCE;
 			else




* [RFC][PATCH 4/5] x86: Hard partition cpu topology masks on node boundaries
  2012-05-01 18:14 [RFC][PATCH 0/5] various sched and numa bits Peter Zijlstra
                   ` (2 preceding siblings ...)
  2012-05-01 18:14 ` [RFC][PATCH 3/5] x86: Allow specifying node_distance() for numa=fake Peter Zijlstra
@ 2012-05-01 18:14 ` Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2012-05-01 18:14 UTC (permalink / raw)
  To: mingo, pjt, vatsa, suresh.b.siddha, efault
  Cc: linux-kernel, Tejun Heo, Yinghai Lu, x86, Peter Zijlstra

[-- Attachment #1: x86-numa-emulation-2.patch --]
[-- Type: text/plain, Size: 1417 bytes --]

When using numa=fake=, you can get weird topologies where LLCs can span
nodes and other such nonsense. Cure this by hard-partitioning these
masks on node boundaries.

Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: x86@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/x86/kernel/smpboot.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -337,6 +337,11 @@ void __cpuinit set_cpu_sibling_map(int c
 		for_each_cpu(i, cpu_sibling_setup_mask) {
 			struct cpuinfo_x86 *o = &cpu_data(i);
 
+#ifdef CONFIG_NUMA_EMU
+			if (cpu_to_node(cpu) != cpu_to_node(i))
+				continue;
+#endif
+
 			if (cpu_has(c, X86_FEATURE_TOPOEXT)) {
 				if (c->phys_proc_id == o->phys_proc_id &&
 				    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i) &&
@@ -360,11 +365,17 @@ void __cpuinit set_cpu_sibling_map(int c
 	}
 
 	for_each_cpu(i, cpu_sibling_setup_mask) {
+#ifdef CONFIG_NUMA_EMU
+		if (cpu_to_node(cpu) != cpu_to_node(i))
+			continue;
+#endif
+
 		if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
 		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
 			cpumask_set_cpu(i, cpu_llc_shared_mask(cpu));
 			cpumask_set_cpu(cpu, cpu_llc_shared_mask(i));
 		}
+
 		if (c->phys_proc_id == cpu_data(i).phys_proc_id) {
 			cpumask_set_cpu(i, cpu_core_mask(cpu));
 			cpumask_set_cpu(cpu, cpu_core_mask(i));




* Re: [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group
  2012-05-01 18:14 ` [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group Peter Zijlstra
@ 2012-05-02 10:25   ` Srivatsa Vaddagiri
  2012-05-02 10:31     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Srivatsa Vaddagiri @ 2012-05-02 10:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, pjt, suresh.b.siddha, efault, linux-kernel

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2012-05-01 20:14:31]:

> @@ -3795,12 +3796,11 @@ static inline void update_sg_lb_stats(st
> 
>  		/* Bias balancing toward cpus of our domain */
>  		if (local_group) {
> -			if (idle_cpu(i) && !first_idle_cpu) {
> -				first_idle_cpu = 1;
> +			load = target_load(i, load_idx);
> +			if (load < balance_load || idle_cpu(i)) {
> +				balance_load = load;

Let's say load_idx != 0 (e.g. a busy cpu doing this load balance). In
that case, for an idle cpu, we could return a non-zero load and hence this
would fail to select such an idle cpu? IOW :

		balance_load = 0 iff idle_cpu(i) ??

>  				balance_cpu = i;
>  			}
> -
> -			load = target_load(i, load_idx);
>  		} else {
>  			load = source_load(i, load_idx);
>  			if (load > max_cpu_load) {
> 
> 

- vatsa



* Re: [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group
  2012-05-02 10:25   ` Srivatsa Vaddagiri
@ 2012-05-02 10:31     ` Peter Zijlstra
  2012-05-02 10:34       ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2012-05-02 10:31 UTC (permalink / raw)
  To: Srivatsa Vaddagiri; +Cc: mingo, pjt, suresh.b.siddha, efault, linux-kernel

On Wed, 2012-05-02 at 15:55 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2012-05-01 20:14:31]:
> 
> > @@ -3795,12 +3796,11 @@ static inline void update_sg_lb_stats(st
> > 
> >  		/* Bias balancing toward cpus of our domain */
> >  		if (local_group) {
> > -			if (idle_cpu(i) && !first_idle_cpu) {
> > -				first_idle_cpu = 1;
> > +			load = target_load(i, load_idx);
> > +			if (load < balance_load || idle_cpu(i)) {
> > +				balance_load = load;
> 
> Let's say load_idx != 0 (e.g. a busy cpu doing this load balance). In
> that case, for an idle cpu, we could return a non-zero load and hence this
> would fail to select such an idle cpu? 

Yep, such is the nature of !0 load_idx.

> IOW :
> 
> 		balance_load = 0 iff idle_cpu(i) ??

I think so, even for !0 load_idx, load will only reach zero when we're
idle, just takes longer.
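
(For reference, target_load() currently looks roughly like the below: with a
non-zero load_idx it returns the max of the decaying cpu_load[] history and
the instantaneous weight, so an idle cpu only reports zero once that history
has fully decayed.)

static unsigned long weighted_cpuload(const int cpu)
{
	return cpu_rq(cpu)->load.weight;
}

static unsigned long target_load(int cpu, int type)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long total = weighted_cpuload(cpu);

	/* type == load_idx; 0 means use the instantaneous load only */
	if (type == 0 || !sched_feat(LB_BIAS))
		return total;

	/* high guess: the biased history can lag behind an idle rq */
	return max(rq->cpu_load[type-1], total);
}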


* Re: [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group
  2012-05-02 10:31     ` Peter Zijlstra
@ 2012-05-02 10:34       ` Srivatsa Vaddagiri
  2012-05-04  0:05         ` Suresh Siddha
  0 siblings, 1 reply; 10+ messages in thread
From: Srivatsa Vaddagiri @ 2012-05-02 10:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, pjt, suresh.b.siddha, efault, linux-kernel

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2012-05-02 12:31:30]:

> > IOW :
> > 
> > 		balance_load = 0 iff idle_cpu(i) ??
> 
> I think so, even for !0 load_idx, load will only reach zero when we're
> idle, just takes longer.

Right... so should we force it to select an idle cpu by having
balance_load = 0 for an idle cpu (ignoring what target_load(i, load_idx)
told us as its load)?

- vatsa



* Re: [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group
  2012-05-02 10:34       ` Srivatsa Vaddagiri
@ 2012-05-04  0:05         ` Suresh Siddha
  2012-05-04 16:09           ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Suresh Siddha @ 2012-05-04  0:05 UTC (permalink / raw)
  To: Srivatsa Vaddagiri; +Cc: Peter Zijlstra, mingo, pjt, efault, linux-kernel

On Wed, 2012-05-02 at 16:04 +0530, Srivatsa Vaddagiri wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2012-05-02 12:31:30]:
> 
> > > IOW :
> > > 
> > > 		balance_load = 0 iff idle_cpu(i) ??
> > 
> > I think so, even for !0 load_idx, load will only reach zero when we're
> > idle, just takes longer.
> 
> Right... so should we force it to select an idle cpu by having
> balance_load = 0 for an idle cpu (ignoring what target_load(i, load_idx)
> told us as its load)?

I think Peter is trying to find the least loaded among the idle cpus (in
other words the longest idle cpu ;)

That should be ok, shouldn't it?

thanks,
suresh





* Re: [RFC][PATCH 1/5] sched, fair: Let minimally loaded cpu balance the group
  2012-05-04  0:05         ` Suresh Siddha
@ 2012-05-04 16:09           ` Peter Zijlstra
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2012-05-04 16:09 UTC (permalink / raw)
  To: Suresh Siddha; +Cc: Srivatsa Vaddagiri, mingo, pjt, efault, linux-kernel

On Thu, 2012-05-03 at 17:05 -0700, Suresh Siddha wrote:
> On Wed, 2012-05-02 at 16:04 +0530, Srivatsa Vaddagiri wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2012-05-02 12:31:30]:
> > 
> > > > IOW :
> > > > 
> > > > 		balance_load = 0 iff idle_cpu(i) ??
> > > 
> > > I think so, even for !0 load_idx, load will only reach zero when we're
> > > idle, just takes longer.
> > 
> > Right... so should we force it to select an idle cpu by having
> > balance_load = 0 for an idle cpu (ignoring what target_load(i, load_idx)
> > told us as its load)?
> 
> I think Peter is trying to find the least loaded among the idle cpus (in
> other words the longest idle cpu ;)

Nah, Peter isn't trying to do anything smart like that; he's just trying
to pick the least loaded cpu when they're all busy, or any idle cpu
otherwise.

Afaict the code as it is today is the worst possible choice: always
picking the same (first) cpu will result in that one being the busiest at
all times.

I mean anything will converge (eventually) due to the lower levels
spreading load again, but by pulling to the idlest it should converge
faster.

Picking a random cpu would also work.


