linux-kernel.vger.kernel.org archive mirror
* [patch 2.6.0t4] 1 cpu/node scheduler fix
@ 2003-08-24 17:13 Erich Focht
  2003-08-25  8:13 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Erich Focht @ 2003-08-24 17:13 UTC (permalink / raw)
  To: Andi Kleen, mingo
  Cc: linux-kernel, LSE, Andrew Theurer, Martin J. Bligh, torvalds

[-- Attachment #1: Type: text/plain, Size: 1836 bytes --]

This is the 1 cpu/node fix of the NUMA scheduler, rewritten for the new
cpumask handling. The previous version was a bit too aggressive with
cross-node balancing, so I changed the default timings such that
the behavior is very similar to the old one.

Here is what the patch does:
- Links the frequency of cross-node balances to the number of failed
local balance attempts. This simplifies the code by removing the
overly rigid dependency of cross-node balancing on the timer
interrupts.

- Fixes the 1 CPU/node issue, i.e. eliminates local balance attempts
for nodes which have only one CPU. This can happen on any NUMA
platform (I'm playing around with a 2 CPU/node box that has a flaky
CPU, so I sometimes end up with a node with only one CPU) and is a
major issue on AMD64.

- Makes the cross-node balance frequency tunable via the parameter
NUMA_FACTOR_BONUS. Its default setting is such that the scheduler
behaves as before: a cross-node balance every 5 local node balances on
an idle CPU, and every 2 local node balances on a busy CPU, as sketched
below. This parameter should be tuned for each platform depending on
its NUMA factor.
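
To make the arithmetic explicit, here is a minimal sketch (illustration
only, not part of the patch; the real logic lives in cpus_to_balance()
in the attached diff) of how the default NUMA_FACTOR_BONUS = 3 yields
those thresholds:

/* Illustration: failed local balances tolerated before going cross-node. */
static int node_rebalance_threshold(int cpus_in_node, int cpu_is_busy)
{
	int retries;

	if (cpus_in_node == 1)
		return 0;			/* no local partner: go cross-node immediately */
	retries = 2 + NUMA_FACTOR_BONUS;	/* idle CPU: 2 + 3 = 5 failed local balances */
	if (cpu_is_busy)
		retries >>= 1;			/* busy CPU: 5 >> 1 = 2 failed local balances */
	return retries;
}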

This simple patch is not meant as opposition to Andrew's attempt to
NUMAize the whole scheduler. That one will definitely make NUMA
coders' lives easier, but I fear it is a bit too complex for 2.6.
The attached small incremental change is sufficient to solve the
main problem. Besides, the change to the cross-node scheduling is
compatible with Andrew's scheduler structure. I really don't like the
timer-based cross-node balancing because it is too inflexible (no way
to have different balancing intervals for each node) and I'd really
like to get back to just one single point of entry for load balancing:
the routine load_balance(), no matter whether we balance inside a
timer interrupt or while the CPU is going idle.
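
For illustration, the call structure after this patch boils down to
roughly the following (abridged from the attached diff; only the
locking is kept, everything else is omitted):

/* Tick path: a single load_balance() call; whether it balances locally
 * or across nodes is decided inside, via cpus_to_balance(). */
static void rebalance_tick(runqueue_t *this_rq, int idle)
{
	unsigned long j = jiffies;

	if (!(j % (idle ? IDLE_REBALANCE_TICK : BUSY_REBALANCE_TICK))) {
		spin_lock(&this_rq->lock);
		load_balance(this_rq, idle);
		spin_unlock(&this_rq->lock);
	}
}

/* Idle path: schedule() calls load_balance(rq, 1) when the runqueue
 * runs empty -- the same single entry point. */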

Erich



[-- Attachment #2: 1cpufix-2.6.0t4.patch --]
[-- Type: text/x-diff, Size: 5270 bytes --]

diff -urNp 2.6.0-test4/include/linux/topology.h 2.6.0-test4-1cpuf/include/linux/topology.h
--- 2.6.0-test4/include/linux/topology.h	2003-08-23 01:57:55.000000000 +0200
+++ 2.6.0-test4-1cpuf/include/linux/topology.h	2003-08-23 19:29:15.000000000 +0200
@@ -54,4 +54,13 @@ static inline int __next_node_with_cpus(
 #define for_each_node_with_cpus(node) \
 	for (node = 0; node < numnodes; node = __next_node_with_cpus(node))
 
+#ifndef NUMA_FACTOR_BONUS
+/*
+ * High NUMA_FACTOR_BONUS means rare cross-node load balancing. The default
+ * value of 3 means idle_node_rebalance after 5 (failed) local balances,
+ * busy_node_rebalance after 2 failed local balances. Should be tuned for
+ * each platform in asm/topology.h.
+ */
+#define NUMA_FACTOR_BONUS 3
+#endif
 #endif /* _LINUX_TOPOLOGY_H */
diff -urNp 2.6.0-test4/kernel/sched.c 2.6.0-test4-1cpuf/kernel/sched.c
--- 2.6.0-test4/kernel/sched.c	2003-08-23 01:58:43.000000000 +0200
+++ 2.6.0-test4-1cpuf/kernel/sched.c	2003-08-23 21:07:34.000000000 +0200
@@ -164,6 +164,7 @@ struct runqueue {
 	prio_array_t *active, *expired, arrays[2];
 	int prev_cpu_load[NR_CPUS];
 #ifdef CONFIG_NUMA
+	unsigned int nr_lb_failed;
 	atomic_t *node_nr_running;
 	int prev_node_load[MAX_NUMNODES];
 #endif
@@ -873,6 +874,45 @@ static int find_busiest_node(int this_no
 	return node;
 }
 
+/*
+ * Decide whether the scheduler should balance locally (inside the same node)
+ * or globally depending on the number of failed local balance attempts.
+ * The number of failed local balance attempts depends on the number of cpus
+ * in the current node. In case it's just one, go immediately for global
+ * balancing. On a busy cpu the number of retries is smaller.
+ * NUMA_FACTOR_BONUS can be used to tune the frequency of global load
+ * balancing (in topology.h).
+ */
+static inline cpumask_t cpus_to_balance(int this_cpu, runqueue_t *this_rq)
+{
+	int node, retries, this_node = cpu_to_node(this_cpu);
+
+	if (nr_cpus_node(this_node) == 1) {
+		retries = 0;
+	} else {
+		retries = 2 + NUMA_FACTOR_BONUS;
+		/* less retries for busy CPUs */
+		if (this_rq->curr != this_rq->idle)
+			retries >>= 1;
+	}
+	if (this_rq->nr_lb_failed >= retries) {
+		node = find_busiest_node(this_node);
+		this_rq->nr_lb_failed = 0;
+		if (node >= 0) {
+			cpumask_t cpumask = node_to_cpumask(node);
+			cpu_set(this_cpu, cpumask);
+			return cpumask;
+		}
+	}
+	return node_to_cpumask(this_node);
+}
+
+#else /* !CONFIG_NUMA */
+
+static inline cpumask_t cpus_to_balance(int this_cpu, runqueue_t *this_rq)
+{
+	return cpu_online_map;
+}
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
@@ -977,6 +1017,12 @@ static inline runqueue_t *find_busiest_q
 		busiest = NULL;
 	}
 out:
+#ifdef CONFIG_NUMA
+	if (!busiest)
+		this_rq->nr_lb_failed++;
+	else
+		this_rq->nr_lb_failed = 0;
+#endif
 	return busiest;
 }
 
@@ -1012,7 +1058,7 @@ static inline void pull_task(runqueue_t 
  * We call this with the current runqueue locked,
  * irqs disabled.
  */
-static void load_balance(runqueue_t *this_rq, int idle, cpumask_t cpumask)
+static void load_balance(runqueue_t *this_rq, int idle)
 {
 	int imbalance, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
@@ -1020,7 +1066,8 @@ static void load_balance(runqueue_t *thi
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+				     cpus_to_balance(this_cpu, this_rq));
 	if (!busiest)
 		goto out;
 
@@ -1102,29 +1149,9 @@ out:
  */
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 #define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
-#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
-#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
-
-#ifdef CONFIG_NUMA
-static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
-{
-	int node = find_busiest_node(cpu_to_node(this_cpu));
-
-	if (node >= 0) {
-		cpumask_t cpumask = node_to_cpumask(node);
-		cpu_set(this_cpu, cpumask);
-		spin_lock(&this_rq->lock);
-		load_balance(this_rq, idle, cpumask);
-		spin_unlock(&this_rq->lock);
-	}
-}
-#endif
 
 static void rebalance_tick(runqueue_t *this_rq, int idle)
 {
-#ifdef CONFIG_NUMA
-	int this_cpu = smp_processor_id();
-#endif
 	unsigned long j = jiffies;
 
 	/*
@@ -1136,24 +1163,16 @@ static void rebalance_tick(runqueue_t *t
 	 * are not balanced.)
 	 */
 	if (idle) {
-#ifdef CONFIG_NUMA
-		if (!(j % IDLE_NODE_REBALANCE_TICK))
-			balance_node(this_rq, idle, this_cpu);
-#endif
 		if (!(j % IDLE_REBALANCE_TICK)) {
 			spin_lock(&this_rq->lock);
-			load_balance(this_rq, idle, cpu_to_node_mask(this_cpu));
+			load_balance(this_rq, idle);
 			spin_unlock(&this_rq->lock);
 		}
 		return;
 	}
-#ifdef CONFIG_NUMA
-	if (!(j % BUSY_NODE_REBALANCE_TICK))
-		balance_node(this_rq, idle, this_cpu);
-#endif
 	if (!(j % BUSY_REBALANCE_TICK)) {
 		spin_lock(&this_rq->lock);
-		load_balance(this_rq, idle, cpu_to_node_mask(this_cpu));
+		load_balance(this_rq, idle);
 		spin_unlock(&this_rq->lock);
 	}
 }
@@ -1331,7 +1350,7 @@ need_resched:
 pick_next_task:
 	if (unlikely(!rq->nr_running)) {
 #ifdef CONFIG_SMP
-		load_balance(rq, 1, cpu_to_node_mask(smp_processor_id()));
+		load_balance(rq, 1);
 		if (rq->nr_running)
 			goto pick_next_task;
 #endif


* Re: [patch 2.6.0t4] 1 cpu/node scheduler fix
  2003-08-24 17:13 [patch 2.6.0t4] 1 cpu/node scheduler fix Erich Focht
@ 2003-08-25  8:13 ` Ingo Molnar
  2003-09-02 10:57   ` Erich Focht
  2003-08-25 15:54 ` Andrew Theurer
  2003-08-25 17:38 ` Martin J. Bligh
  2 siblings, 1 reply; 5+ messages in thread
From: Ingo Molnar @ 2003-08-25  8:13 UTC (permalink / raw)
  To: Erich Focht
  Cc: Andi Kleen, linux-kernel, LSE, Andrew Theurer, Martin J. Bligh, torvalds


On Sun, 24 Aug 2003, Erich Focht wrote:

> This simple patch is not meant as opposition to Andrew's attempt to
> NUMAize the whole scheduler. That one will definitely make NUMA coders'
> lives easier, but I fear it is a bit too complex for 2.6. The
> attached small incremental change is sufficient to solve the main
> problem. Besides, the change to the cross-node scheduling is compatible
> with Andrew's scheduler structure. I really don't like the timer-based
> cross-node balancing because it is too inflexible (no way to have
> different balancing intervals for each node) and I'd really like to get
> back to just one single point of entry for load balancing: the routine
> load_balance(), no matter whether we balance inside a timer interrupt or
> while the CPU is going idle.

Your patch clearly simplifies things. Would you mind also extending the
rebalance-frequency based balancing to the SMP scheduler, and seeing what
effect that has? I.e. remove much of the 'tick' component from the SMP
scheduler as well, and make it purely frequency based.

I'm still afraid of balancing artifacts if we lose track of time (which
the tick thing is about, and which cache affinity is a function of), but
maybe not. It would certainly unify things more. If it doesn't work out
then we can still do your stuff for the cross-node balancing only.

	Ingo


* Re: [patch 2.6.0t4] 1 cpu/node scheduler fix
  2003-08-24 17:13 [patch 2.6.0t4] 1 cpu/node scheduler fix Erich Focht
  2003-08-25  8:13 ` Ingo Molnar
@ 2003-08-25 15:54 ` Andrew Theurer
  2003-08-25 17:38 ` Martin J. Bligh
  2 siblings, 0 replies; 5+ messages in thread
From: Andrew Theurer @ 2003-08-25 15:54 UTC (permalink / raw)
  To: Erich Focht, Andi Kleen, mingo
  Cc: linux-kernel, LSE, Martin J. Bligh, torvalds

> This simple patch is not meant as opposition to Andrew's attempt to
> NUMAize the whole scheduler. That one will definitely make NUMA
> coders' lives easier, but I fear it is a bit too complex for 2.6.

I would agree; it's probably too much to change this late in 2.6.  Eventually
(2.7) I think we should revisit this and try for a more unified approach.

> The attached small incremental change is sufficient to solve the
> main problem. Besides, the change to the cross-node scheduling is
> compatible with Andrew's scheduler structure. I really don't like the
> timer-based cross-node balancing because it is too inflexible (no way
> to have different balancing intervals for each node) and I'd really
> like to get back to just one single point of entry for load balancing:
> the routine load_balance(), no matter whether we balance inside a
> timer interrupt or while the CPU is going idle.

Looks good to me.  One other thing your patch fixes: once in a while we
called load_balance in rebalance_tick with the wrong 'idle' value.
Occasionally we would be on both an idle_node and an idle_cpu rebalance
tick; the idle cpu would [possibly] pull a task and become non-idle, and
then we would still call load_balance with idle=1 for the intranode
balance.
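
To make that concrete, the old tick path looked roughly like this
(reconstructed from the code the patch removes, simplified):

	if (idle) {
		if (!(j % IDLE_NODE_REBALANCE_TICK))
			/* may pull a task, so the CPU is not idle anymore */
			balance_node(this_rq, idle, this_cpu);
		if (!(j % IDLE_REBALANCE_TICK)) {
			spin_lock(&this_rq->lock);
			/* still called with idle=1, even if balance_node()
			 * just made this CPU busy */
			load_balance(this_rq, idle, cpu_to_node_mask(this_cpu));
			spin_unlock(&this_rq->lock);
		}
		return;
	}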



* Re: [patch 2.6.0t4] 1 cpu/node scheduler fix
  2003-08-24 17:13 [patch 2.6.0t4] 1 cpu/node scheduler fix Erich Focht
  2003-08-25  8:13 ` Ingo Molnar
  2003-08-25 15:54 ` Andrew Theurer
@ 2003-08-25 17:38 ` Martin J. Bligh
  2 siblings, 0 replies; 5+ messages in thread
From: Martin J. Bligh @ 2003-08-25 17:38 UTC (permalink / raw)
  To: Erich Focht, Andi Kleen, mingo
  Cc: linux-kernel, LSE, Andrew Theurer, torvalds

> This is the 1 cpu/node fix of the NUMA scheduler, rewritten for the new
> cpumask handling. The previous version was a bit too aggressive with
> cross-node balancing, so I changed the default timings such that
> the behavior is very similar to the old one.
> 
> Here is what the patch does:
> - Links the frequency of cross-node balances to the number of failed
> local balance attempts. This simplifies the code by removing the
> overly rigid dependency of cross-node balancing on the timer
> interrupts.
> 
> - Fixes the 1 CPU/node issue, i.e. eliminates local balance attempts
> for nodes which have only one CPU. This can happen on any NUMA
> platform (I'm playing around with a 2 CPU/node box that has a flaky
> CPU, so I sometimes end up with a node with only one CPU) and is a
> major issue on AMD64.
> 
> - Makes the cross-node balance frequency tunable via the parameter
> NUMA_FACTOR_BONUS. Its default setting is such that the scheduler
> behaves as before: a cross-node balance every 5 local node balances on
> an idle CPU, and every 2 local node balances on a busy CPU, as sketched
> below. This parameter should be tuned for each platform depending on
> its NUMA factor.

This seems to clear up the low end stuff I was seeing before - thanks.

Did you (or anyone else) get a chance to test this on AMD? Would
be nice to confirm that's fixed ...

M.



* Re: [patch 2.6.0t4] 1 cpu/node scheduler fix
  2003-08-25  8:13 ` Ingo Molnar
@ 2003-09-02 10:57   ` Erich Focht
  0 siblings, 0 replies; 5+ messages in thread
From: Erich Focht @ 2003-09-02 10:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, linux-kernel, LSE, Andrew Theurer, Martin J. Bligh, torvalds

Hi Ingo,

thanks for the reply and sorry for the late reaction. I've been away
for a week and (unexpectedly) didn't have email/internet connectivity
during this time.

On Monday 25 August 2003 10:13, Ingo Molnar wrote:
> On Sun, 24 Aug 2003, Erich Focht wrote:
> > different balancing intervals for each node) and I'd really like to get
> > back to just one single point of entry for load balancing: the routine
> > load_balance(), no matter whether we balance inside a timer interrupt or
> > while the CPU is going idle.
>
> Your patch clearly simplifies things. Would you mind also extending the
> rebalance-frequency based balancing to the SMP scheduler, and seeing what
> effect that has? I.e. remove much of the 'tick' component from the SMP
> scheduler as well, and make it purely frequency based.

Actually I didn't remove the tick-based load balancing; I just
simplified it and wanted to have only one load_balance() call instead of
separate load_balance(), balance_node(),
load_balance_in_timer_interrupt_when_idle() and
load_balance_in_timer_interrupt_when_busy().

The current change keeps track of the number of failed load_balance
attempts and does cross-node balancing after a certain number of
failed attempts. If I understand you correctly, you'd like to test this
concept at a lower level as well: keep track of the number of
schedules and context switches and call the SMP load balancer after a
certain number of schedules? This could work, though on idle CPUs I'm
more comfortable with the timer tick... The logic of cpu_idle() would
need a change, too. Do you expect an impact on the latency issue?
Would you mind doing these changes in two steps: first the simple NUMA
one, later the radical SMP change?
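
Just so we mean the same thing, a purely hypothetical sketch (the
counter field and the threshold are invented for illustration, this is
not a worked-out proposal):

/* Hypothetical: balance every N schedules instead of every N ticks. */
#define SMP_REBALANCE_SCHEDULES 16	/* made-up threshold */

	/* in schedule(), once a context switch has been decided: */
	if (++rq->nr_sched_since_balance >= SMP_REBALANCE_SCHEDULES) {
		rq->nr_sched_since_balance = 0;
		load_balance(rq, rq->curr == rq->idle);
	}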

> I'm still afraid of balancing artifacts if we lose track of time (which
> the tick thing is about, and which cache affinity is a function of), but
> maybe not. It would certainly unify things more. If it doesn't work out
> then we can still do your stuff for the cross-node balancing only.

With the small fix for NUMA there's no problem, because the
CAN_MIGRATE_TASK macro really compares times/jiffies to get an
idea of the cache coolness. This wouldn't change even with the
elimination of the timer-based load_balance calls.
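
For reference, that check is essentially a jiffies comparison of this
shape (paraphrased, not the literal 2.6.0-test4 macro; the field name
is only indicative):

	/* a task is only worth migrating once it is presumed cache-cold,
	 * i.e. it has not run for at least cache_decay_ticks */
	cache_cold = (jiffies - last_ran) > cache_decay_ticks;

so the migration decision stays time-based no matter how load_balance()
ends up being invoked.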

Regards,
Erich


