linux-kernel.vger.kernel.org archive mirror
* [RFC] Simple NUMA scheduler patch
       [not found] <Pine.LNX.4.44.0209050905180.8086-100000@localhost.localdomain>
@ 2002-10-01 23:55 ` Michael Hohnbaum
  2002-10-02 13:11   ` Christoph Hellwig
  2002-10-02 17:54   ` Erich Focht
  0 siblings, 2 replies; 11+ messages in thread
From: Michael Hohnbaum @ 2002-10-01 23:55 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel; +Cc: Erich Focht

[-- Attachment #1: Type: text/plain, Size: 1120 bytes --]

Attached is a patch which provides a rudimentary NUMA scheduler.
This patch basically does two things:

* at exec() it finds the least loaded CPU to assign a task to;
* at load_balance() (find_busiest_queue() actually) it favors 
  cpus on the same node for taking tasks from.

This has been tested on the IA32-based NUMAQ platform and shows
performance gains for kernbench.  Various microbenchmarks also
show improvements and stickiness of processes to nodes.  Profiles
show that the kernbench performance improvements are directly
attributable to the use of local memory that results from tasks
running on the same node throughout their lifetime.

I will be doing much more testing of this, and likely will be
tweaking some of the algorithms based upon test results.  Any
comments, suggestions, flames are welcome.

The patch applies to 2.5.40 and makes use of the new NUMA topology
API.  This scheduler change should work on other NUMA platforms
with just the definition of the architecture-specific macros in
topology.h.
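
For reference, the macros in question are __cpu_to_node() and
__node_to_cpu_mask(); the patch itself only needs the former.  A
rough sketch of a flat, single-node fallback (not taken from any
posted patch; real per-architecture asm/topology.h definitions will
differ):

	/* flat fallback sketch: every CPU reports node 0 */
	#define __cpu_to_node(cpu)		(0)
	#define __node_to_cpu_mask(node)	(cpu_online_map)

On such a single-node configuration numnodes is 1, so the new
sched_balance_exec() returns without migrating anything.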

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486

[-- Attachment #2: sched40.patch --]
[-- Type: text/plain, Size: 7024 bytes --]

--- clean-2.5.40/kernel/sched.c	Tue Oct  1 13:48:34 2002
+++ linux-2.5.40/kernel/sched.c	Tue Oct  1 13:27:46 2002
@@ -30,6 +30,9 @@
 #include <linux/notifier.h>
 #include <linux/delay.h>
 #include <linux/timer.h>
+#if CONFIG_NUMA
+#include <asm/topology.h>
+#endif
 
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -632,6 +635,121 @@ static inline unsigned int double_lock_b
 	return nr_running;
 }
 
+#if CONFIG_NUMA
+/*
+ * find_busiest_queue - find the busiest runqueue.
+ */
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+{
+	int nr_running, load, max_load_on_node, max_load_off_node, i;
+	runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;
+
+	/*
+	 * We must look at each runqueue to update prev_nr_running.
+	 * As we do so, we find the busiest runqueue on the same
+	 * node, and the busiest runqueue off the node.  After
+	 * determining the busiest from each we first see if the
+	 * busiest runqueue on node meets the imbalance criteria.
+	 * If not, then we check the off queue runqueue.  Note that
+	 * we require a higher level of imbalance for choosing an
+	 * off node runqueue.
+	 *
+	 * For off-node, we only move at most one process, so imbalance
+	 * is always set to one for off-node runqueues.
+	 * 
+	 * We do this lockless to reduce cache-bouncing overhead,
+	 * we re-check the 'best' source CPU later on again, with
+	 * the lock held.
+	 *
+	 * Note that this routine is called with the current runqueue
+	 * locked, and if a runqueue is found (return != NULL), then
+	 * that runqueue is returned locked, also.
+	 *
+	 * We fend off statistical fluctuations in runqueue lengths by
+	 * saving the runqueue length during the previous load-balancing
+	 * operation and using the smaller one the current and saved lengths.
+	 * If a runqueue is long enough for a longer amount of time then
+	 * we recognize it and pull tasks from it.
+	 *
+	 * The 'current runqueue length' is a statistical maximum variable,
+	 * for that one we take the longer one - to avoid fluctuations in
+	 * the other direction. So for a load-balance to happen it needs
+	 * stable long runqueue on the target CPU and stable short runqueue
+	 * on the local runqueue.
+	 *
+	 * We make an exception if this CPU is about to become idle - in
+	 * that case we are less picky about moving a task across CPUs and
+	 * take what can be taken.
+	 */
+	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+		nr_running = this_rq->nr_running;
+	else
+		nr_running = this_rq->prev_nr_running[this_cpu];
+	if (nr_running < 1)
+		nr_running = 1;
+
+	busiest_on_node = NULL;
+	busiest_off_node = NULL;
+	busiest = NULL;
+	max_load_on_node = 1;
+	max_load_off_node = 1;
+	
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+
+		rq_src = cpu_rq(i);
+		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+			load = rq_src->nr_running;
+		else
+			load = this_rq->prev_nr_running[i];
+
+		this_rq->prev_nr_running[i] = rq_src->nr_running;
+
+		if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {
+			if ((load > max_load_on_node) && (rq_src != this_rq)) {
+				busiest_on_node = rq_src;
+				max_load_on_node = load;
+			}
+		} else {
+			if (load > max_load_off_node)  {
+				busiest_off_node = rq_src;
+				max_load_off_node = load;
+			}
+		}
+	}
+	if (busiest_on_node) {
+		/* on node balancing happens if there is > 125% difference */
+		if (idle || ((nr_running*5)/4 < max_load_on_node)) {
+			busiest = busiest_on_node;
+			*imbalance = (max_load_on_node - nr_running) / 2;
+		}
+	}
+	if (busiest_off_node && !busiest) {
+		/* off node balancing requires 200% difference */
+		if (nr_running*2 >= max_load_off_node) 
+			return NULL;
+		busiest = busiest_off_node; 
+		*imbalance = 1;
+	} 
+	if (busiest) {
+		if (busiest == this_rq) {
+			return NULL;
+		}
+		nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+		/*
+		 * Make sure nothing changed since we checked the
+		 * runqueue length.
+		 */
+		if (busiest->nr_running <= nr_running) {
+			spin_unlock(&busiest->lock);
+			busiest = NULL;
+		}
+	}
+	return busiest;
+}
+#else
+
 /*
  * find_busiest_queue - find the busiest runqueue.
  */
@@ -709,6 +827,7 @@ static inline runqueue_t *find_busiest_q
 out:
 	return busiest;
 }
+#endif /* CONFIG_NUMA */
 
 /*
  * pull_task - move a task from a remote runqueue to the local runqueue.
@@ -2104,6 +2223,65 @@ __init int migration_init(void)
 spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
 #endif
 
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  If the current runqueue is short enough
+ * just use it.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, best_cpu;
+	best_cpu = task_cpu(p);
+	
+	minload = cpu_rq(best_cpu)->nr_running;
+	/* if the current runqueue is only 2 long, don't look elsewhere */
+	if (minload <= 2)
+		return best_cpu;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+
+		if (minload > cpu_rq(i)->nr_running) {
+			minload = cpu_rq(i)->nr_running;
+			best_cpu = i;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 extern void init_timers(void);
 
 void __init sched_init(void)
--- clean-2.5.40/fs/exec.c	Tue Oct  1 13:47:58 2002
+++ linux-2.5.40/fs/exec.c	Tue Oct  1 12:55:50 2002
@@ -1001,6 +1001,8 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
--- clean-2.5.40/include/linux/sched.h	Tue Oct  1 13:48:30 2002
+++ linux-2.5.40/include/linux/sched.h	Tue Oct  1 12:55:51 2002
@@ -166,6 +166,11 @@ extern void update_one_process(struct ta
 			       unsigned long system, int cpu);
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+#else
+#define sched_balance_exec() {}
+#endif
 
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX


* Re: [RFC] Simple NUMA scheduler patch
  2002-10-01 23:55 ` [RFC] Simple NUMA scheduler patch Michael Hohnbaum
@ 2002-10-02 13:11   ` Christoph Hellwig
  2002-10-02 18:56     ` Matthew Dobson
  2002-10-02 22:08     ` Michael Hohnbaum
  2002-10-02 17:54   ` Erich Focht
  1 sibling, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2002-10-02 13:11 UTC (permalink / raw)
  To: Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel, Erich Focht

On Tue, Oct 01, 2002 at 04:55:35PM -0700, Michael Hohnbaum wrote:
> --- clean-2.5.40/kernel/sched.c	Tue Oct  1 13:48:34 2002
> +++ linux-2.5.40/kernel/sched.c	Tue Oct  1 13:27:46 2002
> @@ -30,6 +30,9 @@
>  #include <linux/notifier.h>
>  #include <linux/delay.h>
>  #include <linux/timer.h>
> +#if CONFIG_NUMA
> +#include <asm/topology.h>
> +#endif

Please make this include unconditional, okay?

> +/*
> + * find_busiest_queue - find the busiest runqueue.
> + */
> +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
> +{
> +	int nr_running, load, max_load_on_node, max_load_off_node, i;
> +	runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;

Your new find_busiest_queue is 80 or 90% the same as the non-NUMA one.
At least add the #ifdefs only where needed, but as cpu_to_node() optimizes
away for the non-NUMA case I think you could just make it unconditional.

> +		if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {

I think it should be cpu_to_node, not __cpu_to_node.

> +#if CONFIG_NUMA
> +/*
> + * If dest_cpu is allowed for this process, migrate the task to it.
> + * This is accomplished by forcing the cpu_allowed mask to only
> + * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
> + * the cpu_allowed mask is restored.
> + */
> +void sched_migrate_task(task_t *p, int dest_cpu)

Looks like this one could be static?



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-01 23:55 ` [RFC] Simple NUMA scheduler patch Michael Hohnbaum
  2002-10-02 13:11   ` Christoph Hellwig
@ 2002-10-02 17:54   ` Erich Focht
  2002-10-02 18:26     ` Michael Hohnbaum
                       ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Erich Focht @ 2002-10-02 17:54 UTC (permalink / raw)
  To: Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

Hi Michael,

On Wednesday 02 October 2002 01:55, Michael Hohnbaum wrote:
> Attached is a patch which provides a rudimentary NUMA scheduler.
> This patch basically does two things:
>
> * at exec() it finds the least loaded CPU to assign a task to;
> * at load_balance() (find_busiest_queue() actually) it favors
>   cpus on the same node for taking tasks from.

it's a start. But I'm afraid a full solution will need much more code
(which is one of the problems with my NUMA scheduler patch).

The ideas behind your patch are:
1. Do initial load balancing, choose the least loaded CPU at the
beginning of exec().
2. Favor own node for stealing if any CPU on the own node is >25%
more loaded. Otherwise steal from another CPU if that one is >100%
more loaded.

1. is fine but should ideally aim for equal load among nodes. In
the current form I believe that the original load balancer does the
job right after fork() (which must have happened before exec()). As
you changed the original load balancer, you really need this initial
balancing.

2. is ok as it makes it harder to change the node. But again, you don't
aim at equally balanced nodes. And: if the task gets away from the node
on which it got its memory, it has no reason to ever come back to it.
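
To put concrete numbers on those thresholds (the arithmetic comes
straight from the posted sched40.patch): with a local runqueue length
of 4, an on-node runqueue is only stolen from once its length exceeds
(4*5)/4 = 5, i.e. it holds at least 6 tasks, while an off-node
runqueue must be longer than 2*4 = 8 (at least 9 tasks), and even
then *imbalance is set to 1 so at most one task crosses the node
boundary per balance pass.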

For a final solution I believe that we will need targets like:
(a) equally balance nodes
(b) return tasks to the nodes where their memory is
(c) make nodes "sticky" for tasks which have their memory on them,
"repulsive" for other tasks.
But for a first attempt to get the scheduler more NUMA aware all this
might be just too much.

With simple benchmarks you will most probably beat the plain O(1)
scheduler on NUMA if you implement (a) in just 1. and 2. as your node
is already somewhat "sticky". In complicated benchmarks (like a kernel
compile ;-) it could already be too difficult to understand when the
load balancer did what and why...

It would be nice to see some numbers.

Best regards,
Erich



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-02 17:54   ` Erich Focht
@ 2002-10-02 18:26     ` Michael Hohnbaum
  2002-10-02 18:30     ` Michael Hohnbaum
  2002-10-02 20:25     ` Martin J. Bligh
  2 siblings, 0 replies; 11+ messages in thread
From: Michael Hohnbaum @ 2002-10-02 18:26 UTC (permalink / raw)
  To: Erich Focht; +Cc: Ingo Molnar, linux-kernel

On Wed, 2002-10-02 at 10:54, Erich Focht wrote:
> Hi Michael,
> 
> On Wednesday 02 October 2002 01:55, Michael Hohnbaum wrote:
> > Attached is a patch which provides a rudimentary NUMA scheduler.
> > This patch basically does two things:
> >
> > * at exec() it finds the least loaded CPU to assign a task to;
> > * at load_balance() (find_busiest_queue() actually) it favors
> >   cpus on the same node for taking tasks from.
> 
> it's a start. But I'm afraid a full solution will need much more code
> (which is one of the problems with my NUMA scheduler patch).

Yes, I agree that a full solution needs much more; however, I think that
what is in this patch is a good start which can be extended over time.
Making a process sticky to a node is the next feature that I would like
to add.
> 
> The ideas behind your patch are:
> 1. Do initial load balancing, choose the least loaded CPU at the
> beginning of exec().
> 2. Favor own node for stealing if any CPU on the own node is >25%
> more loaded. Otherwise steal from another CPU if that one is >100%
> more loaded.
> 
> 1. is fine but should ideally aim for equal load among nodes. In
> the current form I believe that the original load balancer does the
> job right after fork() (which must have happened before exec()). As
> you changed the original load balancer, you really need this initial
> balancing.

Actually, what is implemented now will, on a loaded system, result in
a reasonable balance between nodes.  On a lightly loaded system tasks
will tend to fill one node before new processes get distributed to
other nodes.  This actually helps keep memory accesses local to the
same node, assuming that the small number of processes is
inter-related.
> 
> 2. is ok as it makes it harder to change the node. But again, you don't
> aim at equally balanced nodes. And: if the task gets away from the node
> on which it got its memory, it has no reason to ever come back to it.

Again, I believe that as a system becomes loaded, the nodes get
balanced.  Running the numa_test that you sent me shows that
once the number of processes is greater than 2*NCPU we get a fairly
equal distribution across nodes.  I've made a few tweaks since I
sent the patch out that help with this (which I'll repost later today).
> 
> For a final solution I believe that we will need targets like:
> (a) equally balance nodes
> (b) return tasks to the nodes where their memory is
> (c) make nodes "sticky" for tasks which have their memory on them,
> "repulsive" for other tasks.
> But for a first attempt to get the scheduler more NUMA aware all this
> might be just too much.

I agree completely on (b) and (c).  (a) is a reasonable goal that I
think this patch already comes fairly close to.
> 
> With simple benchmarks you will most probably beat the plain O(1)
> scheduler on NUMA if you implement (a) in just 1. and 2. as your node
> is already somewhat "sticky". In complicated benchmarks (like a kernel
> compile ;-) it could already be too difficult to understand when the
> load balancer did what and why...
> 
> It would be nice to see some numbers.

Yep.  What sort of numbers would you like?  I've got kernbench numbers
and profile results from 2.5.38-mm1 which I'll put in a followup post.
These show a modest performance benefit, but don't provide detail on
process-to-node mapping.  I also have various test programs you've
sent that provide a good mapping of processes to nodes.

                   Michael

 
> 
> Best regards,
> Erich
> 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-02 17:54   ` Erich Focht
  2002-10-02 18:26     ` Michael Hohnbaum
@ 2002-10-02 18:30     ` Michael Hohnbaum
  2002-10-02 20:25     ` Martin J. Bligh
  2 siblings, 0 replies; 11+ messages in thread
From: Michael Hohnbaum @ 2002-10-02 18:30 UTC (permalink / raw)
  To: Erich Focht; +Cc: Ingo Molnar, linux-kernel

On Wed, 2002-10-02 at 10:54, Erich Focht wrote:

> With simple benchmarks you will most probably beat the plain O(1)
> scheduler on NUMA if you implement (a) in just 1. and 2. as your node
> is already somewhat "sticky". In complicated benchmarks (like a kernel
> compile ;-) it could already be too difficult to understand when the
> load balancer did what and why...
> 
> It would be nice to see some numbers.

Here are kernbench results with profiles and a brief analysis provided
by Martin Bligh:

2.5.38-mm1 + per-cpu hot pages
Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%

2.5.38-mm1 + per-cpu hot pages + sched
Elapsed: 19.528s User: 189.088s System: 40.488s CPU: 1175.6%

Much improved - user, system, and elapsed are all down.

Some diffs - only entries that changed by more than 50 are printed.

diffprofile nosched sched2

827 default_idle
294 .text.lock.file_table
138 get_empty_filp
124 __fput
97 do_softirq
80 schedule
70 strnlen_user
60 atomic_dec_and_lock
51 path_lookup
-54 __generic_copy_to_user
-55 find_get_page
-61 release_pages
-62 d_lookup
-66 do_wp_page
-75 __set_page_dirty_buffers
-86 file_read_actor
-88 pte_alloc_one
-94 free_percpu_page
-124 clear_page_tables
-160 vm_enough_memory
-224 page_remove_rmap
-253 zap_pte_range
-292 alloc_percpu_page
-940 do_anonymous_page
-967 __generic_copy_from_user
-1900 total

As you can see, all the VM operations take a diet, very cool.
(do_anonymous_page does all the page zeroing for new pages).
Head of new profile now looks like this:

83010 default_idle
5194 do_anonymous_page
4207 page_remove_rmap
2306 page_add_rmap
2226 d_lookup
1761 vm_enough_memory
1675 .text.lock.file_table
1480 file_read_actor
1254 get_empty_filp
1113 find_get_page
937 do_no_page
916 __generic_copy_from_user
875 atomic_dec_and_lock
764 do_page_fault
744 zap_pte_range
668 alloc_percpu_page
662 follow_mount
622 __fput
581 do_softirq
554 path_lookup
520 schedule

M.

> 
> Best regards,
> Erich
> 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-02 13:11   ` Christoph Hellwig
@ 2002-10-02 18:56     ` Matthew Dobson
  2002-10-02 22:08     ` Michael Hohnbaum
  1 sibling, 0 replies; 11+ messages in thread
From: Matthew Dobson @ 2002-10-02 18:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Michael Hohnbaum, Ingo Molnar, linux-kernel, Erich Focht

Christoph Hellwig wrote:
> On Tue, Oct 01, 2002 at 04:55:35PM -0700, Michael Hohnbaum wrote:
>>--- clean-2.5.40/kernel/sched.c	Tue Oct  1 13:48:34 2002
>>+++ linux-2.5.40/kernel/sched.c	Tue Oct  1 13:27:46 2002
>>@@ -30,6 +30,9 @@
>> #include <linux/notifier.h>
>> #include <linux/delay.h>
>> #include <linux/timer.h>
>>+#if CONFIG_NUMA
>>+#include <asm/topology.h>
>>+#endif
> 
> Please make this include unconditional, okay?
Agreed...  The topology macros are designed to work for *any* 
architecture, so there's no need to selectively include them.

>>+/*
>>+ * find_busiest_queue - find the busiest runqueue.
>>+ */
>>+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
>>+{
>>+	int nr_running, load, max_load_on_node, max_load_off_node, i;
>>+	runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;
> 
> Your new find_busiest_queue is 80 or 90% the same as the non-NUMA one.
> At least add the #ifdefs only where needed, but as cpu_to_node() optimizes
> away for the non-NUMA case I think you could just make it unconditional.
Looking over the code... I think I agree with Christoph here.  I think 
that most of the new code won't even get touched in the non-NUMA case. 
Of course, let me know if I'm wrong! ;)
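
Concretely (a sketch, assuming a flat fallback in which
__cpu_to_node() expands to the constant 0): the test

	if (__cpu_to_node(i) == __cpu_to_node(this_cpu))

reduces to if (0 == 0), so the off-node else-branch is dead code the
compiler can drop, and only the on-node path remains on non-NUMA
builds.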


>>+		if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {
> 
> I think it should be cpu_to_node, not __cpu_to_node.
Actually, the non-double-underbar versions are not in the kernel...  I 
have a patch for them, though...  They just do some simple bounds/error
checking as wrappers around the double-underbar versions.  As long as
you aren't calling the macros with bizarre values (i.e. as long as
0 <= i < NR_CPUS), the double-underbar versions will work just fine,
and will be mildly quicker.
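
A sketch of what such a checked wrapper might look like (assumptions
only; Matthew's actual patch may well differ):

	/* bounds-check the argument, then fall through to the raw macro */
	#define cpu_to_node(cpu)				\
	({							\
		int __cpu = (cpu);				\
		BUG_ON(__cpu < 0 || __cpu >= NR_CPUS);		\
		__cpu_to_node(__cpu);				\
	})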

Other than that, it looks good to me!

Cheers!

-Matt



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-02 17:54   ` Erich Focht
  2002-10-02 18:26     ` Michael Hohnbaum
  2002-10-02 18:30     ` Michael Hohnbaum
@ 2002-10-02 20:25     ` Martin J. Bligh
  2002-10-05 22:32       ` Erich Focht
  2 siblings, 1 reply; 11+ messages in thread
From: Martin J. Bligh @ 2002-10-02 20:25 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

> it's a start. But I'm afraid a full solution will need much more code
> (which is one of the problems with my NUMA scheduler patch).

Right, but a sequence of smaller patches will be easier to get in.
I see this as a first step towards your larger patch ... if we can
do something simpler like Michael has, with enough view to your
later plans to make them merge together cleanly, I think we have
the best of both worlds ... Erich, what would it take to make this
a first stepping stone towards what you have? Or is it there already?

The fact that Michael's patch seems to have better performance (at
least for the very simple tests I've done) seems to reinforce the
"many small steps" approach in my mind - it's easier to debug and
analyse like that.
 
> The ideas behind your patch are:
> 2. Favor own node for stealing if any CPU on the own node is >25%
> more loaded. Otherwise steal from another CPU if that one is >100%
> more loaded.
> ...
> 2. is ok as it makes it harder to change the node. But again, you don't
> aim at equally balanced nodes. And: if the task gets away from the node
> on which it got its memory, it has no reason to ever come back to it.

I don't think we should aim too hard for exactly equal balancing;
it may well result in small perturbations causing tasks to bounce
between nodes.

> For a final solution I believe that we will need targets like:
> (a) equally balance nodes
> (b) return tasks to the nodes where their memory is
> (c) make nodes "sticky" for tasks which have their memory on them,
> "repulsive" for other tasks.

I'd add:

(d) take into account the RSS when migrating tasks across nodes
    (i.e. stickiness is proportional to the amount of RAM used on each node)
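
A purely hypothetical sketch of (d) - 2.5.40 keeps no per-node RSS
counts, so the p->node_rss[] field below is an invented placeholder,
not something in the posted patches:

	/* hypothetical: percentage of the task's resident pages on 'node' */
	static inline int node_stickiness(task_t *p, int node)
	{
		if (!p->mm || !p->mm->rss)
			return 0;
		return (100 * p->node_rss[node]) / p->mm->rss;
	}

find_busiest_queue() could then demand a correspondingly larger
imbalance before stealing a task that is very sticky to its current
node.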

> But for a first attempt to get the scheduler more NUMA aware all this
> might be just too much.

Agreed ;-)

M.



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-02 13:11   ` Christoph Hellwig
  2002-10-02 18:56     ` Matthew Dobson
@ 2002-10-02 22:08     ` Michael Hohnbaum
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Hohnbaum @ 2002-10-02 22:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ingo Molnar, linux-kernel, Erich Focht

[-- Attachment #1: Type: text/plain, Size: 2282 bytes --]

Christoph,

Thanks for the feedback.  I've made the suggested changes and rolled
a new patch.  I also upped the threshold for moving a task between nodes
based on results from a test Erich provided.  This seems to have helped
both on Erich's test and on kernbench.  Latest kernbench numbers:

Averages of 10 runs each:

        stock-2.5.40      numa_sched-2.5.40a

real    0m34.181s         0m32.533s
user    5m11.986s         4m50.662s
sys     2m11.267s         2m4.677s

I need to test this patch on a non-NUMA box to ensure that the
NUMA changes in find_busiest_queue() don't affect SMP.  That
testing will take place this week.

             Michael

On Wed, 2002-10-02 at 06:11, Christoph Hellwig wrote:
> On Tue, Oct 01, 2002 at 04:55:35PM -0700, Michael Hohnbaum wrote:
> > --- clean-2.5.40/kernel/sched.c	Tue Oct  1 13:48:34 2002
> > +++ linux-2.5.40/kernel/sched.c	Tue Oct  1 13:27:46 2002
> > @@ -30,6 +30,9 @@
> >  #include <linux/notifier.h>
> >  #include <linux/delay.h>
> >  #include <linux/timer.h>
> > +#if CONFIG_NUMA
> > +#include <asm/topology.h>
> > +#endif
> 
> Please make this include unconditional, okay?
> 
> > +/*
> > + * find_busiest_queue - find the busiest runqueue.
> > + */
> > +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
> > +{
> > +	int nr_running, load, max_load_on_node, max_load_off_node, i;
> > +	runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;
> 
> Your new find_busiest_queue is 80 or 90% the same as the non-NUMA one.
> At least add the #ifdefs only where needed, but as cpu_to_node() optimizes
> away for the non-NUMA case I think you could just make it unconditional.
> 
> > +		if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {
> 
> I think it should be cpu_to_node, not __cpu_to_node.
> 
> > +#if CONFIG_NUMA
> > +/*
> > + * If dest_cpu is allowed for this process, migrate the task to it.
> > + * This is accomplished by forcing the cpu_allowed mask to only
> > + * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
> > + * the cpu_allowed mask is restored.
> > + */
> > +void sched_migrate_task(task_t *p, int dest_cpu)
> 
> Looks like this one could be static?
> 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486

[-- Attachment #2: sched40a.patch --]
[-- Type: text/plain, Size: 6960 bytes --]

--- clean-2.5.40/kernel/sched.c	Wed Oct  2 10:41:45 2002
+++ linux-2.5.40/kernel/sched.c	Wed Oct  2 13:27:42 2002
@@ -30,6 +30,7 @@
 #include <linux/notifier.h>
 #include <linux/delay.h>
 #include <linux/timer.h>
+#include <asm/topology.h>
 
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -637,15 +638,35 @@ static inline unsigned int double_lock_b
  */
 static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
+	int nr_running, load, max_load_on_node, max_load_off_node, i;
+	runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;
 
 	/*
-	 * We search all runqueues to find the most busy one.
+	 * This routine is designed to work on NUMA systems.  For
+	 * non-NUMA systems, everything ends up being on the same
+	 * node and most of the NUMA specific logic is optimized
+	 * away by the compiler.
+	 * 
+	 * We must look at each runqueue to update prev_nr_running.
+	 * As we do so, we find the busiest runqueue on the same
+	 * node, and the busiest runqueue off the node.  After
+	 * determining the busiest from each we first see if the
+	 * busiest runqueue on node meets the imbalance criteria.
+	 * If not, then we check the off queue runqueue.  Note that
+	 * we require a higher level of imbalance for choosing an
+	 * off node runqueue.
+	 *
+	 * For off-node, we only move at most one process, so imbalance
+	 * is always set to one for off-node runqueues.
+	 * 
 	 * We do this lockless to reduce cache-bouncing overhead,
 	 * we re-check the 'best' source CPU later on again, with
 	 * the lock held.
 	 *
+	 * Note that this routine is called with the current runqueue
+	 * locked, and if a runqueue is found (return != NULL), then
+	 * that runqueue is returned locked, also.
+	 *
 	 * We fend off statistical fluctuations in runqueue lengths by
 	 * saving the runqueue length during the previous load-balancing
 	 * operation and using the smaller one the current and saved lengths.
@@ -666,9 +687,15 @@ static inline runqueue_t *find_busiest_q
 		nr_running = this_rq->nr_running;
 	else
 		nr_running = this_rq->prev_nr_running[this_cpu];
+	if (nr_running < 1)
+		nr_running = 1;
 
+	busiest_on_node = NULL;
+	busiest_off_node = NULL;
 	busiest = NULL;
-	max_load = 1;
+	max_load_on_node = 1;
+	max_load_off_node = 1;
+	
 	for (i = 0; i < NR_CPUS; i++) {
 		if (!cpu_online(i))
 			continue;
@@ -678,35 +705,49 @@ static inline runqueue_t *find_busiest_q
 			load = rq_src->nr_running;
 		else
 			load = this_rq->prev_nr_running[i];
+
 		this_rq->prev_nr_running[i] = rq_src->nr_running;
 
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+		if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {
+			if ((load > max_load_on_node) && (rq_src != this_rq)) {
+				busiest_on_node = rq_src;
+				max_load_on_node = load;
+			}
+		} else {
+			if (load > max_load_off_node)  {
+				busiest_off_node = rq_src;
+				max_load_off_node = load;
+			}
 		}
 	}
-
-	if (likely(!busiest))
-		goto out;
-
-	*imbalance = (max_load - nr_running) / 2;
-
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
-		goto out;
+	if (busiest_on_node) {
+		/* on node balancing happens if there is > 125% difference */
+		if (idle || ((nr_running*5)/4 < max_load_on_node)) {
+			busiest = busiest_on_node;
+			*imbalance = (max_load_on_node - nr_running) / 2;
+		}
 	}
-
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
+	if (busiest_off_node && !busiest) {
+		/* off node balancing requires 400% difference */
+		if (nr_running*4 >= max_load_off_node) 
+			return NULL;
+		busiest = busiest_off_node; 
+		*imbalance = 1;
+	} 
+	if (busiest) {
+		if (busiest == this_rq) {
+			return NULL;
+		}
+		nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+		/*
+		 * Make sure nothing changed since we checked the
+		 * runqueue length.
+		 */
+		if (busiest->nr_running <= nr_running) {
+			spin_unlock(&busiest->lock);
+			busiest = NULL;
+		}
 	}
-out:
 	return busiest;
 }
 
@@ -2104,6 +2145,65 @@ __init int migration_init(void)
 spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
 #endif
 
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  If the current runqueue is short enough
+ * just use it.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, best_cpu;
+	best_cpu = task_cpu(p);
+	
+	minload = cpu_rq(best_cpu)->nr_running;
+	/* if the current runqueue is only 2 long, don't look elsewhere */
+	if (minload <= 2)
+		return best_cpu;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+
+		if (minload > cpu_rq(i)->nr_running) {
+			minload = cpu_rq(i)->nr_running;
+			best_cpu = i;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 extern void init_timers(void);
 
 void __init sched_init(void)
--- clean-2.5.40/fs/exec.c	Wed Oct  2 10:42:35 2002
+++ linux-2.5.40/fs/exec.c	Tue Oct  1 12:55:50 2002
@@ -1001,6 +1001,8 @@ int do_execve(char * filename, char ** a
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
--- clean-2.5.40/include/linux/sched.h	Wed Oct  2 10:42:22 2002
+++ linux-2.5.40/include/linux/sched.h	Tue Oct  1 12:55:51 2002
@@ -166,6 +166,11 @@ extern void update_one_process(struct ta
 			       unsigned long system, int cpu);
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+#else
+#define sched_balance_exec() {}
+#endif
 
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX


* Re: [RFC] Simple NUMA scheduler patch
  2002-10-02 20:25     ` Martin J. Bligh
@ 2002-10-05 22:32       ` Erich Focht
  2002-10-07 23:37         ` Michael Hohnbaum
  0 siblings, 1 reply; 11+ messages in thread
From: Erich Focht @ 2002-10-05 22:32 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

Hi Martin & Michael,

thanks for the mails and the results.

On Wednesday 02 October 2002 22:25, Martin J. Bligh wrote:
> > it's a start. But I'm afraid a full solution will need much more code
> > (which is one of the problems with my NUMA scheduler patch).
>
> Right, but a sequence of smaller patches will be easier to get in.
> I see this as a first step towards your larger patch ... if we can
> do something simpler like Michael has, with enough view to your
> later plans to make them merge together cleanly, I think we have
> the best of both worlds ... Erich, what would it take to make this
> a first stepping stone towards what you have? Or is it there already?

It would probably save some time if we reuse parts of the code which
is already there (as with sched_migrate_task). I'd like to mention that
the node affine scheduler is already in production use at several HPC
sites and thus pretty well tested. The performance degradation you've
hit on NUMAQ must be specific to the architecture and doesn't hit Azusa
that much (we just don't see that). I think that could result from
the idle tasks having homenode 0; this should be fixed in sched_init.
I'll post some numbers comparing O(1), pooling scheduler, node affine
scheduler and RSS based affinity in a separate email. That should help
to decide on the direction we should move. 

Right now the approaches are far from each other because there is no
concept of looping over the CPUs in a node. I can't imagine that we
can live without that when the patch gets more complex. I suggest using
a sorted list of CPUs (sorted by node number) and a pointer into that
list (what I was calling pool_cpus, pool_ptr):

in topology.h:
#ifdef CONFIG_NUMA_SCHED
int _node_cpus[NR_CPUS];
int node_ptr[MAX_NUMNODES+1];
#define node_cpus(i) _node_cpus[i]
#define loop_over_node(i,n) for(i=node_ptr[n];i<node_ptr[n+1];i++)
#else
#define node_cpus(i) (i)
#define loop_over_node(i,n) for(i=0;i<NR_CPUS;i++)
#endif

Looping over the CPUs in a node would be:
	loop_over_node(i,node) {
		cpu=node_cpus(i);
		...
	}

For non-NUMA systems there's only one node containing all CPUs
and the "cpu=code_cpus(i)" line will be optimized away. The
initialization would simply be:

{
	int n, cpu, ptr;
	unsigned long mask;
	ptr=0;
	for (n=0; n<numnodes; n++) {
		mask = __node_to_cpu_mask(n);
		node_ptr[n] = ptr;
		for (cpu=0; cpu<NR_CPUS; cpu++)
			if (mask  & (1UL << cpu))
				_node_cpus[ptr++] = cpu;
	}
	node_ptr[numnodes]=ptr;
}
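
Applied to Michael's find_busiest_queue(), the on-node scan would then
become roughly the following (just a sketch, not code from either
posted patch):

	/* walk only the CPUs of this_cpu's node */
	loop_over_node(i, __cpu_to_node(this_cpu)) {
		int cpu = node_cpus(i);

		if (!cpu_online(cpu))
			continue;
		rq_src = cpu_rq(cpu);
		/* ... update prev_nr_running[cpu], track max_load_on_node ... */
	}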



> The fact that Michael's patch seems to have better performance (at
> least for the very simple tests I've done) seems to reinforce the
> "many small steps" approach in my mind - it's easier to debug and
> analyse like that.

OK, I'll put the node-balanced initial load balancing on top of
Michael's patch. Let's see how it changes.

> > The ideas behind your patch are:
> > 2. Favor own node for stealing if any CPU on the own node is >25%
> > more loaded. Otherwise steal from another CPU if that one is >100%
> > more loaded.
> > ...
> > 2. is ok as it makes it harder to change the node. But again, you don't
> > aim at equally balanced nodes. And: if the task gets away from the node
> > on which it got its memory, it has no reason to ever come back to it.
>
> I don't think we should aim too hard for exactly equal balancing,
> it may well result in small peturbations causing task bouncing between
> nodes.

No, we shouldn't aim too hard. The hysteresis introduced by the 25%
imbalance necessary to start load balancing (as in Ingo's scheduler)
is enough to protect us from the bouncing. But when running a small
number of tasks you'd like to get the best performance out of the machine.
For me that means the maximum memory bandwidth available for each task,
which you only get if you distribute the tasks equally among the nodes.
For large numbers of tasks you will always have good balance across the
nodes; the O(1) scheduler already enforces that.

> > For a final solution I believe that we will need targets like:
> > (a) equally balance nodes
> > (b) return tasks to the nodes where their memory is
> > (c) make nodes "sticky" for tasks which have their memory on them,
> > "repulsive" for other tasks.
>
> I'd add:
>
> (d) take into account the RSS when migrating tasks across nodes
>     (ie stickiness is proportional to amount of RAM used on nodes)
Yes. Though it sounds like a refinement of (c) ;-)

Best regards,
Erich



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-05 22:32       ` Erich Focht
@ 2002-10-07 23:37         ` Michael Hohnbaum
  2002-10-14 17:19           ` Erich Focht
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Hohnbaum @ 2002-10-07 23:37 UTC (permalink / raw)
  To: Erich Focht; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel

On Sat, 2002-10-05 at 15:32, Erich Focht wrote:
> Hi Martin & Michael,
> 
> thanks for the mails and the results.

You are most welcome.  Thanks for the work on your scheduler and
numa_test.  I want to acknowledge that much of the work that
I have done on my scheduler patch has been inspired by work that
you have already done, and in a few cases taken directly from it.

> It would probably save some time if we reuse parts of the code which
> is already there (as with sched_migrate_task). I'd like to mention that
> the node affine scheduler is already in production use at several HPC
> sites and thus pretty well tested. The performance degradation you've
> hit on NUMAQ must be specific to the architecture and doesn't hit Azusa
> that much (we just don't see that). I think that could result from
> the idle tasks having homenode 0; this should be fixed in sched_init.
> I'll post some numbers comparing O(1), pooling scheduler, node affine
> scheduler and RSS based affinity in a separate email. That should help
> to decide on the direction we should move. 

One other piece to factor in is the workload characteristics - I'm
guessing that Azusa is being used more for scientific workloads, which
tend to be a bit more static and consume large memory bandwidth.
 
> No, we shouldn't aim too hard. The histeresys introduced by the 25%
> imbalance necessary to start load balancing (as in Ingo's scheduler)
> is enough to protect us from the bouncing. But when running a small
> number of tasks you'd like to get the best performance out of the machine.
> For me that means the maximum memory bandwidth available for each task,
> which you only get if you distribute the tasks equally among the nodes.

Depends on the type of job.  Some actually benefit from being on the
same node as other tasks, since locality is more important than bandwidth.
I am seeing some of this - when I get better distribution of load across
nodes, performance goes down for sdet and kernbench.

> For large number of tasks you will always have good balance across the
> nodes, the O(1) scheduler already inforces that.

Yes, the main shortcoming in this case is that it has nothing to keep
a task on the same node.  That is one of the two main goals I'm 
targeting.  (The other being to select the best cpu/node to begin
with so that there will be no need to move a task off node.)

> Best regards,
> Erich
> 

I plan to post another version of my patch in the next day or so along
with numa_test and kernbench results.  I've made a few mods that have
significantly helped the distribution across nodes, but need to buy back
some performance that is lost by the distribution.
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486



* Re: [RFC] Simple NUMA scheduler patch
  2002-10-07 23:37         ` Michael Hohnbaum
@ 2002-10-14 17:19           ` Erich Focht
  0 siblings, 0 replies; 11+ messages in thread
From: Erich Focht @ 2002-10-14 17:19 UTC (permalink / raw)
  To: Michael Hohnbaum; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel

Hi Michael,

On Tuesday 08 October 2002 01:37, Michael Hohnbaum wrote:
> > I'll post some numbers comparing O(1), pooling scheduler, node affine
> > scheduler and RSS based affinity in a separate email. That should help
> > to decide on the direction we should move.
>
> One other piece to factor in is the workload characteristics - I'm
> guessing that Azusa is beng used more for scientific workloads which
> tend to be a bit more static and consumes large memory bandwidth.

Yes, I agree. We're trying to keep HPC tasks on their nodes because we
KNOW that they are memory bandwidth and latency hungry. Therefore I
believe HPC-like jobs are good benchmarks, and easier to set up than
database tests (which can also demand very high bandwidths).

> > machine. For me that means the maximum memory bandwidth available for
> > each task, which you only get if you distribute the tasks equally among
> > the nodes.
>
> Depends on the type of job.  Some actually benefit from being on the
> same node as other tasks as locality is more important than bandwidth.
> I am seeing some of this - when I get better distribution of load across
> nodes, performance goes down for sdet and kernbench.

This is a difficult issue, because we're trying to get higher
performance out of a multitude of benchmarks (ever tried AIM7? It never
execs... so goodbye initial balancing), but we also try to find a
solution that is good for any kind of NUMA machine. Currently we
experiment on two very different architectures:
 - Azusa : remote/local latency ratio 1.6, but no node level cache
 - NUMAQ : remote/local latency ratio 20, additional node level cache.

My current approach to tuning for a particular machine is via the
steal delays; yours is via the load imbalance needed to trigger a
steal from a remote node. Maybe we'll even need more knobs to make
these fit any NUMA machine...

Regards,
Erich



Thread overview: 11+ messages
     [not found] <Pine.LNX.4.44.0209050905180.8086-100000@localhost.localdomain>
2002-10-01 23:55 ` [RFC] Simple NUMA scheduler patch Michael Hohnbaum
2002-10-02 13:11   ` Christoph Hellwig
2002-10-02 18:56     ` Matthew Dobson
2002-10-02 22:08     ` Michael Hohnbaum
2002-10-02 17:54   ` Erich Focht
2002-10-02 18:26     ` Michael Hohnbaum
2002-10-02 18:30     ` Michael Hohnbaum
2002-10-02 20:25     ` Martin J. Bligh
2002-10-05 22:32       ` Erich Focht
2002-10-07 23:37         ` Michael Hohnbaum
2002-10-14 17:19           ` Erich Focht
