* [RFC] Simple NUMA scheduler patch [not found] <Pine.LNX.4.44.0209050905180.8086-100000@localhost.localdomain> @ 2002-10-01 23:55 ` Michael Hohnbaum 2002-10-02 13:11 ` Christoph Hellwig 2002-10-02 17:54 ` Erich Focht 0 siblings, 2 replies; 11+ messages in thread From: Michael Hohnbaum @ 2002-10-01 23:55 UTC (permalink / raw) To: Ingo Molnar, linux-kernel; +Cc: Erich Focht [-- Attachment #1: Type: text/plain, Size: 1120 bytes --] Attached is a patch which provides a rudimentary NUMA scheduler. This patch basically does two things: * at exec() it finds the least loaded CPU to assign a task to; * at load_balance() (find_busiest_queue() actually) it favors cpus on the same node for taking tasks from. This has been tested on the IA32 based NUMAQ platform and shows performance gains for kernbench. Various microbenchmarks also show improvements and stickiness of processes to nodes. Profiles show that the kernbench performance improvements are directly attributable to the use of local memory caused by tasks running on the same node through their lifetime. I will be doing much more testing of this, and likely will be tweaking some of the algorithms based upon test results. Any comments, suggestions, flames are welcome. Patch applies to 2.5.40 and makes use of the new NUMA topology API. This scheduler change should work on other NUMA platforms with just the definition of the architecture specific macros in topology.h. -- Michael Hohnbaum 503-578-5486 hohnbaum@us.ibm.com T/L 775-5486 [-- Attachment #2: sched40.patch --] [-- Type: text/plain, Size: 7024 bytes --] --- clean-2.5.40/kernel/sched.c Tue Oct 1 13:48:34 2002 +++ linux-2.5.40/kernel/sched.c Tue Oct 1 13:27:46 2002 @@ -30,6 +30,9 @@ #include <linux/notifier.h> #include <linux/delay.h> #include <linux/timer.h> +#if CONFIG_NUMA +#include <asm/topology.h> +#endif /* * Convert user-nice values [ -20 ... 0 ... 
19 ] @@ -632,6 +635,121 @@ static inline unsigned int double_lock_b return nr_running; } +#if CONFIG_NUMA +/* + * find_busiest_queue - find the busiest runqueue. + */ +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) +{ + int nr_running, load, max_load_on_node, max_load_off_node, i; + runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src; + + /* + * We must look at each runqueue to update prev_nr_running. + * As we do so, we find the busiest runqueue on the same + * node, and the busiest runqueue off the node. After + * determining the busiest from each we first see if the + * busiest runqueue on node meets the imbalance criteria. + * If not, then we check the off node runqueue. Note that + * we require a higher level of imbalance for choosing an + * off node runqueue. + * + * For off-node, we only move at most one process, so imbalance + * is always set to one for off-node runqueues. + * + * We do this lockless to reduce cache-bouncing overhead, + * we re-check the 'best' source CPU later on again, with + * the lock held. + * + * Note that this routine is called with the current runqueue + * locked, and if a runqueue is found (return != NULL), then + * that runqueue is returned locked, also. + * + * We fend off statistical fluctuations in runqueue lengths by + * saving the runqueue length during the previous load-balancing + * operation and using the smaller one of the current and saved lengths. + * If a runqueue is long enough for a longer amount of time then + * we recognize it and pull tasks from it. + * + * The 'current runqueue length' is a statistical maximum variable, + * for that one we take the longer one - to avoid fluctuations in + * the other direction. So for a load-balance to happen it needs + * stable long runqueue on the target CPU and stable short runqueue + * on the local runqueue.
+ * + * We make an exception if this CPU is about to become idle - in + * that case we are less picky about moving a task across CPUs and + * take what can be taken. + */ + if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu])) + nr_running = this_rq->nr_running; + else + nr_running = this_rq->prev_nr_running[this_cpu]; + if (nr_running < 1) + nr_running = 1; + + busiest_on_node = NULL; + busiest_off_node = NULL; + busiest = NULL; + max_load_on_node = 1; + max_load_off_node = 1; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + + rq_src = cpu_rq(i); + if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) + load = rq_src->nr_running; + else + load = this_rq->prev_nr_running[i]; + + this_rq->prev_nr_running[i] = rq_src->nr_running; + + if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) { + if ((load > max_load_on_node) && (rq_src != this_rq)) { + busiest_on_node = rq_src; + max_load_on_node = load; + } + } else { + if (load > max_load_off_node) { + busiest_off_node = rq_src; + max_load_off_node = load; + } + } + } + if (busiest_on_node) { + /* on node balancing happens if there is > 125% difference */ + if (idle || ((nr_running*5)/4 < max_load_on_node)) { + busiest = busiest_on_node; + *imbalance = (max_load_on_node - nr_running) / 2; + } + } + if (busiest_off_node && !busiest) { + /* off node balancing requires 200% difference */ + if (nr_running*2 >= max_load_off_node) + return NULL; + busiest = busiest_off_node; + *imbalance = 1; + } + if (busiest) { + if (busiest == this_rq) { + return NULL; + } + nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); + /* + * Make sure nothing changed since we checked the + * runqueue length. + */ + if (busiest->nr_running <= nr_running) { + spin_unlock(&busiest->lock); + busiest = NULL; + } + } + return busiest; +} +#else + /* * find_busiest_queue - find the busiest runqueue. 
*/ @@ -709,6 +827,7 @@ static inline runqueue_t *find_busiest_q out: return busiest; } +#endif /* CONFIG_NUMA */ /* * pull_task - move a task from a remote runqueue to the local runqueue. @@ -2104,6 +2223,65 @@ __init int migration_init(void) spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED; #endif +#if CONFIG_NUMA +/* + * If dest_cpu is allowed for this process, migrate the task to it. + * This is accomplished by forcing the cpu_allowed mask to only + * allow dest_cpu, which will force the cpu onto dest_cpu. Then + * the cpu_allowed mask is restored. + */ +void sched_migrate_task(task_t *p, int dest_cpu) +{ + unsigned long old_mask; + + old_mask = p->cpus_allowed; + if (!(old_mask & (1UL << dest_cpu))) + return; + /* force the process onto the specified CPU */ + set_cpus_allowed(p, 1UL << dest_cpu); + + /* restore the cpus allowed mask */ + set_cpus_allowed(p, old_mask); +} + +/* + * Find the least loaded CPU. If the current runqueue is short enough + * just use it. 
+ */ +static int sched_best_cpu(struct task_struct *p) +{ + int i, minload, best_cpu; + best_cpu = task_cpu(p); + + minload = cpu_rq(best_cpu)->nr_running; + /* if the current runqueue is only 2 long, don't look elsewhere */ + if (minload <= 2) + return best_cpu; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + + if (minload > cpu_rq(i)->nr_running) { + minload = cpu_rq(i)->nr_running; + best_cpu = i; + } + } + return best_cpu; +} + +void sched_balance_exec(void) +{ + int new_cpu; + + if (numnodes > 1) { + new_cpu = sched_best_cpu(current); + if (new_cpu != smp_processor_id()) + sched_migrate_task(current, new_cpu); + } +} +#endif /* CONFIG_NUMA */ + extern void init_timers(void); void __init sched_init(void) --- clean-2.5.40/fs/exec.c Tue Oct 1 13:47:58 2002 +++ linux-2.5.40/fs/exec.c Tue Oct 1 12:55:50 2002 @@ -1001,6 +1001,8 @@ int do_execve(char * filename, char ** a int retval; int i; + sched_balance_exec(); + file = open_exec(filename); retval = PTR_ERR(file); --- clean-2.5.40/include/linux/sched.h Tue Oct 1 13:48:30 2002 +++ linux-2.5.40/include/linux/sched.h Tue Oct 1 12:55:51 2002 @@ -166,6 +166,11 @@ extern void update_one_process(struct ta unsigned long system, int cpu); extern void scheduler_tick(int user_tick, int system); extern unsigned long cache_decay_ticks; +#ifdef CONFIG_NUMA +extern void sched_balance_exec(void); +#else +#define sched_balance_exec() {} +#endif #define MAX_SCHEDULE_TIMEOUT LONG_MAX ^ permalink raw reply [flat|nested] 11+ messages in thread
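The two imbalance thresholds in the patch's find_busiest_queue() can be distilled into a few lines. Below is a hypothetical user-space sketch of just the arithmetic; the function names are mine, not the patch's:

```c
#include <assert.h>

/* On-node stealing: the remote queue must be more than 25% longer
 * than ours -- the patch's test is (nr_running*5)/4 < max_load. */
static int steal_on_node(int nr_running, int max_load)
{
    return (nr_running * 5) / 4 < max_load;
}

/* Off-node stealing: the remote queue must be more than twice as
 * long (the patch bails out when nr_running*2 >= max_load_off_node),
 * and then at most one task is moved (*imbalance = 1). */
static int steal_off_node(int nr_running, int max_load)
{
    return nr_running * 2 < max_load;
}
```

Since the patch clamps nr_running to at least 1, even an idle CPU needs an off-node queue of length 3 or more before it will pull a remote task.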
* Re: [RFC] Simple NUMA scheduler patch 2002-10-01 23:55 ` [RFC] Simple NUMA scheduler patch Michael Hohnbaum @ 2002-10-02 13:11 ` Christoph Hellwig 2002-10-02 18:56 ` Matthew Dobson 2002-10-02 22:08 ` Michael Hohnbaum 2002-10-02 17:54 ` Erich Focht 1 sibling, 2 replies; 11+ messages in thread From: Christoph Hellwig @ 2002-10-02 13:11 UTC (permalink / raw) To: Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel, Erich Focht On Tue, Oct 01, 2002 at 04:55:35PM -0700, Michael Hohnbaum wrote: > --- clean-2.5.40/kernel/sched.c Tue Oct 1 13:48:34 2002 > +++ linux-2.5.40/kernel/sched.c Tue Oct 1 13:27:46 2002 > @@ -30,6 +30,9 @@ > #include <linux/notifier.h> > #include <linux/delay.h> > #include <linux/timer.h> > +#if CONFIG_NUMA > +#include <asm/topology.h> > +#endif Please make this include unconditional, okay? > +/* > + * find_busiest_queue - find the busiest runqueue. > + */ > +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) > +{ > + int nr_running, load, max_load_on_node, max_load_off_node, i; > + runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src; Your new find_busiest_queue is 80 or 90% the same as the non-NUMA one. At least add the #ifdefs only where needed, but as cpu_to_node() optimizes away for the non-NUMA case I think you could just make it unconditional. > + if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) { I think it should be cpu_to_node, not __cpu_to_node. > +#if CONFIG_NUMA > +/* > + * If dest_cpu is allowed for this process, migrate the task to it. > + * This is accomplished by forcing the cpu_allowed mask to only > + * allow dest_cpu, which will force the cpu onto dest_cpu. Then > + * the cpu_allowed mask is restored. + */ > +void sched_migrate_task(task_t *p, int dest_cpu) Looks like this one could be static? ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Simple NUMA scheduler patch 2002-10-02 13:11 ` Christoph Hellwig @ 2002-10-02 18:56 ` Matthew Dobson 2002-10-02 22:08 ` Michael Hohnbaum 1 sibling, 0 replies; 11+ messages in thread From: Matthew Dobson @ 2002-10-02 18:56 UTC (permalink / raw) To: Christoph Hellwig Cc: Michael Hohnbaum, Ingo Molnar, linux-kernel, Erich Focht Christoph Hellwig wrote: > On Tue, Oct 01, 2002 at 04:55:35PM -0700, Michael Hohnbaum wrote: >>--- clean-2.5.40/kernel/sched.c Tue Oct 1 13:48:34 2002 >>+++ linux-2.5.40/kernel/sched.c Tue Oct 1 13:27:46 2002 >>@@ -30,6 +30,9 @@ >> #include <linux/notifier.h> >> #include <linux/delay.h> >> #include <linux/timer.h> >>+#if CONFIG_NUMA >>+#include <asm/topology.h> >>+#endif > > Please make this include unconditional, okay? Agreed... The topology macros are designed to work for *any* architecture, so there's no need to selectively include them. >>+/* >>+ * find_busiest_queue - find the busiest runqueue. >>+ */ >>+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) >>+{ >>+ int nr_running, load, max_load_on_node, max_load_off_node, i; >>+ runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src; > > Your new find_busiest_queue is 80 or 90% the same as the non-NUMA one. > At least add the #ifdefs only where needed, but as cpu_to_node() optimizes > away for the non-NUMA case I think you could just make it unconditional. Looking over the code... I think I agree with Christoph here. I think that most of the new code won't even get touched in the non-NUMA case. Of course, let me know if I'm wrong! ;) >>+ if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) { > > I think it should be cpu_to_node, not __cpu_to_node. Actually, the non-double-underbar versions are not in the kernel... I have a patch for them, though... They just do some simple bound/error checking as wrappers around the double-underbar versions.
As long as you aren't calling the macros with bizarre values (i.e. you stay within 0 <= i < NR_CPUS), the double-underbar versions will work just fine, and will be mildly quicker. Other than that, it looks good to me! Cheers! -Matt ^ permalink raw reply [flat|nested] 11+ messages in thread
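The bounds-checking wrappers Matthew mentions are not shown in the thread; the following is only a guess at their shape. The node mapping, NR_CPUS value, and the error return are all assumptions for the sake of a runnable sketch:

```c
#include <assert.h>

#define NR_CPUS 16

/* Raw mapping, e.g. four CPUs per node on a NUMAQ-style quad box
 * (invented for this demo). */
#define __cpu_to_node(cpu)  ((cpu) / 4)

/* Checked wrapper: validate the CPU number first, as Matthew
 * describes, then fall through to the raw double-underbar macro. */
static int cpu_to_node(int cpu)
{
    if (cpu < 0 || cpu >= NR_CPUS)
        return -1;              /* out-of-range CPU: error value */
    return __cpu_to_node(cpu);
}
```

A caller that already knows its index is a valid online CPU, like the loop in find_busiest_queue(), can safely use the raw macro and skip the check.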
* Re: [RFC] Simple NUMA scheduler patch 2002-10-02 13:11 ` Christoph Hellwig 2002-10-02 18:56 ` Matthew Dobson @ 2002-10-02 22:08 ` Michael Hohnbaum 1 sibling, 0 replies; 11+ messages in thread From: Michael Hohnbaum @ 2002-10-02 22:08 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Ingo Molnar, linux-kernel, Erich Focht [-- Attachment #1: Type: text/plain, Size: 2282 bytes --] Christoph, Thanks for the feedback. I've made the suggested changes and rolled a new patch. Also, upped the threshold to move a task between nodes based on results from a test Erich provided. This seems to have helped both on Erich's test and on kernbench. Latest kernbench numbers: Averages of 10 runs each: stock-2.5.40 numa_sched-2.5.40a real 0m34.181s 0m32.533s user 5m11.986s 4m50.662s sys 2m11.267s 2m4.677s I need to test this patch on a non-NUMA box to ensure that the NUMA changes in find_busiest_queue() don't affect SMP. That testing will take place this week. Michael On Wed, 2002-10-02 at 06:11, Christoph Hellwig wrote: > On Tue, Oct 01, 2002 at 04:55:35PM -0700, Michael Hohnbaum wrote: > > --- clean-2.5.40/kernel/sched.c Tue Oct 1 13:48:34 2002 > > +++ linux-2.5.40/kernel/sched.c Tue Oct 1 13:27:46 2002 > > @@ -30,6 +30,9 @@ > > #include <linux/notifier.h> > > #include <linux/delay.h> > > #include <linux/timer.h> > > +#if CONFIG_NUMA > > +#include <asm/topology.h> > > +#endif > > Please make this include unconditional, okay? > > > +/* > > + * find_busiest_queue - find the busiest runqueue. > > + */ > > +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) > > +{ > > + int nr_running, load, max_load_on_node, max_load_off_node, i; > > + runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src; > > Your new find_busiest_queue is 80 or 90% the same as the non-NUMA one. > At least add the #ifdefs only where needed, but as cpu_to_node() optimizes > away for the non-NUMA case I think you could just make it unconditional.
> > > + if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) { > > I think it should be cpu_to_node, not __cpu_to_node. > > > +#if CONFIG_NUMA > > +/* > > + * If dest_cpu is allowed for this process, migrate the task to it. > > + * This is accomplished by forcing the cpu_allowed mask to only > > + * allow dest_cpu, which will force the cpu onto dest_cpu. Then > > + * the cpu_allowed mask is restored. > > + */ > > +void sched_migrate_task(task_t *p, int dest_cpu) > > Looks like this one could be static? > -- Michael Hohnbaum 503-578-5486 hohnbaum@us.ibm.com T/L 775-5486 [-- Attachment #2: sched40a.patch --] [-- Type: text/plain, Size: 6960 bytes --] --- clean-2.5.40/kernel/sched.c Wed Oct 2 10:41:45 2002 +++ linux-2.5.40/kernel/sched.c Wed Oct 2 13:27:42 2002 @@ -30,6 +30,7 @@ #include <linux/notifier.h> #include <linux/delay.h> #include <linux/timer.h> +#include <asm/topology.h> /* * Convert user-nice values [ -20 ... 0 ... 19 ] @@ -637,15 +638,35 @@ static inline unsigned int double_lock_b */ static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) { - int nr_running, load, max_load, i; - runqueue_t *busiest, *rq_src; + int nr_running, load, max_load_on_node, max_load_off_node, i; + runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src; /* - * We search all runqueues to find the most busy one. + * This routine is designed to work on NUMA systems. For + * non-NUMA systems, everything ends up being on the same + * node and most of the NUMA specific logic is optimized + * away by the compiler. + * + * We must look at each runqueue to update prev_nr_running. + * As we do so, we find the busiest runqueue on the same + * node, and the busiest runqueue off the node. After + * determining the busiest from each we first see if the + * busiest runqueue on node meets the imbalance criteria. + * If not, then we check the off queue runqueue. 
Note that + * we require a higher level of imbalance for choosing an + * off node runqueue. + * + * For off-node, we only move at most one process, so imbalance + * is always set to one for off-node runqueues. + * * We do this lockless to reduce cache-bouncing overhead, * we re-check the 'best' source CPU later on again, with * the lock held. * + * Note that this routine is called with the current runqueue + * locked, and if a runqueue is found (return != NULL), then + * that runqueue is returned locked, also. + * * We fend off statistical fluctuations in runqueue lengths by * saving the runqueue length during the previous load-balancing * operation and using the smaller one the current and saved lengths. @@ -666,9 +687,15 @@ static inline runqueue_t *find_busiest_q nr_running = this_rq->nr_running; else nr_running = this_rq->prev_nr_running[this_cpu]; + if (nr_running < 1) + nr_running = 1; + busiest_on_node = NULL; + busiest_off_node = NULL; busiest = NULL; - max_load = 1; + max_load_on_node = 1; + max_load_off_node = 1; + for (i = 0; i < NR_CPUS; i++) { if (!cpu_online(i)) continue; @@ -678,35 +705,49 @@ static inline runqueue_t *find_busiest_q load = rq_src->nr_running; else load = this_rq->prev_nr_running[i]; + this_rq->prev_nr_running[i] = rq_src->nr_running; - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; + if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) { + if ((load > max_load_on_node) && (rq_src != this_rq)) { + busiest_on_node = rq_src; + max_load_on_node = load; + } + } else { + if (load > max_load_off_node) { + busiest_off_node = rq_src; + max_load_off_node = load; + } } } - - if (likely(!busiest)) - goto out; - - *imbalance = (max_load - nr_running) / 2; - - /* It needs an at least ~25% imbalance to trigger balancing. 
*/ - if (!idle && (*imbalance < (max_load + 3)/4)) { - busiest = NULL; - goto out; + if (busiest_on_node) { + /* on node balancing happens if there is > 125% difference */ + if (idle || ((nr_running*5)/4 < max_load_on_node)) { + busiest = busiest_on_node; + *imbalance = (max_load_on_node - nr_running) / 2; + } } - - nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); - /* - * Make sure nothing changed since we checked the - * runqueue length. - */ - if (busiest->nr_running <= nr_running + 1) { - spin_unlock(&busiest->lock); - busiest = NULL; + if (busiest_off_node && !busiest) { + /* off node balancing requires 200% difference */ + if (nr_running*4 >= max_load_off_node) + return NULL; + busiest = busiest_off_node; + *imbalance = 1; + } + if (busiest) { + if (busiest == this_rq) { + return NULL; + } + nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); + /* + * Make sure nothing changed since we checked the + * runqueue length. + */ + if (busiest->nr_running <= nr_running) { + spin_unlock(&busiest->lock); + busiest = NULL; + } } -out: return busiest; } @@ -2104,6 +2145,65 @@ __init int migration_init(void) spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED; #endif +#if CONFIG_NUMA +/* + * If dest_cpu is allowed for this process, migrate the task to it. + * This is accomplished by forcing the cpu_allowed mask to only + * allow dest_cpu, which will force the cpu onto dest_cpu. Then + * the cpu_allowed mask is restored. + */ +static void sched_migrate_task(task_t *p, int dest_cpu) +{ + unsigned long old_mask; + + old_mask = p->cpus_allowed; + if (!(old_mask & (1UL << dest_cpu))) + return; + /* force the process onto the specified CPU */ + set_cpus_allowed(p, 1UL << dest_cpu); + + /* restore the cpus allowed mask */ + set_cpus_allowed(p, old_mask); +} + +/* + * Find the least loaded CPU. If the current runqueue is short enough + * just use it. 
+ */ +static int sched_best_cpu(struct task_struct *p) +{ + int i, minload, best_cpu; + best_cpu = task_cpu(p); + + minload = cpu_rq(best_cpu)->nr_running; + /* if the current runqueue is only 2 long, don't look elsewhere */ + if (minload <= 2) + return best_cpu; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + + if (minload > cpu_rq(i)->nr_running) { + minload = cpu_rq(i)->nr_running; + best_cpu = i; + } + } + return best_cpu; +} + +void sched_balance_exec(void) +{ + int new_cpu; + + if (numnodes > 1) { + new_cpu = sched_best_cpu(current); + if (new_cpu != smp_processor_id()) + sched_migrate_task(current, new_cpu); + } +} +#endif /* CONFIG_NUMA */ + extern void init_timers(void); void __init sched_init(void) --- clean-2.5.40/fs/exec.c Wed Oct 2 10:42:35 2002 +++ linux-2.5.40/fs/exec.c Tue Oct 1 12:55:50 2002 @@ -1001,6 +1001,8 @@ int do_execve(char * filename, char ** a int retval; int i; + sched_balance_exec(); + file = open_exec(filename); retval = PTR_ERR(file); --- clean-2.5.40/include/linux/sched.h Wed Oct 2 10:42:22 2002 +++ linux-2.5.40/include/linux/sched.h Tue Oct 1 12:55:51 2002 @@ -166,6 +166,11 @@ extern void update_one_process(struct ta unsigned long system, int cpu); extern void scheduler_tick(int user_tick, int system); extern unsigned long cache_decay_ticks; +#ifdef CONFIG_NUMA +extern void sched_balance_exec(void); +#else +#define sched_balance_exec() {} +#endif #define MAX_SCHEDULE_TIMEOUT LONG_MAX ^ permalink raw reply [flat|nested] 11+ messages in thread
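Extracted from the patch above, the exec()-time placement policy reduces to: stay put when the local queue is short, otherwise take the least-loaded online CPU. A user-space rendering of that logic, with invented demo arrays standing in for the kernel's runqueues and online map:

```c
#include <assert.h>

#define NR_CPUS 8

/* Hypothetical per-CPU queue lengths and online flags for the demo. */
static int demo_load[NR_CPUS]   = { 5, 1, 3, 0, 2, 6, 4, 2 };
static int demo_online[NR_CPUS] = { 1, 1, 1, 0, 1, 1, 1, 1 };

static int sched_best_cpu(int cur_cpu)
{
    int i, best_cpu = cur_cpu;
    int minload = demo_load[cur_cpu];

    /* if the current runqueue is only 2 long, don't look elsewhere */
    if (minload <= 2)
        return best_cpu;

    for (i = 0; i < NR_CPUS; i++) {
        if (!demo_online[i])
            continue;
        if (minload > demo_load[i]) {
            minload = demo_load[i];
            best_cpu = i;
        }
    }
    return best_cpu;
}
```

In the patch, sched_balance_exec() then migrates current to the chosen CPU via the set_cpus_allowed() round trip whenever it differs from smp_processor_id().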
* Re: [RFC] Simple NUMA scheduler patch 2002-10-01 23:55 ` [RFC] Simple NUMA scheduler patch Michael Hohnbaum 2002-10-02 13:11 ` Christoph Hellwig @ 2002-10-02 17:54 ` Erich Focht 2002-10-02 18:26 ` Michael Hohnbaum ` (2 more replies) 1 sibling, 3 replies; 11+ messages in thread From: Erich Focht @ 2002-10-02 17:54 UTC (permalink / raw) To: Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel Hi Michael, On Wednesday 02 October 2002 01:55, Michael Hohnbaum wrote: > Attached is a patch which provides a rudimentary NUMA scheduler. > This patch basically does two things: > > * at exec() it finds the least loaded CPU to assign a task to; > * at load_balance() (find_busiest_queue() actually) it favors > cpus on the same node for taking tasks from. it's a start. But I'm afraid a full solution will need much more code (which is one of the problems with my NUMA scheduler patch). The ideas behind your patch are: 1. Do initial load balancing, choose the least loaded CPU at the beginning of exec(). 2. Favor own node for stealing if any CPU on the own node is >25% more loaded. Otherwise steal from another CPU if that one is >100% more loaded. 1. is fine but should ideally aim for equal load among nodes. In the current form I believe that the original load balancer does the job right after fork() (which must have happened before exec()). As you changed the original load balancer, you really need this initial balancing. 2. is ok as it makes it harder to change the node. But again, you don't aim at equally balanced nodes. And: if the task gets away from the node on which it got its memory, it has no reason to ever come back to it. For a final solution I believe that we will need targets like: (a) equally balance nodes (b) return tasks to the nodes where their memory is (c) make nodes "sticky" for tasks which have their memory on them, "repulsive" for other tasks. But for a first attempt to get the scheduler more NUMA aware all this might be just too much. 
With simple benchmarks you will most probably beat the plain O(1) scheduler on NUMA if you implement (a) in just 1. and 2. as your node is already somewhat "sticky". In complicated benchmarks (like a kernel compile ;-) it could already be too difficult to understand when the load balancer did what and why... It would be nice to see some numbers. Best regards, Erich ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Simple NUMA scheduler patch 2002-10-02 17:54 ` Erich Focht @ 2002-10-02 18:26 ` Michael Hohnbaum 2002-10-02 18:30 ` Michael Hohnbaum 2002-10-02 20:25 ` Martin J. Bligh 2 siblings, 0 replies; 11+ messages in thread From: Michael Hohnbaum @ 2002-10-02 18:26 UTC (permalink / raw) To: Erich Focht; +Cc: Ingo Molnar, linux-kernel On Wed, 2002-10-02 at 10:54, Erich Focht wrote: > Hi Michael, > > On Wednesday 02 October 2002 01:55, Michael Hohnbaum wrote: > > Attached is a patch which provides a rudimentary NUMA scheduler. > > This patch basically does two things: > > > > * at exec() it finds the least loaded CPU to assign a task to; > > * at load_balance() (find_busiest_queue() actually) it favors > > cpus on the same node for taking tasks from. > > it's a start. But I'm afraid a full solution will need much more code > (which is one of the problems with my NUMA scheduler patch). Yes, I agree that a full solution needs much more, however I think that what is in this patch is a good start which can be extended over time. Making a process sticky to a node is the next feature that I would like to add. > > The ideas behind your patch are: > 1. Do initial load balancing, choose the least loaded CPU at the > beginning of exec(). > 2. Favor own node for stealing if any CPU on the own node is >25% > more loaded. Otherwise steal from another CPU if that one is >100% > more loaded. > > 1. is fine but should ideally aim for equal load among nodes. In > the current form I believe that the original load balancer does the > job right after fork() (which must have happened before exec()). As > you changed the original load balancer, you really need this initial > balancing. Actually, what is implemented now will on a loaded system result in a reasonable balance between nodes. On a lightly loaded system tasks will tend to fill one node before new processes get distributed to other nodes. 
This actually helps keep memory accesses local to the node, assuming the small number of processes are inter-related. > > 2. is ok as it makes it harder to change the node. But again, you don't > aim at equally balanced nodes. And: if the task gets away from the node > on which it got its memory, it has no reason to ever come back to it. Again, I believe that as a system becomes loaded, the nodes get balanced. Running the numa_test that you sent me shows that once the number of processes is greater than 2*NCPU we get a fairly equal distribution across nodes. I've made a few tweaks since I sent the patch out that help on this (which I'll repost later today). > > For a final solution I believe that we will need targets like: > (a) equally balance nodes > (b) return tasks to the nodes where their memory is > (c) make nodes "sticky" for tasks which have their memory on them, > "repulsive" for other tasks. > But for a first attempt to get the scheduler more NUMA aware all this > might be just too much. I agree completely on (b) and (c). (a) is a reasonable goal that I think is fairly close already with this patch. > > With simple benchmarks you will most probably beat the plain O(1) > scheduler on NUMA if you implement (a) in just 1. and 2. as your node > is already somewhat "sticky". In complicated benchmarks (like a kernel > compile ;-) it could already be too difficult to understand when the > load balancer did what and why... > > It would be nice to see some numbers. Yep. What sort of numbers would you like? I've got kernbench numbers and profile results from 2.5.38-mm1 which I'll put in a followup post. These show a modest performance benefit, but don't provide detail on process to node mapping. I also have various test programs you've sent that provide a good mapping of process to nodes. Michael > > Best regards, > Erich > -- Michael Hohnbaum 503-578-5486 hohnbaum@us.ibm.com T/L 775-5486 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Simple NUMA scheduler patch 2002-10-02 17:54 ` Erich Focht 2002-10-02 18:26 ` Michael Hohnbaum @ 2002-10-02 18:30 ` Michael Hohnbaum 2002-10-02 20:25 ` Martin J. Bligh 2 siblings, 0 replies; 11+ messages in thread From: Michael Hohnbaum @ 2002-10-02 18:30 UTC (permalink / raw) To: Erich Focht; +Cc: Ingo Molnar, linux-kernel On Wed, 2002-10-02 at 10:54, Erich Focht wrote: > With simple benchmarks you will most probably beat the plain O(1) > scheduler on NUMA if you implement (a) in just 1. and 2. as your node > is already somewhat "sticky". In complicated benchmarks (like a kernel > compile ;-) it could already be too difficult to understand when the > load balancer did what and why... > > It would be nice to see some numbers. Here are kernbench results with profiles and a brief analysis provided by Martin Bligh: 2.5.38-mm1 + per-cpu hot pages Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4% 2.5.38-mm1 + per-cpu hot pages + sched Elapsed: 19.528s User: 189.088s System: 40.488s CPU: 1175.6% Much improved - user, system, and elapsed are all down. Some diffs, only things over 50 change printed. diffprofile nosched sched2 827 default_idle 294 .text.lock.file_table 138 get_empty_filp 124 __fput 97 do_softirq 80 schedule 70 strnlen_user 60 atomic_dec_and_lock 51 path_lookup -54 __generic_copy_to_user -55 find_get_page -61 release_pages -62 d_lookup -66 do_wp_page -75 __set_page_dirty_buffers -86 file_read_actor -88 pte_alloc_one -94 free_percpu_page -124 clear_page_tables -160 vm_enough_memory -224 page_remove_rmap -253 zap_pte_range -292 alloc_percpu_page -940 do_anonymous_page -967 __generic_copy_from_user -1900 total As you can see, all the VM operations take a diet, very cool. (do_anonymous_page does all the page zeroing for new pages). 
Head of new profile now looks like this: 83010 default_idle 5194 do_anonymous_page 4207 page_remove_rmap 2306 page_add_rmap 2226 d_lookup 1761 vm_enough_memory 1675 .text.lock.file_table 1480 file_read_actor 1254 get_empty_filp 1113 find_get_page 937 do_no_page 916 __generic_copy_from_user 875 atomic_dec_and_lock 764 do_page_fault 744 zap_pte_range 668 alloc_percpu_page 662 follow_mount 622 __fput 581 do_softirq 554 path_lookup 520 schedule M. > > Best regards, > Erich > -- Michael Hohnbaum 503-578-5486 hohnbaum@us.ibm.com T/L 775-5486 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] Simple NUMA scheduler patch 2002-10-02 17:54 ` Erich Focht 2002-10-02 18:26 ` Michael Hohnbaum 2002-10-02 18:30 ` Michael Hohnbaum @ 2002-10-02 20:25 ` Martin J. Bligh 2002-10-05 22:32 ` Erich Focht 2 siblings, 1 reply; 11+ messages in thread From: Martin J. Bligh @ 2002-10-02 20:25 UTC (permalink / raw) To: Erich Focht, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel > it's a start. But I'm afraid a full solution will need much more code > (which is one of the problems with my NUMA scheduler patch). Right, but a sequence of smaller patches will be easier to get in. I see this as a first step towards your larger patch ... if we can do something simpler like Michael has, with enough view to your later plans to make them merge together cleanly, I think we have the best of both worlds ... Erich, what would it take to make this a first stepping stone towards what you have? Or is it there already? The fact that Michael's patch seems to have better performance (at least for the very simple tests I've done) seems to reinforce the "many small steps" approach in my mind - it's easier to debug and analyse like that. > The ideas behind your patch are: > 2. Favor own node for stealing if any CPU on the own node is >25% > more loaded. Otherwise steal from another CPU if that one is >100% > more loaded. > ... > 2. is ok as it makes it harder to change the node. But again, you don't > aim at equally balanced nodes. And: if the task gets away from the node > on which it got its memory, it has no reason to ever come back to it. I don't think we should aim too hard for exactly equal balancing; it may well result in small perturbations causing task bouncing between nodes. > For a final solution I believe that we will need targets like: > (a) equally balance nodes > (b) return tasks to the nodes where their memory is > (c) make nodes "sticky" for tasks which have their memory on them, > "repulsive" for other tasks.
I'd add:

> (d) take into account the RSS when migrating tasks across nodes
> (ie stickiness is proportional to amount of RAM used on nodes)

> But for a first attempt to get the scheduler more NUMA aware all this
> might be just too much.

Agreed ;-)

M.

^ permalink raw reply	[flat|nested] 11+ messages in thread
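Martin's point (d), stickiness proportional to the RAM a task has on each node, could be expressed as a per-node weight. The function name and the 0-100 scaling below are assumptions for illustration, not part of any posted patch:

```c
#include <assert.h>

/*
 * Hypothetical weight for point (d): the share (in percent) of a
 * task's RSS resident on a given node. A balancer could subtract
 * this from the perceived imbalance, making nodes that hold the
 * task's memory "sticky" and other nodes comparatively "repulsive".
 */
static int node_rss_weight(unsigned long node_rss, unsigned long total_rss)
{
	if (total_rss == 0)
		return 0;	/* task has no memory yet: no preference */
	return (int)((node_rss * 100UL) / total_rss);
}
```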
* Re: [RFC] Simple NUMA scheduler patch
  2002-10-02 20:25     ` Martin J. Bligh
@ 2002-10-05 22:32       ` Erich Focht
  2002-10-07 23:37         ` Michael Hohnbaum
  0 siblings, 1 reply; 11+ messages in thread
From: Erich Focht @ 2002-10-05 22:32 UTC (permalink / raw)
To: Martin J. Bligh, Michael Hohnbaum; +Cc: Ingo Molnar, linux-kernel

Hi Martin & Michael,

thanks for the mails and the results.

On Wednesday 02 October 2002 22:25, Martin J. Bligh wrote:
> > it's a start. But I'm afraid a full solution will need much more code
> > (which is one of the problems with my NUMA scheduler patch).
>
> Right, but a sequence of smaller patches will be easier to get in.
> I see this as a first step towards your larger patch ... if we can
> do something simpler like Michael has, with enough view to your
> later plans to make them merge together cleanly, I think we have
> the best of both worlds ... Erich, what would it take to make this
> a first stepping stone towards what you have? Or is it there already?

It would probably save some time if we reuse parts of the code which
is already there (as with sched_migrate_task). I'd like to mention that
the node affine scheduler is already in production use at several HPC
sites and thus pretty well tested. The performance degradation you've
hit on NUMAQ must be specific to the architecture and doesn't hit Azusa
that much (we just don't see that). I think that could result from
the idle tasks having homenode 0, this should be fixed in sched_init.
I'll post some numbers comparing O(1), pooling scheduler, node affine
scheduler and RSS based affinity in a separate email. That should help
to decide on the direction we should move.

Right now the approaches are far from each other because there is no
concept of looping over the CPUs in a node. I can't imagine that we
can live without that when the patch gets more complex.
I suggest using a sorted list of CPUs (sorted by node number) and a
pointer into that list (what I was calling pool_cpus, pool_ptr):

in topology.h:

#ifdef CONFIG_NUMA_SCHED
int _node_cpus[NR_CPUS];
int node_ptr[MAX_NUMNODES+1];
#define node_cpus(i)		_node_cpus[i]
#define loop_over_node(i,n)	for(i=node_ptr[n];i<node_ptr[n+1];i++)
#else
#define node_cpus(i)		(i)
#define loop_over_node(i,n)	for(i=0;i<NR_CPUS;i++)
#endif

Looping over the CPUs of a node would be:

	loop_over_node(i,node) {
		cpu = node_cpus(i);
		...
	}

For non-NUMA systems there's only one node containing all CPUs and the
"cpu = node_cpus(i)" line will be optimized away. The initialization
would simply be:

	{
		int n, cpu, ptr;
		unsigned long mask;

		ptr = 0;
		for (n = 0; n < numnodes; n++) {
			mask = __node_to_cpu_mask(n);
			node_ptr[n] = ptr;
			for (cpu = 0; cpu < NR_CPUS; cpu++)
				if (mask & (1UL << cpu))
					_node_cpus[ptr++] = cpu;
		}
		node_ptr[numnodes] = ptr;
	}

> The fact that Michael's patch seems to have better performance (at
> least for the very simple tests I've done) seems to reinforce the
> "many small steps" approach in my mind - it's easier to debug and
> analyse like that.

OK, I'll put the node-balanced initial load balancing on top of
Michael's patch. Let's see how it changes.

> > The ideas behind your patch are:
> > 2. Favor own node for stealing if any CPU on the own node is >25%
> > more loaded. Otherwise steal from another CPU if that one is >100%
> > more loaded.
> > ...
> > 2. is ok as it makes it harder to change the node. But again, you don't
> > aim at equally balanced nodes. And: if the task gets away from the node
> > on which it got its memory, it has no reason to ever come back to it.
>
> I don't think we should aim too hard for exactly equal balancing,
> it may well result in small perturbations causing task bouncing between
> nodes.

No, we shouldn't aim too hard. The hysteresis introduced by the 25%
imbalance necessary to start load balancing (as in Ingo's scheduler)
is enough to protect us from the bouncing.
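The initialization loop above can be checked in user space with a mocked-up CPU mask. In the sketch below, only the inner loop mirrors Erich's proposal; the fake topology (two CPUs per node), the function names, and the array sizes are scaffolding invented for the test:

```c
#include <assert.h>

#define NR_CPUS		8
#define MAX_NUMNODES	4

static int _node_cpus[NR_CPUS];
static int node_ptr[MAX_NUMNODES + 1];

/* Stand-in for the arch's __node_to_cpu_mask(): two CPUs per node,
 * so node 0 owns cpus 0-1, node 1 owns cpus 2-3, and so on. */
static unsigned long fake_node_to_cpu_mask(int n)
{
	return 3UL << (2 * n);
}

/* User-space rendering of the initialization loop sketched above:
 * fill _node_cpus[] sorted by node, and record each node's starting
 * offset (plus a final sentinel) in node_ptr[]. */
static void build_node_cpu_list(int numnodes)
{
	int n, cpu, ptr = 0;
	unsigned long mask;

	for (n = 0; n < numnodes; n++) {
		mask = fake_node_to_cpu_mask(n);
		node_ptr[n] = ptr;
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (mask & (1UL << cpu))
				_node_cpus[ptr++] = cpu;
	}
	node_ptr[numnodes] = ptr;
}
```

With this layout, iterating a node's CPUs is just a walk from `node_ptr[n]` to `node_ptr[n+1]`, which is what the `loop_over_node()` macro relies on.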
But when running a small number of tasks you'd like to get the best
performance out of the machine. For me that means the maximum memory
bandwidth available for each task, which you only get if you distribute
the tasks equally among the nodes. For large numbers of tasks you will
always have good balance across the nodes; the O(1) scheduler already
enforces that.

> > For a final solution I believe that we will need targets like:
> > (a) equally balance nodes
> > (b) return tasks to the nodes where their memory is
> > (c) make nodes "sticky" for tasks which have their memory on them,
> > "repulsive" for other tasks.
>
> I'd add:
>
> (d) take into account the RSS when migrating tasks across nodes
> (ie stickiness is proportional to amount of RAM used on nodes)

Yes. Though it sounds like a refinement of (c) ;-)

Best regards,
Erich

^ permalink raw reply	[flat|nested] 11+ messages in thread
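Distributing tasks equally among the nodes at exec time, as discussed above, reduces to a minimum search over per-node load sums. The helper below is a hypothetical sketch: the loads are passed in as an array rather than summed from real runqueues, and the name is invented:

```c
#include <assert.h>

/*
 * Pick the node with the smallest aggregate runqueue load. In a real
 * scheduler the loads would be summed over each node's runqueues at
 * exec time; here they are simply passed in. Ties go to the
 * lower-numbered node.
 */
static int least_loaded_node(const int *node_load, int numnodes)
{
	int n, best = 0;

	for (n = 1; n < numnodes; n++)
		if (node_load[n] < node_load[best])
			best = n;
	return best;
}
```

Placing the exec'ing task on the returned node (and then picking its least loaded CPU) is the "find the least loaded CPU at exec()" step described at the top of the thread.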
* Re: [RFC] Simple NUMA scheduler patch
  2002-10-05 22:32       ` Erich Focht
@ 2002-10-07 23:37         ` Michael Hohnbaum
  2002-10-14 17:19           ` Erich Focht
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Hohnbaum @ 2002-10-07 23:37 UTC (permalink / raw)
To: Erich Focht; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel

On Sat, 2002-10-05 at 15:32, Erich Focht wrote:
> Hi Martin & Michael,
>
> thanks for the mails and the results.

You are most welcome. Thanks for the work on your scheduler and
numa_test. I want to acknowledge that much of the work that I have
done on my scheduler patch has been inspired by work that you have
already done, and in a few cases taken directly from it.

> It would probably save some time if we reuse parts of the code which
> is already there (as with sched_migrate_task). I'd like to mention that
> the node affine scheduler is already in production use at several HPC
> sites and thus pretty well tested. The performance degradation you've
> hit on NUMAQ must be specific to the architecture and doesn't hit Azusa
> that much (we just don't see that). I think that could result from
> the idle tasks having homenode 0, this should be fixed in sched_init.
> I'll post some numbers comparing O(1), pooling scheduler, node affine
> scheduler and RSS based affinity in a separate email. That should help
> to decide on the direction we should move.

One other piece to factor in is the workload characteristics - I'm
guessing that Azusa is being used more for scientific workloads, which
tend to be a bit more static and consume large memory bandwidth.

> No, we shouldn't aim too hard. The hysteresis introduced by the 25%
> imbalance necessary to start load balancing (as in Ingo's scheduler)
> is enough to protect us from the bouncing. But when running a small
> number of tasks you'd like to get the best performance out of the
> machine.
> For me that means the maximum memory bandwidth available for each task,
> which you only get if you distribute the tasks equally among the nodes.

Depends on the type of job. Some actually benefit from being on the
same node as other tasks, as locality is more important than bandwidth.
I am seeing some of this - when I get better distribution of load
across nodes, performance goes down for sdet and kernbench.

> For large numbers of tasks you will always have good balance across
> the nodes, the O(1) scheduler already enforces that.

Yes, the main shortcoming in this case is that it has nothing to keep
a task on the same node. That is one of the two main goals I'm
targeting. (The other being to select the best cpu/node to begin with
so that there will be no need to move a task off node.)

> Best regards,
> Erich

I plan to post another version of my patch in the next day or so,
along with numa_test and kernbench results. I've made a few mods that
have significantly helped the distribution across nodes, but need to
buy back some performance that is lost by the distribution.
--
Michael Hohnbaum               503-578-5486
hohnbaum@us.ibm.com            T/L 775-5486

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [RFC] Simple NUMA scheduler patch
  2002-10-07 23:37         ` Michael Hohnbaum
@ 2002-10-14 17:19           ` Erich Focht
  0 siblings, 0 replies; 11+ messages in thread
From: Erich Focht @ 2002-10-14 17:19 UTC (permalink / raw)
To: Michael Hohnbaum; +Cc: Martin J. Bligh, Ingo Molnar, linux-kernel

Hi Michael,

On Tuesday 08 October 2002 01:37, Michael Hohnbaum wrote:
> > I'll post some numbers comparing O(1), pooling scheduler, node affine
> > scheduler and RSS based affinity in a separate email. That should help
> > to decide on the direction we should move.
>
> One other piece to factor in is the workload characteristics - I'm
> guessing that Azusa is being used more for scientific workloads, which
> tend to be a bit more static and consume large memory bandwidth.

Yes, I agree. We're trying to keep HPC tasks on their nodes because we
KNOW that they are memory bandwidth and latency hungry. Therefore I
believe HPC-like jobs are good benchmarks. And easier to set up than
database tests (which can also demand very high bandwidths).

> > machine. For me that means the maximum memory bandwidth available for
> > each task, which you only get if you distribute the tasks equally among
> > the nodes.
>
> Depends on the type of job. Some actually benefit from being on the
> same node as other tasks, as locality is more important than bandwidth.
> I am seeing some of this - when I get better distribution of load across
> nodes, performance goes down for sdet and kernbench.

This is a difficult issue, because we're trying to get higher
performance out of a multitude of benchmarks (ever tried AIM7? It
never execs... so good bye initial balancing). But we also try to find
a solution which is good for any kind of NUMA machine. Currently we
experiment on two very different architectures:
 - Azusa : remote/local latency ratio 1.6, but no node level cache
 - NUMAQ : remote/local latency ratio 20, additional node level cache.
My current approach to tuning for a particular machine is via the
steal delays; yours is via the load imbalance needed to trigger a
steal from a remote node. Maybe we'll even need more knobs to make
these fit any NUMA machine...

Regards,
Erich

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2002-10-14 17:14 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.44.0209050905180.8086-100000@localhost.localdomain>
2002-10-01 23:55 ` [RFC] Simple NUMA scheduler patch Michael Hohnbaum
2002-10-02 13:11   ` Christoph Hellwig
2002-10-02 18:56     ` Matthew Dobson
2002-10-02 22:08       ` Michael Hohnbaum
2002-10-02 17:54   ` Erich Focht
2002-10-02 18:26     ` Michael Hohnbaum
2002-10-02 18:30     ` Michael Hohnbaum
2002-10-02 20:25     ` Martin J. Bligh
2002-10-05 22:32       ` Erich Focht
2002-10-07 23:37         ` Michael Hohnbaum
2002-10-14 17:19           ` Erich Focht