From: Michael Hohnbaum <hohnbaum@us.ibm.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>,
linux-kernel@vger.kernel.org, Erich Focht <efocht@ess.nec.de>
Subject: Re: [RFC] Simple NUMA scheduler patch
Date: 02 Oct 2002 15:08:39 -0700
Message-ID: <1033596520.24872.48.camel@dyn9-47-17-164.beaverton.ibm.com>
In-Reply-To: <20021002141121.C2141@infradead.org>
[-- Attachment #1: Type: text/plain, Size: 2282 bytes --]
Christoph,
Thanks for the feedback. I've made the suggested changes and rolled
a new patch. I also upped the threshold for moving a task between
nodes, based on results from a test Erich provided. This seems to have
helped both on Erich's test and on kernbench. Latest kernbench numbers:
Averages of 10 runs each:

                stock-2.5.40    numa_sched-2.5.40a
real            0m34.181s       0m32.533s
user            5m11.986s       4m50.662s
sys             2m11.267s       2m4.677s

That is roughly a 4.8% improvement in real time, 6.8% in user, and
5.0% in sys.
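For anyone skimming, the heart of the change is the pair of thresholds
in find_busiest_queue(). Stripped of the locking and prev_nr_running
bookkeeping, the decision logic amounts to roughly this (a simplified
sketch with illustrative names, assuming a candidate runqueue exists
on each side; it is not the literal patch code):

        /*
         * Toy model of the balancing decision: returns 1 to balance
         * within the node, 2 to pull a single task across nodes,
         * 0 to leave things alone.
         */
        static int balance_decision(int idle, int nr_running,
                                    int max_load_on_node,
                                    int max_load_off_node)
        {
                if (nr_running < 1)
                        nr_running = 1;
                /* on-node: busiest queue must exceed ~125% of ours */
                if (idle || (nr_running * 5) / 4 < max_load_on_node)
                        return 1;
                /* off-node: demand a far larger imbalance, move one task */
                if (nr_running * 4 < max_load_off_node)
                        return 2;
                return 0;
        }

The asymmetry is deliberate: tasks move freely within a node, but at
most one task is pulled across nodes, and only when the remote queue
is much longer than ours.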
I need to test this patch on a non-NUMA box to ensure that the
NUMA changes in find_busiest_queue() don't affect SMP. That
testing will take place this week.
Michael
On Wed, 2002-10-02 at 06:11, Christoph Hellwig wrote:
> On Tue, Oct 01, 2002 at 04:55:35PM -0700, Michael Hohnbaum wrote:
> > --- clean-2.5.40/kernel/sched.c Tue Oct 1 13:48:34 2002
> > +++ linux-2.5.40/kernel/sched.c Tue Oct 1 13:27:46 2002
> > @@ -30,6 +30,9 @@
> > #include <linux/notifier.h>
> > #include <linux/delay.h>
> > #include <linux/timer.h>
> > +#if CONFIG_NUMA
> > +#include <asm/topology.h>
> > +#endif
>
> Please make this include unconditional, okay?
>
> > +/*
> > + * find_busiest_queue - find the busiest runqueue.
> > + */
> > +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
> > +{
> > + int nr_running, load, max_load_on_node, max_load_off_node, i;
> > + runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;
>
> Your new find_busiest_queue is 80 or 90% the same as the non-NUMA one.
> At least add the #ifdefs only where needed, but as cpu_to_node() optimizes
> away for the non-NUMA case I think you could just make it unconditional.
>
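Right -- on a flat box the topology macros collapse to constants, so
the node test costs nothing. A minimal sketch of the kind of fallback
I mean (illustrative stub only; the real per-arch headers differ):

        #ifndef CONFIG_NUMA
        /* flat machine: every CPU lives on node 0 */
        #define cpu_to_node(cpu)        (0)
        #endif

With that, (cpu_to_node(i) == cpu_to_node(this_cpu)) is compile-time
true and the compiler can discard the whole off-node arm.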
> > + if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {
>
> I think it should be cpu_to_node, not __cpu_to_node.
>
> > +#if CONFIG_NUMA
> > +/*
> > + * If dest_cpu is allowed for this process, migrate the task to it.
> > + * This is accomplished by forcing the cpus_allowed mask to only
> > + * allow dest_cpu, which will force the task onto dest_cpu. Then
> > + * the cpus_allowed mask is restored.
> > + */
> > +void sched_migrate_task(task_t *p, int dest_cpu)
>
> Looks like this one could be static?
>
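It's static in the attached patch. The pin-and-restore pair is the
whole mechanism; callers never touch cpus_allowed directly, they just
do (this is exactly what sched_balance_exec() does in the attached
patch):

        new_cpu = sched_best_cpu(current);
        if (new_cpu != smp_processor_id())
                sched_migrate_task(current, new_cpu);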
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
[-- Attachment #2: sched40a.patch --]
[-- Type: text/plain, Size: 6960 bytes --]
--- clean-2.5.40/kernel/sched.c Wed Oct 2 10:41:45 2002
+++ linux-2.5.40/kernel/sched.c Wed Oct 2 13:27:42 2002
@@ -30,6 +30,7 @@
#include <linux/notifier.h>
#include <linux/delay.h>
#include <linux/timer.h>
+#include <asm/topology.h>
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -637,15 +638,35 @@ static inline unsigned int double_lock_b
*/
static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
{
- int nr_running, load, max_load, i;
- runqueue_t *busiest, *rq_src;
+ int nr_running, load, max_load_on_node, max_load_off_node, i;
+ runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;
/*
- * We search all runqueues to find the most busy one.
+ * This routine is designed to work on NUMA systems. For
+ * non-NUMA systems, everything ends up being on the same
+ * node and most of the NUMA-specific logic is optimized
+ * away by the compiler.
+ *
+ * We must look at each runqueue to update prev_nr_running.
+ * As we do so, we find the busiest runqueue on the same
+ * node, and the busiest runqueue off the node. After
+ * determining the busiest of each, we first see if the
+ * busiest on-node runqueue meets the imbalance criteria.
+ * If not, then we check the off-node runqueue. Note that
+ * we require a higher level of imbalance for choosing an
+ * off-node runqueue.
+ *
+ * For off-node, we move at most one process, so imbalance
+ * is always set to one for off-node runqueues.
+ *
* We do this lockless to reduce cache-bouncing overhead,
* we re-check the 'best' source CPU later on again, with
* the lock held.
*
+ * Note that this routine is called with the current runqueue
+ * locked, and if a runqueue is found (return != NULL), then
+ * that runqueue is returned locked as well.
+ *
* We fend off statistical fluctuations in runqueue lengths by
* saving the runqueue length during the previous load-balancing
* operation and using the smaller one of the current and saved lengths.
@@ -666,9 +687,15 @@ static inline runqueue_t *find_busiest_q
nr_running = this_rq->nr_running;
else
nr_running = this_rq->prev_nr_running[this_cpu];
+ if (nr_running < 1)
+ nr_running = 1;
+ busiest_on_node = NULL;
+ busiest_off_node = NULL;
busiest = NULL;
- max_load = 1;
+ max_load_on_node = 1;
+ max_load_off_node = 1;
+
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
@@ -678,35 +705,49 @@ static inline runqueue_t *find_busiest_q
load = rq_src->nr_running;
else
load = this_rq->prev_nr_running[i];
+
this_rq->prev_nr_running[i] = rq_src->nr_running;
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
+ if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {
+ if ((load > max_load_on_node) && (rq_src != this_rq)) {
+ busiest_on_node = rq_src;
+ max_load_on_node = load;
+ }
+ } else {
+ if (load > max_load_off_node) {
+ busiest_off_node = rq_src;
+ max_load_off_node = load;
+ }
}
}
-
- if (likely(!busiest))
- goto out;
-
- *imbalance = (max_load - nr_running) / 2;
-
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (*imbalance < (max_load + 3)/4)) {
- busiest = NULL;
- goto out;
+ if (busiest_on_node) {
+ /* on-node balancing happens if the busiest queue exceeds 125% of our length */
+ if (idle || ((nr_running*5)/4 < max_load_on_node)) {
+ busiest = busiest_on_node;
+ *imbalance = (max_load_on_node - nr_running) / 2;
+ }
}
-
- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running + 1) {
- spin_unlock(&busiest->lock);
- busiest = NULL;
+ if (busiest_off_node && !busiest) {
+ /* off-node balancing requires the busiest queue to exceed 4x our length */
+ if (nr_running*4 >= max_load_off_node)
+ return NULL;
+ busiest = busiest_off_node;
+ *imbalance = 1;
+ }
+ if (busiest) {
+ if (busiest == this_rq) {
+ return NULL;
+ }
+ nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+ /*
+ * Make sure nothing changed since we checked the
+ * runqueue length.
+ */
+ if (busiest->nr_running <= nr_running) {
+ spin_unlock(&busiest->lock);
+ busiest = NULL;
+ }
}
-out:
return busiest;
}
@@ -2104,6 +2145,65 @@ __init int migration_init(void)
spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
#endif
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpus_allowed mask to only
+ * allow dest_cpu, which will force the task onto dest_cpu. Then
+ * the cpus_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least-loaded CPU. If the current runqueue is short
+ * enough, just use it.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, best_cpu;
+ best_cpu = task_cpu(p);
+
+ minload = cpu_rq(best_cpu)->nr_running;
+ /* if the current runqueue is only 2 long, don't look elsewhere */
+ if (minload <= 2)
+ return best_cpu;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+
+ if (minload > cpu_rq(i)->nr_running) {
+ minload = cpu_rq(i)->nr_running;
+ best_cpu = i;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+#endif /* CONFIG_NUMA */
+
extern void init_timers(void);
void __init sched_init(void)
--- clean-2.5.40/fs/exec.c Wed Oct 2 10:42:35 2002
+++ linux-2.5.40/fs/exec.c Tue Oct 1 12:55:50 2002
@@ -1001,6 +1001,8 @@ int do_execve(char * filename, char ** a
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
--- clean-2.5.40/include/linux/sched.h Wed Oct 2 10:42:22 2002
+++ linux-2.5.40/include/linux/sched.h Tue Oct 1 12:55:51 2002
@@ -166,6 +166,11 @@ extern void update_one_process(struct ta
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+#else
#define sched_balance_exec() do { } while (0)
+#endif
#define MAX_SCHEDULE_TIMEOUT LONG_MAX