* [PATCH 0/13] Multiprocessor CPU scheduler patches
@ 2005-02-24  7:14 Nick Piggin
  2005-02-24  7:16 ` [PATCH 1/13] timestamp fixes Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Hi,

I hope that you can include the following set of CPU scheduler
patches in -mm soon, if you have no other significant performance
work going on.

There are some fairly significant changes, with a few basic aims:
* Improve SMT behaviour
* Improve CMP behaviour, CMP/NUMA scheduling (ie. Opteron)
* Reduce task movement, esp over NUMA nodes.

They are not going to be very well tuned for most workloads at the
moment (unfortunately dbt2/3-pgsql on OSDL isn't working, which is a
good test for this), so hopefully I can address regressions as they
come up.

There are a few problems with the scheduler currently:

Problem #1:
It has _very_ aggressive idle CPU pulling. Not only does it not
really obey imbalances, it is also wrong for eg. an SMT CPU
whose sibling is not idle. The reason this was done is really to
bring down idle time on some workloads (dbt2-pgsql, other
database stuff).

So I address this in the following ways: reduce the special casing
for idle balancing, and revert some of the recent moves toward even
more aggressive balancing.

Then provide a range of averaging levels for CPU "load averages",
and we choose which to use in which situation on a sched-domain
basis. This allows idle balancing to use a more instantaneous value
for calculating load, so idle CPUs need not wait many timer ticks
for the load averages to catch up. This can hopefully solve our
idle time problems.

Also, further moderate "affine wakeups", which can tend to move
most tasks to one CPU on some workloads and cause idle problems.
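
Roughly, the averaging works like this (a stand-alone toy model of the
scheme in patch 8, not the kernel code itself; a SCHED_LOAD_SCALE of
128 is assumed from that patch):

#include <stdio.h>

#define SCHED_LOAD_SCALE 128UL

/* cpu_load[i] is a decaying average with scale 2^i; [0] follows instantly */
static unsigned long cpu_load[3];

static void update_cpu_load(unsigned long nr_running)
{
	unsigned long now = nr_running * SCHED_LOAD_SCALE;
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long scale = 1UL << i;
		unsigned long new_load = now;

		/* round up when rising, so the average can't get stuck low */
		if (new_load > cpu_load[i])
			new_load += scale - 1;
		cpu_load[i] = (cpu_load[i] * (scale - 1) + new_load) / scale;
	}
}

int main(void)
{
	int tick;

	/* two tasks wake up on a previously idle CPU */
	for (tick = 1; tick <= 3; tick++) {
		update_cpu_load(2);
		printf("tick %d: load[0]=%lu load[1]=%lu load[2]=%lu\n",
		       tick, cpu_load[0], cpu_load[1], cpu_load[2]);
	}
	return 0;
}

Idle balancing can then look at the fast-moving (or instantaneous)
level and react within a tick, while busy rebalancing looks at the
slower levels and stays conservative.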

Problem #2:
The second problem is that balance-on-exec is not sched-domains
aware. This means it will tend to (for example) fill up two cores
of a CPU on one socket, then fill up two cores on the next socket,
etc. What we want is to try to spread load evenly across memory
controllers.

So make that sched-domains aware following the same pattern as
find_busiest_group / find_busiest_queue.
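
In rough terms the selection becomes two-level (a toy sketch of the
shape of it, with made-up loads; the real code is the new
find_idlest_group/find_idlest_cpu in patch 11):

#include <stdio.h>
#include <limits.h>

/* two sockets (groups) of two cores each; loads are in arbitrary units */
#define NR_GROUPS	2
#define CPUS_PER_GROUP	2

static unsigned long load[NR_GROUPS][CPUS_PER_GROUP] = {
	{ 3, 0 },	/* socket 0: one busy core, one idle core */
	{ 1, 1 },	/* socket 1: lightly and evenly loaded */
};

/* pick the group (ie. memory controller) with the lowest total load... */
static int find_idlest_group(void)
{
	unsigned long best = ULONG_MAX;
	int g, best_g = 0;

	for (g = 0; g < NR_GROUPS; g++) {
		unsigned long sum = load[g][0] + load[g][1];
		if (sum < best) {
			best = sum;
			best_g = g;
		}
	}
	return best_g;
}

/* ...then the least loaded CPU inside that group */
static int find_idlest_cpu(int g)
{
	return load[g][0] <= load[g][1] ? g * CPUS_PER_GROUP
					: g * CPUS_PER_GROUP + 1;
}

int main(void)
{
	int g = find_idlest_group();

	printf("exec lands on cpu %d (socket %d)\n", find_idlest_cpu(g), g);
	return 0;
}

Here the globally idlest CPU is the idle sibling of the busy core on
socket 0, but the group-first selection prefers the lightly loaded
remote socket. In the kernel this path is gated per domain by
SD_BALANCE_EXEC (and, with the later patches, SD_BALANCE_FORK).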

Problem #3:
Lastly, implement balance-on-fork/clone again. I have come to the
realisation that for NUMA, this is probably the best solution.
Run-cloned-child-last has run out of steam on CMP systems. What
it was supposed to do was provide a period where the child could
be pulled to another CPU before it starts running and allocating
memory. Unfortunately on CMP systems, this tends to just be to the
other sibling.

Also, having such a difference between thread and process creation
was not really ideal, so we balance on all types of fork/clone.
This really helps some things (like STREAM) on CMP Opterons, but
also hurts others, so naturally it is settable per-domain.

Problem #4:
Sched domains isn't very useful to me in its current form. Bring
it up to date with what I've been using. I don't think anyone other
than myself uses it so that should be OK.

Nick





* [PATCH 1/13] timestamp fixes
  2005-02-24  7:14 [PATCH 0/13] Multiprocessor CPU scheduler patches Nick Piggin
@ 2005-02-24  7:16 ` Nick Piggin
  2005-02-24  7:16   ` [PATCH 2/13] improve pinned task handling Nick Piggin
  2005-02-24  7:46   ` [PATCH 1/13] timestamp fixes Ingo Molnar
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

1/13


[-- Attachment #2: sched-timestamp-fixes.patch --]
[-- Type: text/x-patch, Size: 1377 bytes --]

Some fixes for unsynchronised TSCs. A task's timestamp may have been set
by another CPU. Although we try to adjust this correctly with the
timestamp_last_tick field, there is no guarantee this will be exactly right.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:31:25.384986289 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:39.356379395 +1100
@@ -648,6 +648,7 @@
 
 static void recalc_task_prio(task_t *p, unsigned long long now)
 {
+	/* Caller must always ensure 'now >= p->timestamp' */
 	unsigned long long __sleep_time = now - p->timestamp;
 	unsigned long sleep_time;
 
@@ -2703,8 +2704,10 @@
 
 	schedstat_inc(rq, sched_cnt);
 	now = sched_clock();
-	if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
+	if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG)) {
 		run_time = now - prev->timestamp;
-	else
+		if (unlikely((long long)now - prev->timestamp < 0))
+			run_time = 0;
+	} else
 		run_time = NS_MAX_SLEEP_AVG;
 
@@ -2782,6 +2785,8 @@
 
 	if (!rt_task(next) && next->activated > 0) {
 		unsigned long long delta = now - next->timestamp;
+		if (unlikely((long long)now - next->timestamp < 0))
+			delta = 0;
 
 		if (next->activated == 1)
 			delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;


* [PATCH 2/13] improve pinned task handling
  2005-02-24  7:16 ` [PATCH 1/13] timestamp fixes Nick Piggin
@ 2005-02-24  7:16   ` Nick Piggin
  2005-02-24  7:18     ` [PATCH 3/13] rework schedstats Nick Piggin
  2005-02-24  8:04     ` [PATCH 2/13] improve pinned task handling Ingo Molnar
  2005-02-24  7:46   ` [PATCH 1/13] timestamp fixes Ingo Molnar
  1 sibling, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

2/13


[-- Attachment #2: sched-lb-pinned.patch --]
[-- Type: text/x-patch, Size: 4342 bytes --]

John Hawkes explained the problem best:
	A large number of processes that are pinned to a single CPU results
	in every other CPU's load_balance() seeing this overloaded CPU as
	"busiest", yet move_tasks() never finds a task to pull-migrate.  This
	condition occurs during module unload, but can also occur as a
	denial-of-service using sys_sched_setaffinity().  Several hundred
	CPUs performing this fruitless load_balance() will livelock on the
	busiest CPU's runqueue lock.  A smaller number of CPUs will livelock
	if the pinned task count gets high.
	
Expanding slightly on John's patch, this one attempts to work out
whether the balancing failure has been due to too many tasks pinned
on the runqueue. This allows it to be basically invisible to the
regular balancing paths (ie. when there are no pinned tasks). We can
use this extra knowledge to shut down the balancing faster, and ensure
the migration threads don't start running, which is another problem
observed in the wild.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:31:27.042781371 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:39.180401105 +1100
@@ -1650,7 +1650,7 @@
  */
 static inline
 int can_migrate_task(task_t *p, runqueue_t *rq, int this_cpu,
-		     struct sched_domain *sd, enum idle_type idle)
+		     struct sched_domain *sd, enum idle_type idle, int *pinned)
 {
 	/*
 	 * We do not migrate tasks that are:
@@ -1660,8 +1660,10 @@
 	 */
 	if (task_running(rq, p))
 		return 0;
-	if (!cpu_isset(this_cpu, p->cpus_allowed))
+	if (!cpu_isset(this_cpu, p->cpus_allowed)) {
+		(*pinned)++;
 		return 0;
+	}
 
 	/*
 	 * Aggressive migration if:
@@ -1687,11 +1689,11 @@
  */
 static int move_tasks(runqueue_t *this_rq, int this_cpu, runqueue_t *busiest,
 		      unsigned long max_nr_move, struct sched_domain *sd,
-		      enum idle_type idle)
+		      enum idle_type idle, int *all_pinned)
 {
 	prio_array_t *array, *dst_array;
 	struct list_head *head, *curr;
-	int idx, pulled = 0;
+	int idx, pulled = 0, pinned = 0;
 	task_t *tmp;
 
 	if (max_nr_move <= 0 || busiest->nr_running <= 1)
@@ -1735,7 +1737,7 @@
 
 	curr = curr->prev;
 
-	if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle)) {
+	if (!can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) {
 		if (curr != head)
 			goto skip_queue;
 		idx++;
@@ -1761,6 +1763,9 @@
 		goto skip_bitmap;
 	}
 out:
+	*all_pinned = 0;
+	if (unlikely(pinned >= max_nr_move) && pulled == 0)
+		*all_pinned = 1;
 	return pulled;
 }
 
@@ -1935,7 +1940,7 @@
 	struct sched_group *group;
 	runqueue_t *busiest;
 	unsigned long imbalance;
-	int nr_moved;
+	int nr_moved, all_pinned;
 
 	spin_lock(&this_rq->lock);
 	schedstat_inc(sd, lb_cnt[idle]);
@@ -1974,9 +1979,14 @@
 		 */
 		double_lock_balance(this_rq, busiest);
 		nr_moved = move_tasks(this_rq, this_cpu, busiest,
-						imbalance, sd, idle);
+						imbalance, sd, idle,
+						&all_pinned);
 		spin_unlock(&busiest->lock);
 	}
+	/* All tasks on this runqueue were pinned by CPU affinity */
+	if (unlikely(all_pinned))
+		goto out_balanced;
+
 	spin_unlock(&this_rq->lock);
 
 	if (!nr_moved) {
@@ -2041,7 +2051,7 @@
 	struct sched_group *group;
 	runqueue_t *busiest = NULL;
 	unsigned long imbalance;
-	int nr_moved = 0;
+	int nr_moved = 0, all_pinned;
 
 	schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
 	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
@@ -2061,7 +2071,7 @@
 
 	schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance);
 	nr_moved = move_tasks(this_rq, this_cpu, busiest,
-					imbalance, sd, NEWLY_IDLE);
+			imbalance, sd, NEWLY_IDLE, &all_pinned);
 	if (!nr_moved)
 		schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
 
@@ -2119,6 +2129,7 @@
 		cpu_group = sd->groups;
 		do {
 			for_each_cpu_mask(cpu, cpu_group->cpumask) {
+				int all_pinned;
 				if (busiest_rq->nr_running <= 1)
 					/* no more tasks left to move */
 					return;
@@ -2139,7 +2150,7 @@
 				/* move a task from busiest_rq to target_rq */
 				double_lock_balance(busiest_rq, target_rq);
 				if (move_tasks(target_rq, cpu, busiest_rq,
-						1, sd, SCHED_IDLE)) {
+					1, sd, SCHED_IDLE, &all_pinned)) {
 					schedstat_inc(busiest_rq, alb_lost);
 					schedstat_inc(target_rq, alb_gained);
 				} else {


* [PATCH 3/13] rework schedstats
  2005-02-24  7:16   ` [PATCH 2/13] improve pinned task handling Nick Piggin
@ 2005-02-24  7:18     ` Nick Piggin
  2005-02-24  7:19       ` [PATCH 4/13] find_busiest_group fixlets Nick Piggin
                         ` (2 more replies)
  2005-02-24  8:04     ` [PATCH 2/13] improve pinned task handling Ingo Molnar
  1 sibling, 3 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 103 bytes --]

3/13

I have an updated userspace parser for this thing, if you
are still keeping it on your website.


[-- Attachment #2: sched-stat.patch --]
[-- Type: text/x-patch, Size: 9612 bytes --]

Move balancing fields into struct sched_domain, so we can get more
useful results on systems with multiple domains (eg SMT+SMP, CMP+NUMA,
SMP+NUMA, etc).

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2005-02-24 17:31:24.598083557 +1100
+++ linux-2.6/include/linux/sched.h	2005-02-24 17:43:38.161526805 +1100
@@ -462,17 +462,26 @@
 	/* load_balance() stats */
 	unsigned long lb_cnt[MAX_IDLE_TYPES];
 	unsigned long lb_failed[MAX_IDLE_TYPES];
+	unsigned long lb_balanced[MAX_IDLE_TYPES];
 	unsigned long lb_imbalance[MAX_IDLE_TYPES];
+	unsigned long lb_gained[MAX_IDLE_TYPES];
+	unsigned long lb_hot_gained[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyg[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyq[MAX_IDLE_TYPES];
 
+	/* Active load balancing */
+	unsigned long alb_cnt;
+	unsigned long alb_failed;
+	unsigned long alb_pushed;
+
 	/* sched_balance_exec() stats */
 	unsigned long sbe_attempts;
 	unsigned long sbe_pushed;
 
 	/* try_to_wake_up() stats */
-	unsigned long ttwu_wake_affine;
-	unsigned long ttwu_wake_balance;
+	unsigned long ttwu_wake_remote;
+	unsigned long ttwu_move_affine;
+	unsigned long ttwu_move_balance;
 #endif
 };
 
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:31:27.503724395 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:38.983425407 +1100
@@ -246,35 +246,13 @@
 	unsigned long yld_cnt;
 
 	/* schedule() stats */
-	unsigned long sched_noswitch;
 	unsigned long sched_switch;
 	unsigned long sched_cnt;
 	unsigned long sched_goidle;
 
-	/* pull_task() stats */
-	unsigned long pt_gained[MAX_IDLE_TYPES];
-	unsigned long pt_lost[MAX_IDLE_TYPES];
-
-	/* active_load_balance() stats */
-	unsigned long alb_cnt;
-	unsigned long alb_lost;
-	unsigned long alb_gained;
-	unsigned long alb_failed;
-
 	/* try_to_wake_up() stats */
 	unsigned long ttwu_cnt;
-	unsigned long ttwu_attempts;
-	unsigned long ttwu_moved;
-
-	/* wake_up_new_task() stats */
-	unsigned long wunt_cnt;
-	unsigned long wunt_moved;
-
-	/* sched_migrate_task() stats */
-	unsigned long smt_cnt;
-
-	/* sched_balance_exec() stats */
-	unsigned long sbe_cnt;
+	unsigned long ttwu_local;
 #endif
 };
 
@@ -329,7 +307,7 @@
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 10
+#define SCHEDSTAT_VERSION 11
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -347,22 +325,14 @@
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu "
-		    "%lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+		    "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
 		    cpu, rq->yld_both_empty,
-		    rq->yld_act_empty, rq->yld_exp_empty,
-		    rq->yld_cnt, rq->sched_noswitch,
+		    rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt,
 		    rq->sched_switch, rq->sched_cnt, rq->sched_goidle,
-		    rq->alb_cnt, rq->alb_gained, rq->alb_lost,
-		    rq->alb_failed,
-		    rq->ttwu_cnt, rq->ttwu_moved, rq->ttwu_attempts,
-		    rq->wunt_cnt, rq->wunt_moved,
-		    rq->smt_cnt, rq->sbe_cnt, rq->rq_sched_info.cpu_time,
+		    rq->ttwu_cnt, rq->ttwu_local,
+		    rq->rq_sched_info.cpu_time,
 		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt);
 
-		for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; itype++)
-			seq_printf(seq, " %lu %lu", rq->pt_gained[itype],
-						    rq->pt_lost[itype]);
 		seq_printf(seq, "\n");
 
 #ifdef CONFIG_SMP
@@ -373,17 +343,21 @@
 			cpumask_scnprintf(mask_str, NR_CPUS, sd->span);
 			seq_printf(seq, "domain%d %s", dcnt++, mask_str);
 			for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES;
-						itype++) {
-				seq_printf(seq, " %lu %lu %lu %lu %lu",
+					itype++) {
+				seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu",
 				    sd->lb_cnt[itype],
+				    sd->lb_balanced[itype],
 				    sd->lb_failed[itype],
 				    sd->lb_imbalance[itype],
+				    sd->lb_gained[itype],
+				    sd->lb_hot_gained[itype],
 				    sd->lb_nobusyq[itype],
 				    sd->lb_nobusyg[itype]);
 			}
-			seq_printf(seq, " %lu %lu %lu %lu\n",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu\n",
+			    sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
 			    sd->sbe_pushed, sd->sbe_attempts,
-			    sd->ttwu_wake_affine, sd->ttwu_wake_balance);
+			    sd->ttwu_wake_remote, sd->ttwu_move_affine, sd->ttwu_move_balance);
 		}
 #endif
 	}
@@ -996,7 +970,6 @@
 #endif
 
 	rq = task_rq_lock(p, &flags);
-	schedstat_inc(rq, ttwu_cnt);
 	old_state = p->state;
 	if (!(old_state & state))
 		goto out;
@@ -1011,8 +984,21 @@
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
-	new_cpu = cpu;
+#ifdef CONFIG_SCHEDSTATS
+	schedstat_inc(rq, ttwu_cnt);
+	if (cpu == this_cpu) {
+		schedstat_inc(rq, ttwu_local);
+	} else {
+		for_each_domain(this_cpu, sd) {
+			if (cpu_isset(cpu, sd->span)) {
+				schedstat_inc(sd, ttwu_wake_remote);
+				break;
+			}
+		}
+	}
+#endif
 
+	new_cpu = cpu;
 	if (cpu == this_cpu || unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
 		goto out_set_cpu;
 
@@ -1051,7 +1037,7 @@
 			 * in this domain.
 			 */
 			if (cpu_isset(cpu, sd->span)) {
-				schedstat_inc(sd, ttwu_wake_affine);
+				schedstat_inc(sd, ttwu_move_affine);
 				goto out_set_cpu;
 			}
 		} else if ((sd->flags & SD_WAKE_BALANCE) &&
@@ -1061,7 +1047,7 @@
 			 * an imbalance.
 			 */
 			if (cpu_isset(cpu, sd->span)) {
-				schedstat_inc(sd, ttwu_wake_balance);
+				schedstat_inc(sd, ttwu_move_balance);
 				goto out_set_cpu;
 			}
 		}
@@ -1069,10 +1055,8 @@
 
 	new_cpu = cpu; /* Could not wake to this_cpu. Wake to cpu instead */
 out_set_cpu:
-	schedstat_inc(rq, ttwu_attempts);
 	new_cpu = wake_idle(new_cpu, p);
 	if (new_cpu != cpu) {
-		schedstat_inc(rq, ttwu_moved);
 		set_task_cpu(p, new_cpu);
 		task_rq_unlock(rq, &flags);
 		/* might preempt at this point */
@@ -1215,7 +1199,6 @@
 
 	BUG_ON(p->state != TASK_RUNNING);
 
-	schedstat_inc(rq, wunt_cnt);
 	/*
 	 * We decrease the sleep average of forking parents
 	 * and children as well, to keep max-interactive tasks
@@ -1267,7 +1250,6 @@
 		if (TASK_PREEMPTS_CURR(p, rq))
 			resched_task(rq->curr);
 
-		schedstat_inc(rq, wunt_moved);
 		/*
 		 * Parent and child are on different CPUs, now get the
 		 * parent runqueue to update the parent's ->sleep_avg:
@@ -1571,7 +1553,6 @@
 	    || unlikely(cpu_is_offline(dest_cpu)))
 		goto out;
 
-	schedstat_inc(rq, smt_cnt);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
@@ -1599,7 +1580,6 @@
 	struct sched_domain *tmp, *sd = NULL;
 	int new_cpu, this_cpu = get_cpu();
 
-	schedstat_inc(this_rq(), sbe_cnt);
 	/* Prefer the current CPU if there's only this task running */
 	if (this_rq()->nr_running <= 1)
 		goto out;
@@ -1744,13 +1724,10 @@
 		goto skip_bitmap;
 	}
 
-	/*
-	 * Right now, this is the only place pull_task() is called,
-	 * so we can safely collect pull_task() stats here rather than
-	 * inside pull_task().
-	 */
-	schedstat_inc(this_rq, pt_gained[idle]);
-	schedstat_inc(busiest, pt_lost[idle]);
+#ifdef CONFIG_SCHEDSTATS
+	if (task_hot(tmp, busiest->timestamp_last_tick, sd))
+		schedstat_inc(sd, lb_hot_gained[idle]);
+#endif
 
 	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
 	pulled++;
@@ -1766,6 +1743,14 @@
 	*all_pinned = 0;
 	if (unlikely(pinned >= max_nr_move) && pulled == 0)
 		*all_pinned = 1;
+
+	/*
+	 * Right now, this is the only place pull_task() is called,
+	 * so we can safely collect pull_task() stats here rather than
+	 * inside pull_task().
+	 */
+	schedstat_add(sd, lb_gained[idle], pulled);
+
 	return pulled;
 }
 
@@ -2031,6 +2016,8 @@
 out_balanced:
 	spin_unlock(&this_rq->lock);
 
+	schedstat_inc(sd, lb_balanced[idle]);
+
 	/* tune up the balancing interval */
 	if (sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
@@ -2056,12 +2043,14 @@
 	schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
 	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
 	if (!group) {
+		schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
 		schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
 		goto out;
 	}
 
 	busiest = find_busiest_queue(group);
 	if (!busiest || busiest == this_rq) {
+		schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
 		goto out;
 	}
@@ -2115,7 +2104,6 @@
 	cpumask_t visited_cpus;
 	int cpu;
 
-	schedstat_inc(busiest_rq, alb_cnt);
 	/*
 	 * Search for suitable CPUs to push tasks to in successively higher
 	 * domains with SD_LOAD_BALANCE set.
@@ -2126,6 +2114,8 @@
 			/* no more domains to search */
 			break;
 
+		schedstat_inc(sd, alb_cnt);
+
 		cpu_group = sd->groups;
 		do {
 			for_each_cpu_mask(cpu, cpu_group->cpumask) {
@@ -2151,10 +2141,9 @@
 				double_lock_balance(busiest_rq, target_rq);
 				if (move_tasks(target_rq, cpu, busiest_rq,
 					1, sd, SCHED_IDLE, &all_pinned)) {
-					schedstat_inc(busiest_rq, alb_lost);
-					schedstat_inc(target_rq, alb_gained);
+					schedstat_inc(sd, alb_pushed);
 				} else {
-					schedstat_inc(busiest_rq, alb_failed);
+					schedstat_inc(sd, alb_failed);
 				}
 				spin_unlock(&target_rq->lock);
 			}
@@ -2787,8 +2776,7 @@
 		array = rq->active;
 		rq->expired_timestamp = 0;
 		rq->best_expired_prio = MAX_PRIO;
-	} else
-		schedstat_inc(rq, sched_noswitch);
+	}
 
 	idx = sched_find_first_bit(array->bitmap);
 	queue = array->queue + idx;


* [PATCH 4/13] find_busiest_group fixlets
  2005-02-24  7:18     ` [PATCH 3/13] rework schedstats Nick Piggin
@ 2005-02-24  7:19       ` Nick Piggin
  2005-02-24  7:20         ` [PATCH 5/13] find_busiest_group cleanup Nick Piggin
  2005-02-24  8:36         ` [PATCH 4/13] find_busiest_group fixlets Ingo Molnar
  2005-02-24  8:07       ` [PATCH 3/13] rework schedstats Ingo Molnar
  2005-02-25 10:50       ` Rick Lindsley
  2 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

4/13


[-- Attachment #2: sched-balance-fix.patch --]
[-- Type: text/x-patch, Size: 1964 bytes --]

Fix up a few small warts in the periodic multiprocessor rebalancing
code.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:31:28.431609701 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:38.806447240 +1100
@@ -1830,13 +1830,12 @@
 	 * by pulling tasks to us.  Be careful of negative numbers as they'll
 	 * appear as very large values with unsigned longs.
 	 */
-	*imbalance = min(max_load - avg_load, avg_load - this_load);
-
 	/* How much load to actually move to equalise the imbalance */
-	*imbalance = (*imbalance * min(busiest->cpu_power, this->cpu_power))
-				/ SCHED_LOAD_SCALE;
+	*imbalance = min((max_load - avg_load) * busiest->cpu_power,
+				(avg_load - this_load) * this->cpu_power)
+			/ SCHED_LOAD_SCALE;
 
-	if (*imbalance < SCHED_LOAD_SCALE - 1) {
+	if (*imbalance < SCHED_LOAD_SCALE) {
 		unsigned long pwr_now = 0, pwr_move = 0;
 		unsigned long tmp;
 
@@ -1862,14 +1861,16 @@
 							max_load - tmp);
 
 		/* Amount of load we'd add */
-		tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/this->cpu_power;
-		if (max_load < tmp)
-			tmp = max_load;
+		if (max_load*busiest->cpu_power <
+				SCHED_LOAD_SCALE*SCHED_LOAD_SCALE)
+			tmp = max_load*busiest->cpu_power/this->cpu_power;
+		else
+			tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/this->cpu_power;
 		pwr_move += this->cpu_power*min(SCHED_LOAD_SCALE, this_load + tmp);
 		pwr_move /= SCHED_LOAD_SCALE;
 
-		/* Move if we gain another 8th of a CPU worth of throughput */
-		if (pwr_move < pwr_now + SCHED_LOAD_SCALE / 8)
+		/* Move if we gain throughput */
+		if (pwr_move <= pwr_now)
 			goto out_balanced;
 
 		*imbalance = 1;
@@ -1877,7 +1878,7 @@
 	}
 
 	/* Get rid of the scaling factor, rounding down as we divide */
-	*imbalance = (*imbalance + 1) / SCHED_LOAD_SCALE;
+	*imbalance = *imbalance / SCHED_LOAD_SCALE;
 
 	return busiest;
 


* [PATCH 5/13] find_busiest_group cleanup
  2005-02-24  7:19       ` [PATCH 4/13] find_busiest_group fixlets Nick Piggin
@ 2005-02-24  7:20         ` Nick Piggin
  2005-02-24  7:21           ` [PATCH 6/13] no aggressive idle balancing Nick Piggin
  2005-02-24  8:36         ` [PATCH 4/13] find_busiest_group fixlets Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5 bytes --]

5/13

[-- Attachment #2: sched-cleanup-fbg.patch --]
[-- Type: text/x-patch, Size: 761 bytes --]

Clean up find_busiest_group a bit. The new sched-domains code
means we can't have groups without a CPU.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:31:29.298502546 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:38.629469074 +1100
@@ -1771,7 +1771,7 @@
 	do {
 		unsigned long load;
 		int local_group;
-		int i, nr_cpus = 0;
+		int i;
 
 		local_group = cpu_isset(this_cpu, group->cpumask);
 
@@ -1785,13 +1785,9 @@
 			else
 				load = source_load(i);
 
-			nr_cpus++;
 			avg_load += load;
 		}
 
-		if (!nr_cpus)
-			goto nextgroup;
-
 		total_load += avg_load;
 		total_pwr += group->cpu_power;
 


* [PATCH 6/13] no aggressive idle balancing
  2005-02-24  7:20         ` [PATCH 5/13] find_busiest_group cleanup Nick Piggin
@ 2005-02-24  7:21           ` Nick Piggin
  2005-02-24  7:22             ` [PATCH 7/13] better active balancing heuristic Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7 bytes --]


6/13


[-- Attachment #2: sched-lessaggressive-idle.patch --]
[-- Type: text/x-patch, Size: 860 bytes --]

Remove the special casing for idle CPU balancing. Things like this hurt
on SMT, for example, where a single sibling being idle doesn't really
warrant an aggressive pull over the NUMA domain.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:31:43.537742489 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:38.340504724 +1100
@@ -1875,15 +1875,9 @@
 
 	/* Get rid of the scaling factor, rounding down as we divide */
 	*imbalance = *imbalance / SCHED_LOAD_SCALE;
-
 	return busiest;
 
 out_balanced:
-	if (busiest && (idle == NEWLY_IDLE ||
-			(idle == SCHED_IDLE && max_load > SCHED_LOAD_SCALE)) ) {
-		*imbalance = 1;
-		return busiest;
-	}
 
 	*imbalance = 0;
 	return NULL;


* [PATCH 7/13] better active balancing heuristic
  2005-02-24  7:21           ` [PATCH 6/13] no aggressive idle balancing Nick Piggin
@ 2005-02-24  7:22             ` Nick Piggin
  2005-02-24  7:24               ` [PATCH 8/13] generalised CPU load averaging Nick Piggin
  2005-02-24  8:39               ` [PATCH 7/13] better active balancing heuristic Ingo Molnar
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

7/13


[-- Attachment #2: sched-less-alb.patch --]
[-- Type: text/x-patch, Size: 1630 bytes --]

Fix up active load balancing a bit so it doesn't get called when it shouldn't.
Reset the nr_balance_failed counter at more points where we have found
conditions to be balanced. This reduces the overly aggressive active balancing
seen on some workloads.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:39:05.851128944 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:38.162526682 +1100
@@ -2009,6 +2009,7 @@
 
 	schedstat_inc(sd, lb_balanced[idle]);
 
+	sd->nr_balance_failed = 0;
 	/* tune up the balancing interval */
 	if (sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
@@ -2034,16 +2035,14 @@
 	schedstat_inc(sd, lb_cnt[NEWLY_IDLE]);
 	group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE);
 	if (!group) {
-		schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
 		schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]);
-		goto out;
+		goto out_balanced;
 	}
 
 	busiest = find_busiest_queue(group);
 	if (!busiest || busiest == this_rq) {
-		schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
 		schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]);
-		goto out;
+		goto out_balanced;
 	}
 
 	/* Attempt to move tasks */
@@ -2054,11 +2053,16 @@
 			imbalance, sd, NEWLY_IDLE, &all_pinned);
 	if (!nr_moved)
 		schedstat_inc(sd, lb_failed[NEWLY_IDLE]);
 	else
+		sd->nr_balance_failed = 0;
 
 	spin_unlock(&busiest->lock);
-
-out:
 	return nr_moved;
+
+out_balanced:
+	schedstat_inc(sd, lb_balanced[NEWLY_IDLE]);
+	sd->nr_balance_failed = 0;
+	return 0;
 }
 
 /*


* [PATCH 8/13] generalised CPU load averaging
  2005-02-24  7:22             ` [PATCH 7/13] better active balancing heuristic Nick Piggin
@ 2005-02-24  7:24               ` Nick Piggin
  2005-02-24  7:25                 ` [PATCH 9/13] less affine wakeups Nick Piggin
  2005-02-24  8:39               ` [PATCH 7/13] better active balancing heuristic Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

8/13


[-- Attachment #2: sched-balance-timers.patch --]
[-- Type: text/x-patch, Size: 10324 bytes --]

Do CPU load averaging over a number of different intervals. Allow
each interval to be chosen by passing a parameter to source_load
and target_load: 0 is instantaneous, idx > 0 returns a decaying average
where the most recent sample is weighted in at 1/2^(idx-1), up to a
maximum idx of 3 (which could easily be increased).

So generally a higher number will result in more conservative balancing.
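
As a rough worked example (assuming SCHED_LOAD_SCALE of 128): if an
idle CPU suddenly gets two runnable tasks, the instantaneous load is
256. After one tick cpu_load[] is about { 256, 128, 64 }, so
source_load() with idx 3 still reports only a quarter of the spike
while idx 0 or 1 already see all of it - hence the idle/newidle
indices below are kept small and the busy index large.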

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/include/asm-i386/topology.h
===================================================================
--- linux-2.6.orig/include/asm-i386/topology.h	2005-02-24 17:31:22.664322588 +1100
+++ linux-2.6/include/asm-i386/topology.h	2005-02-24 17:43:37.733579601 +1100
@@ -77,6 +77,10 @@
 	.imbalance_pct		= 125,			\
 	.cache_hot_time		= (10*1000000),		\
 	.cache_nice_tries	= 1,			\
+	.busy_idx		= 3,			\
+	.idle_idx		= 1,			\
+	.newidle_idx		= 2,			\
+	.wake_idx		= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
Index: linux-2.6/include/asm-x86_64/topology.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/topology.h	2005-02-24 17:31:22.664322588 +1100
+++ linux-2.6/include/asm-x86_64/topology.h	2005-02-24 17:43:37.733579601 +1100
@@ -49,7 +49,11 @@
 	.busy_factor		= 32,			\
 	.imbalance_pct		= 125,			\
 	.cache_hot_time		= (10*1000000),		\
-	.cache_nice_tries	= 1,			\
+	.cache_nice_tries	= 2,			\
+	.busy_idx		= 3,			\
+	.idle_idx		= 2,			\
+	.newidle_idx		= 1, 			\
+	.wake_idx		= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2005-02-24 17:31:28.428610071 +1100
+++ linux-2.6/include/linux/sched.h	2005-02-24 17:43:37.503607973 +1100
@@ -451,6 +451,10 @@
 	unsigned long long cache_hot_time; /* Task considered cache hot (ns) */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
 	unsigned int per_cpu_gain;	/* CPU % gained by adding domain cpus */
+	unsigned int busy_idx;
+	unsigned int idle_idx;
+	unsigned int newidle_idx;
+	unsigned int wake_idx;
 	int flags;			/* See SD_* */
 
 	/* Runtime fields. */
Index: linux-2.6/include/linux/topology.h
===================================================================
--- linux-2.6.orig/include/linux/topology.h	2005-02-24 17:31:22.665322464 +1100
+++ linux-2.6/include/linux/topology.h	2005-02-24 17:43:37.733579601 +1100
@@ -86,6 +86,10 @@
 	.cache_hot_time		= 0,			\
 	.cache_nice_tries	= 0,			\
 	.per_cpu_gain		= 25,			\
+	.busy_idx		= 0,			\
+	.idle_idx		= 0,			\
+	.newidle_idx		= 0,			\
+	.wake_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
@@ -112,6 +116,10 @@
 	.cache_hot_time		= (5*1000000/2),	\
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
+	.busy_idx		= 2,			\
+	.idle_idx		= 0,			\
+	.newidle_idx		= 1,			\
+	.wake_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:39:06.530045151 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:37.913557397 +1100
@@ -204,7 +204,7 @@
 	 */
 	unsigned long nr_running;
 #ifdef CONFIG_SMP
-	unsigned long cpu_load;
+	unsigned long cpu_load[3];
 #endif
 	unsigned long long nr_switches;
 
@@ -884,23 +884,27 @@
  * We want to under-estimate the load of migration sources, to
  * balance conservatively.
  */
-static inline unsigned long source_load(int cpu)
+static inline unsigned long source_load(int cpu, int type)
 {
 	runqueue_t *rq = cpu_rq(cpu);
 	unsigned long load_now = rq->nr_running * SCHED_LOAD_SCALE;
+	if (type == 0)
+		return load_now;
 
-	return min(rq->cpu_load, load_now);
+	return min(rq->cpu_load[type-1], load_now);
 }
 
 /*
  * Return a high guess at the load of a migration-target cpu
  */
-static inline unsigned long target_load(int cpu)
+static inline unsigned long target_load(int cpu, int type)
 {
 	runqueue_t *rq = cpu_rq(cpu);
 	unsigned long load_now = rq->nr_running * SCHED_LOAD_SCALE;
+	if (type == 0)
+		return load_now;
 
-	return max(rq->cpu_load, load_now);
+	return max(rq->cpu_load[type-1], load_now);
 }
 
 #endif
@@ -965,7 +969,7 @@
 	runqueue_t *rq;
 #ifdef CONFIG_SMP
 	unsigned long load, this_load;
-	struct sched_domain *sd;
+	struct sched_domain *sd, *this_sd = NULL;
 	int new_cpu;
 #endif
 
@@ -984,72 +988,64 @@
 	if (unlikely(task_running(rq, p)))
 		goto out_activate;
 
-#ifdef CONFIG_SCHEDSTATS
+	new_cpu = cpu;
+
 	schedstat_inc(rq, ttwu_cnt);
 	if (cpu == this_cpu) {
 		schedstat_inc(rq, ttwu_local);
-	} else {
-		for_each_domain(this_cpu, sd) {
-			if (cpu_isset(cpu, sd->span)) {
-				schedstat_inc(sd, ttwu_wake_remote);
-				break;
-			}
+		goto out_set_cpu;
+	}
+
+	for_each_domain(this_cpu, sd) {
+		if (cpu_isset(cpu, sd->span)) {
+			schedstat_inc(sd, ttwu_wake_remote);
+			this_sd = sd;
+			break;
 		}
 	}
-#endif
 
-	new_cpu = cpu;
-	if (cpu == this_cpu || unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
+	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
 		goto out_set_cpu;
 
-	load = source_load(cpu);
-	this_load = target_load(this_cpu);
-
 	/*
-	 * If sync wakeup then subtract the (maximum possible) effect of
-	 * the currently running task from the load of the current CPU:
+	 * Check for affine wakeup and passive balancing possibilities.
 	 */
-	if (sync)
-		this_load -= SCHED_LOAD_SCALE;
-
-	/* Don't pull the task off an idle CPU to a busy one */
-	if (load < SCHED_LOAD_SCALE/2 && this_load > SCHED_LOAD_SCALE/2)
-		goto out_set_cpu;
+	if (this_sd) {
+		int idx = this_sd->wake_idx;
+		unsigned int imbalance;
 
-	new_cpu = this_cpu; /* Wake to this CPU if we can */
+		load = source_load(cpu, idx);
+		this_load = target_load(this_cpu, idx);
 
-	/*
-	 * Scan domains for affine wakeup and passive balancing
-	 * possibilities.
-	 */
-	for_each_domain(this_cpu, sd) {
-		unsigned int imbalance;
 		/*
-		 * Start passive balancing when half the imbalance_pct
-		 * limit is reached.
+		 * If sync wakeup then subtract the (maximum possible) effect of
+		 * the currently running task from the load of the current CPU:
 		 */
-		imbalance = sd->imbalance_pct + (sd->imbalance_pct - 100) / 2;
+		if (sync)
+			this_load -= SCHED_LOAD_SCALE;
 
-		if ((sd->flags & SD_WAKE_AFFINE) &&
-				!task_hot(p, rq->timestamp_last_tick, sd)) {
+		 /* Don't pull the task off an idle CPU to a busy one */
+		if (load < SCHED_LOAD_SCALE/2 && this_load > SCHED_LOAD_SCALE/2)
+			goto out_set_cpu;
+
+		new_cpu = this_cpu; /* Wake to this CPU if we can */
+
+		if ((this_sd->flags & SD_WAKE_AFFINE) &&
+			!task_hot(p, rq->timestamp_last_tick, this_sd)) {
 			/*
 			 * This domain has SD_WAKE_AFFINE and p is cache cold
 			 * in this domain.
 			 */
-			if (cpu_isset(cpu, sd->span)) {
-				schedstat_inc(sd, ttwu_move_affine);
-				goto out_set_cpu;
-			}
-		} else if ((sd->flags & SD_WAKE_BALANCE) &&
+			schedstat_inc(this_sd, ttwu_move_affine);
+			goto out_set_cpu;
+		} else if ((this_sd->flags & SD_WAKE_BALANCE) &&
 				imbalance*this_load <= 100*load) {
 			/*
 			 * This domain has SD_WAKE_BALANCE and there is
 			 * an imbalance.
 			 */
-			if (cpu_isset(cpu, sd->span)) {
-				schedstat_inc(sd, ttwu_move_balance);
-				goto out_set_cpu;
-			}
+			schedstat_inc(this_sd, ttwu_move_balance);
+			goto out_set_cpu;
 		}
 	}
 
@@ -1507,7 +1503,7 @@
 	cpus_and(mask, sd->span, p->cpus_allowed);
 
 	for_each_cpu_mask(i, mask) {
-		load = target_load(i);
+		load = target_load(i, sd->wake_idx);
 
 		if (load < min_load) {
 			min_cpu = i;
@@ -1520,7 +1516,7 @@
 	}
 
 	/* add +1 to account for the new task */
-	this_load = source_load(this_cpu) + SCHED_LOAD_SCALE;
+	this_load = source_load(this_cpu, sd->wake_idx) + SCHED_LOAD_SCALE;
 
 	/*
 	 * Would with the addition of the new task to the
@@ -1765,8 +1761,15 @@
 {
 	struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups;
 	unsigned long max_load, avg_load, total_load, this_load, total_pwr;
+	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
+	if (idle == NOT_IDLE)
+		load_idx = sd->busy_idx;
+	else if (idle == NEWLY_IDLE)
+		load_idx = sd->newidle_idx;
+	else
+		load_idx = sd->idle_idx;
 
 	do {
 		unsigned long load;
@@ -1781,9 +1784,9 @@
 		for_each_cpu_mask(i, group->cpumask) {
 			/* Bias balancing toward cpus of our domain */
 			if (local_group)
-				load = target_load(i);
+				load = target_load(i, load_idx);
 			else
-				load = source_load(i);
+				load = source_load(i, load_idx);
 
 			avg_load += load;
 		}
@@ -1893,7 +1896,7 @@
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
-		load = source_load(i);
+		load = source_load(i, 0);
 
 		if (load > max_load) {
 			max_load = load;
@@ -2165,18 +2168,23 @@
 	unsigned long old_load, this_load;
 	unsigned long j = jiffies + CPU_OFFSET(this_cpu);
 	struct sched_domain *sd;
+	int i;
 
-	/* Update our load */
-	old_load = this_rq->cpu_load;
 	this_load = this_rq->nr_running * SCHED_LOAD_SCALE;
-	/*
-	 * Round up the averaging division if load is increasing. This
-	 * prevents us from getting stuck on 9 if the load is 10, for
-	 * example.
-	 */
-	if (this_load > old_load)
-		old_load++;
-	this_rq->cpu_load = (old_load + this_load) / 2;
+	/* Update our load */
+	for (i = 0; i < 3; i++) {
+		unsigned long new_load = this_load;
+		int scale = 1 << i;
+		old_load = this_rq->cpu_load[i];
+		/*
+		 * Round up the averaging division if load is increasing. This
+		 * prevents us from getting stuck on 9 if the load is 10, for
+		 * example.
+		 */
+		if (new_load > old_load)
+			new_load += scale-1;
+		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) / scale;
+	}
 
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
@@ -4958,13 +4966,15 @@
 
 		rq = cpu_rq(i);
 		spin_lock_init(&rq->lock);
+		rq->nr_running = 0;
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
 		rq->best_expired_prio = MAX_PRIO;
 
 #ifdef CONFIG_SMP
 		rq->sd = &sched_domain_dummy;
-		rq->cpu_load = 0;
+		for (j = 1; j < 3; j++)
+			rq->cpu_load[j] = 0;
 		rq->active_balance = 0;
 		rq->push_cpu = 0;
 		rq->migration_thread = NULL;


* [PATCH 9/13] less affine wakeups
  2005-02-24  7:24               ` [PATCH 8/13] generalised CPU load averaging Nick Piggin
@ 2005-02-24  7:25                 ` Nick Piggin
  2005-02-24  7:27                   ` [PATCH 10/13] remove aggressive idle balancing Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

9/13


[-- Attachment #2: sched-tweak-wakeaffine.patch --]
[-- Type: text/x-patch, Size: 3051 bytes --]

Do less affine wakeups. We're trying to reduce dbt2-pgsql idle time
regressions here... make sure we don't move tasks the wrong way
in an imbalance condition. Also, remove the cache coldness requirement
from the calculation - it seems to induce sharp cutoff points where
behaviour will suddenly change on some workloads if the load creeps
slightly over or under some point. It is good for periodic balancing
because in that case we have no other context to determine what task
to move.

But also make a minor tweak to "wake balancing" - the imbalance
tolerance is now set at half the domain's imbalance, so we get the
opportunity to do wake balancing before the more random periodic
rebalancing gets performed.
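
(As a worked example, with the typical imbalance_pct of 125 this means
wake balancing pulls the task once its old CPU is about 12% busier than
the waking CPU - 112*this_load <= 100*load - instead of waiting for the
full 25% that periodic balancing uses.)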

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:39:06.808010844 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:37.734579478 +1100
@@ -1014,38 +1014,45 @@
 		int idx = this_sd->wake_idx;
 		unsigned int imbalance;
 
+		imbalance = 100 + (this_sd->imbalance_pct - 100) / 2;
+
 		load = source_load(cpu, idx);
 		this_load = target_load(this_cpu, idx);
 
-		/*
-		 * If sync wakeup then subtract the (maximum possible) effect of
-		 * the currently running task from the load of the current CPU:
-		 */
-		if (sync)
-			this_load -= SCHED_LOAD_SCALE;
-
-		 /* Don't pull the task off an idle CPU to a busy one */
-		if (load < SCHED_LOAD_SCALE/2 && this_load > SCHED_LOAD_SCALE/2)
-			goto out_set_cpu;
-
 		new_cpu = this_cpu; /* Wake to this CPU if we can */
 
-		if ((this_sd->flags & SD_WAKE_AFFINE) &&
-			!task_hot(p, rq->timestamp_last_tick, this_sd)) {
-			/*
-			 * This domain has SD_WAKE_AFFINE and p is cache cold
-			 * in this domain.
-			 */
-			schedstat_inc(this_sd, ttwu_move_affine);
-			goto out_set_cpu;
-		} else if ((this_sd->flags & SD_WAKE_BALANCE) &&
-				imbalance*this_load <= 100*load) {
+		if (this_sd->flags & SD_WAKE_AFFINE) {
+			unsigned long tl = this_load;
 			/*
-			 * This domain has SD_WAKE_BALANCE and there is
-			 * an imbalance.
+			 * If sync wakeup then subtract the (maximum possible)
+			 * effect of the currently running task from the load
+			 * of the current CPU:
 			 */
-			schedstat_inc(this_sd, ttwu_move_balance);
-			goto out_set_cpu;
+			if (sync)
+				tl -= SCHED_LOAD_SCALE;
+
+			if ((tl <= load &&
+				tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+				100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+				/*
+				 * This domain has SD_WAKE_AFFINE and
+				 * p is cache cold in this domain, and
+				 * there is no bad imbalance.
+				 */
+				schedstat_inc(this_sd, ttwu_move_affine);
+				goto out_set_cpu;
+			}
+		}
+
+		/*
+		 * Start passive balancing when half the imbalance_pct
+		 * limit is reached.
+		 */
+		if (this_sd->flags & SD_WAKE_BALANCE) {
+			if (imbalance*this_load <= 100*load) {
+				schedstat_inc(this_sd, ttwu_move_balance);
+				goto out_set_cpu;
+			}
 		}
 	}
 


* [PATCH 10/13] remove aggressive idle balancing
  2005-02-24  7:25                 ` [PATCH 9/13] less affine wakeups Nick Piggin
@ 2005-02-24  7:27                   ` Nick Piggin
  2005-02-24  7:28                     ` [PATCH 11/13] sched-domains aware balance-on-fork Nick Piggin
  2005-02-24  8:41                     ` [PATCH 10/13] remove aggressive idle balancing Ingo Molnar
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6 bytes --]

10/13

[-- Attachment #2: sched-noaggressive-idle.patch --]
[-- Type: text/x-patch, Size: 3024 bytes --]

Remove the very aggressive idle stuff that has recently gone into
2.6 - it goes against the direction we are trying to take. Hopefully
we can regain performance through other methods.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/include/asm-i386/topology.h
===================================================================
--- linux-2.6.orig/include/asm-i386/topology.h	2005-02-24 17:39:06.805011214 +1100
+++ linux-2.6/include/asm-i386/topology.h	2005-02-24 17:39:07.320947536 +1100
@@ -85,7 +85,6 @@
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_EXEC	\
 				| SD_BALANCE_NEWIDLE	\
-				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
Index: linux-2.6/include/asm-x86_64/topology.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/topology.h	2005-02-24 17:39:06.805011214 +1100
+++ linux-2.6/include/asm-x86_64/topology.h	2005-02-24 17:43:37.503607973 +1100
@@ -58,7 +58,6 @@
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
-				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
Index: linux-2.6/include/linux/topology.h
===================================================================
--- linux-2.6.orig/include/linux/topology.h	2005-02-24 17:39:06.806011090 +1100
+++ linux-2.6/include/linux/topology.h	2005-02-24 17:43:37.503607973 +1100
@@ -124,7 +124,6 @@
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
-				| SD_WAKE_IDLE		\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:39:07.057979992 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:37.504607850 +1100
@@ -412,22 +412,6 @@
 	return rq;
 }
 
-#ifdef CONFIG_SCHED_SMT
-static int cpu_and_siblings_are_idle(int cpu)
-{
-	int sib;
-	for_each_cpu_mask(sib, cpu_sibling_map[cpu]) {
-		if (idle_cpu(sib))
-			continue;
-		return 0;
-	}
-
-	return 1;
-}
-#else
-#define cpu_and_siblings_are_idle(A) idle_cpu(A)
-#endif
-
 #ifdef CONFIG_SCHEDSTATS
 /*
  * Called when a process is dequeued from the active array and given
@@ -1650,16 +1634,15 @@
 
 	/*
 	 * Aggressive migration if:
-	 * 1) the [whole] cpu is idle, or
+	 * 1) task is cache cold, or
 	 * 2) too many balance attempts have failed.
 	 */
 
-	if (cpu_and_siblings_are_idle(this_cpu) || \
-			sd->nr_balance_failed > sd->cache_nice_tries)
+	if (sd->nr_balance_failed > sd->cache_nice_tries)
 		return 1;
 
 	if (task_hot(p, rq->timestamp_last_tick, sd))
-			return 0;
+		return 0;
 	return 1;
 }
 
@@ -2131,7 +2114,7 @@
 				if (cpu_isset(cpu, visited_cpus))
 					continue;
 				cpu_set(cpu, visited_cpus);
-				if (!cpu_and_siblings_are_idle(cpu) || cpu == busiest_cpu)
+				if (cpu == busiest_cpu)
 					continue;
 
 				target_rq = cpu_rq(cpu);


* [PATCH 11/13] sched-domains aware balance-on-fork
  2005-02-24  7:27                   ` [PATCH 10/13] remove aggressive idle balancing Nick Piggin
@ 2005-02-24  7:28                     ` Nick Piggin
  2005-02-24  7:28                       ` [PATCH 12/13] schedstats additions for sched-balance-fork Nick Piggin
  2005-02-24  8:41                     ` [PATCH 10/13] remove aggressive idle balancing Ingo Molnar
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7 bytes --]

11/13


[-- Attachment #2: sched-balance-fork.patch --]
[-- Type: text/x-patch, Size: 8382 bytes --]

Reimplement balance-on-exec to be sched-domains aware. Use this to also
do balance-on-fork balancing. Make x86_64 do balance-on-fork over the
NUMA domain.

The problem with the non-sched-domains-aware balancing became apparent
on dual-core, multi-socket Opterons. What we want is for new tasks to
be sent to a different socket, but more often than not we would first
load up our sibling core, or fill two cores of a single remote socket,
before selecting a new one.

This gives large improvements to STREAM on such systems.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


Index: linux-2.6/include/asm-x86_64/topology.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/topology.h	2005-02-24 17:39:07.320947536 +1100
+++ linux-2.6/include/asm-x86_64/topology.h	2005-02-24 17:43:37.077660523 +1100
@@ -54,9 +54,11 @@
 	.idle_idx		= 2,			\
 	.newidle_idx		= 1, 			\
 	.wake_idx		= 1,			\
+	.forkexec_idx		= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_BALANCE,	\
 	.last_balance		= jiffies,		\
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2005-02-24 17:39:06.806011090 +1100
+++ linux-2.6/include/linux/sched.h	2005-02-24 17:43:37.274636222 +1100
@@ -423,10 +423,11 @@
 #define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */
 #define SD_BALANCE_NEWIDLE	2	/* Balance when about to become idle */
 #define SD_BALANCE_EXEC		4	/* Balance on exec */
-#define SD_WAKE_IDLE		8	/* Wake to idle CPU on task wakeup */
-#define SD_WAKE_AFFINE		16	/* Wake task to waking CPU */
-#define SD_WAKE_BALANCE		32	/* Perform balancing at task wakeup */
-#define SD_SHARE_CPUPOWER	64	/* Domain members share cpu power */
+#define SD_BALANCE_FORK		8	/* Balance on fork, clone */
+#define SD_WAKE_IDLE		16	/* Wake to idle CPU on task wakeup */
+#define SD_WAKE_AFFINE		32	/* Wake task to waking CPU */
+#define SD_WAKE_BALANCE		64	/* Perform balancing at task wakeup */
+#define SD_SHARE_CPUPOWER	128	/* Domain members share cpu power */
 
 struct sched_group {
 	struct sched_group *next;	/* Must be a circular list */
@@ -455,6 +456,7 @@
 	unsigned int idle_idx;
 	unsigned int newidle_idx;
 	unsigned int wake_idx;
+	unsigned int forkexec_idx;
 	int flags;			/* See SD_* */
 
 	/* Runtime fields. */
Index: linux-2.6/include/linux/topology.h
===================================================================
--- linux-2.6.orig/include/linux/topology.h	2005-02-24 17:39:07.320947536 +1100
+++ linux-2.6/include/linux/topology.h	2005-02-24 17:43:37.078660399 +1100
@@ -90,6 +90,7 @@
 	.idle_idx		= 0,			\
 	.newidle_idx		= 0,			\
 	.wake_idx		= 0,			\
+	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
@@ -120,6 +121,7 @@
 	.idle_idx		= 0,			\
 	.newidle_idx		= 1,			\
 	.wake_idx		= 1,			\
+	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:39:07.322947289 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:37.274636222 +1100
@@ -891,6 +891,79 @@
 	return max(rq->cpu_load[type-1], load_now);
 }
 
+/*
+ * find_idlest_group finds and returns the least busy CPU group within the
+ * domain.
+ */
+static struct sched_group *
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
+{
+	struct sched_group *idlest = NULL, *this = NULL, *group = sd->groups;
+	unsigned long min_load = ULONG_MAX, this_load = 0;
+	int load_idx = sd->forkexec_idx;
+	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+
+	do {
+		unsigned long load, avg_load;
+		int local_group;
+		int i;
+
+		local_group = cpu_isset(this_cpu, group->cpumask);
+		/* XXX: put a cpus allowed check */
+
+		/* Tally up the load of all CPUs in the group */
+		avg_load = 0;
+
+		for_each_cpu_mask(i, group->cpumask) {
+			/* Bias balancing toward cpus of our domain */
+			if (local_group)
+				load = source_load(i, load_idx);
+			else
+				load = target_load(i, load_idx);
+
+			avg_load += load;
+		}
+
+		/* Adjust by relative CPU power of the group */
+		avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
+
+		if (local_group) {
+			this_load = avg_load;
+			this = group;
+		} else if (avg_load < min_load) {
+			min_load = avg_load;
+			idlest = group;
+		}
+		group = group->next;
+	} while (group != sd->groups);
+
+	if (!idlest || 100*this_load < imbalance*min_load)
+		return NULL;
+	return idlest;
+}
+
+/*
+ * find_idlest_cpu - find the idlest cpu among the cpus in group.
+ */
+static int find_idlest_cpu(struct sched_group *group, int this_cpu)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int idlest = -1;
+	int i;
+
+	for_each_cpu_mask(i, group->cpumask) {
+		load = source_load(i, 0);
+
+		if (load < min_load || (load == min_load && i == this_cpu)) {
+			min_load = load;
+			idlest = i;
+		}
+	}
+
+	return idlest;
+}
+
+
 #endif
 
 /*
@@ -1105,11 +1178,6 @@
 	return try_to_wake_up(p, state, 0);
 }
 
-#ifdef CONFIG_SMP
-static int find_idlest_cpu(struct task_struct *p, int this_cpu,
-			   struct sched_domain *sd);
-#endif
-
 /*
  * Perform scheduler related setup for a newly forked process p.
  * p is forked by current.
@@ -1179,12 +1247,38 @@
 	unsigned long flags;
 	int this_cpu, cpu;
 	runqueue_t *rq, *this_rq;
+#ifdef CONFIG_SMP
+	struct sched_domain *tmp, *sd = NULL;
+#endif
 
 	rq = task_rq_lock(p, &flags);
-	cpu = task_cpu(p);
+	BUG_ON(p->state != TASK_RUNNING);
 	this_cpu = smp_processor_id();
+	cpu = task_cpu(p);
 
-	BUG_ON(p->state != TASK_RUNNING);
+#ifdef CONFIG_SMP
+	for_each_domain(cpu, tmp)
+		if (tmp->flags & SD_BALANCE_FORK)
+			sd = tmp;
+
+	if (sd) {
+		struct sched_group *group;
+
+		cpu = task_cpu(p);
+		group = find_idlest_group(sd, p, cpu);
+		if (group) {
+			int new_cpu;
+			new_cpu = find_idlest_cpu(group, cpu);
+			if (new_cpu != -1 && new_cpu != cpu &&
+					cpu_isset(new_cpu, p->cpus_allowed)) {
+				set_task_cpu(p, new_cpu);
+				task_rq_unlock(rq, &flags);
+				rq = task_rq_lock(p, &flags);
+				cpu = task_cpu(p);
+			}
+		}
+	}
+#endif
 
 	/*
 	 * We decrease the sleep average of forking parents
@@ -1479,51 +1573,6 @@
 }
 
 /*
- * find_idlest_cpu - find the least busy runqueue.
- */
-static int find_idlest_cpu(struct task_struct *p, int this_cpu,
-			   struct sched_domain *sd)
-{
-	unsigned long load, min_load, this_load;
-	int i, min_cpu;
-	cpumask_t mask;
-
-	min_cpu = UINT_MAX;
-	min_load = ULONG_MAX;
-
-	cpus_and(mask, sd->span, p->cpus_allowed);
-
-	for_each_cpu_mask(i, mask) {
-		load = target_load(i, sd->wake_idx);
-
-		if (load < min_load) {
-			min_cpu = i;
-			min_load = load;
-
-			/* break out early on an idle CPU: */
-			if (!min_load)
-				break;
-		}
-	}
-
-	/* add +1 to account for the new task */
-	this_load = source_load(this_cpu, sd->wake_idx) + SCHED_LOAD_SCALE;
-
-	/*
-	 * Would with the addition of the new task to the
-	 * current CPU there be an imbalance between this
-	 * CPU and the idlest CPU?
-	 *
-	 * Use half of the balancing threshold - new-context is
-	 * a good opportunity to balance.
-	 */
-	if (min_load*(100 + (sd->imbalance_pct-100)/2) < this_load*100)
-		return min_cpu;
-
-	return this_cpu;
-}
-
-/*
  * If dest_cpu is allowed for this process, migrate the task to it.
  * This is accomplished by forcing the cpu_allowed mask to only
  * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
@@ -1576,8 +1625,15 @@
 			sd = tmp;
 
 	if (sd) {
+		struct sched_group *group;
 		schedstat_inc(sd, sbe_attempts);
-		new_cpu = find_idlest_cpu(current, this_cpu, sd);
+		group = find_idlest_group(sd, current, this_cpu);
+		if (!group)
+			goto out;
+		new_cpu = find_idlest_cpu(group, this_cpu);
+		if (new_cpu == -1)
+			goto out;
+
 		if (new_cpu != this_cpu) {
 			schedstat_inc(sd, sbe_pushed);
 			put_cpu();
@@ -1790,12 +1846,10 @@
 		if (local_group) {
 			this_load = avg_load;
 			this = group;
-			goto nextgroup;
 		} else if (avg_load > max_load) {
 			max_load = avg_load;
 			busiest = group;
 		}
-nextgroup:
 		group = group->next;
 	} while (group != sd->groups);
 


* [PATCH 12/13] schedstats additions for sched-balance-fork
  2005-02-24  7:28                     ` [PATCH 11/13] sched-domains aware balance-on-fork Nick Piggin
@ 2005-02-24  7:28                       ` Nick Piggin
  2005-02-24  7:30                         ` [PATCH 13/13] basic tuning Nick Piggin
  2005-02-24  8:46                         ` [PATCH 12/13] schedstats additions for sched-balance-fork Ingo Molnar
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7 bytes --]

12/13


[-- Attachment #2: sched-stat-sbf.patch --]
[-- Type: text/x-patch, Size: 3935 bytes --]

Add SCHEDSTAT statistics for sched-balance-fork.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>


Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2005-02-24 17:39:07.616911007 +1100
+++ linux-2.6/include/linux/sched.h	2005-02-24 17:39:07.819885956 +1100
@@ -480,10 +480,16 @@
 	unsigned long alb_failed;
 	unsigned long alb_pushed;
 
-	/* sched_balance_exec() stats */
-	unsigned long sbe_attempts;
+	/* SD_BALANCE_EXEC stats */
+	unsigned long sbe_cnt;
+	unsigned long sbe_balanced;
 	unsigned long sbe_pushed;
 
+	/* SD_BALANCE_FORK stats */
+	unsigned long sbf_cnt;
+	unsigned long sbf_balanced;
+	unsigned long sbf_pushed;
+
 	/* try_to_wake_up() stats */
 	unsigned long ttwu_wake_remote;
 	unsigned long ttwu_move_affine;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-02-24 17:39:07.618910761 +1100
+++ linux-2.6/kernel/sched.c	2005-02-24 17:43:36.887683960 +1100
@@ -307,7 +307,7 @@
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 11
+#define SCHEDSTAT_VERSION 12
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -354,9 +354,10 @@
 				    sd->lb_nobusyq[itype],
 				    sd->lb_nobusyg[itype]);
 			}
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu\n",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			    sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
-			    sd->sbe_pushed, sd->sbe_attempts,
+			    sd->sbe_cnt, sd->sbe_balanced, sd->sbe_pushed,
+			    sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed,
 			    sd->ttwu_wake_remote, sd->ttwu_move_affine, sd->ttwu_move_balance);
 		}
 #endif
@@ -1262,24 +1263,34 @@
 			sd = tmp;
 
 	if (sd) {
+		int new_cpu;
 		struct sched_group *group;
 
+		schedstat_inc(sd, sbf_cnt);
 		cpu = task_cpu(p);
 		group = find_idlest_group(sd, p, cpu);
-		if (group) {
-			int new_cpu;
-			new_cpu = find_idlest_cpu(group, cpu);
-			if (new_cpu != -1 && new_cpu != cpu &&
-					cpu_isset(new_cpu, p->cpus_allowed)) {
-				set_task_cpu(p, new_cpu);
-				task_rq_unlock(rq, &flags);
-				rq = task_rq_lock(p, &flags);
-				cpu = task_cpu(p);
-			}
+		if (!group) {
+			schedstat_inc(sd, sbf_balanced);
+			goto no_forkbalance;
+		}
+
+		new_cpu = find_idlest_cpu(group, cpu);
+		if (new_cpu == -1 || new_cpu == cpu) {
+			schedstat_inc(sd, sbf_balanced);
+			goto no_forkbalance;
+		}
+
+		if (cpu_isset(new_cpu, p->cpus_allowed)) {
+			schedstat_inc(sd, sbf_pushed);
+			set_task_cpu(p, new_cpu);
+			task_rq_unlock(rq, &flags);
+			rq = task_rq_lock(p, &flags);
+			cpu = task_cpu(p);
 		}
 	}
-#endif
 
+no_forkbalance:
+#endif
 	/*
 	 * We decrease the sleep average of forking parents
 	 * and children as well, to keep max-interactive tasks
@@ -1616,30 +1627,28 @@
 	struct sched_domain *tmp, *sd = NULL;
 	int new_cpu, this_cpu = get_cpu();
 
-	/* Prefer the current CPU if there's only this task running */
-	if (this_rq()->nr_running <= 1)
-		goto out;
-
 	for_each_domain(this_cpu, tmp)
 		if (tmp->flags & SD_BALANCE_EXEC)
 			sd = tmp;
 
 	if (sd) {
 		struct sched_group *group;
-		schedstat_inc(sd, sbe_attempts);
+		schedstat_inc(sd, sbe_cnt);
 		group = find_idlest_group(sd, current, this_cpu);
-		if (!group)
+		if (!group) {
+			schedstat_inc(sd, sbe_balanced);
 			goto out;
+		}
 		new_cpu = find_idlest_cpu(group, this_cpu);
-		if (new_cpu == -1)
+		if (new_cpu == -1 || new_cpu == this_cpu) {
+			schedstat_inc(sd, sbe_balanced);
 			goto out;
-
-		if (new_cpu != this_cpu) {
-			schedstat_inc(sd, sbe_pushed);
-			put_cpu();
-			sched_migrate_task(current, new_cpu);
-			return;
 		}
+
+		schedstat_inc(sd, sbe_pushed);
+		put_cpu();
+		sched_migrate_task(current, new_cpu);
+		return;
 	}
 out:
 	put_cpu();

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 13/13] basic tuning
  2005-02-24  7:28                       ` [PATCH 12/13] schedstats additions for sched-balance-fork Nick Piggin
@ 2005-02-24  7:30                         ` Nick Piggin
  2005-02-24  8:46                         ` [PATCH 12/13] schedstats additions for sched-balance-fork Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7 bytes --]

13/13


[-- Attachment #2: sched-tune.patch --]
[-- Type: text/x-patch, Size: 1504 bytes --]

Do some basic initial tuning.

Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>

Index: linux-2.6/include/asm-x86_64/topology.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/topology.h	2005-02-24 17:39:07.615911131 +1100
+++ linux-2.6/include/asm-x86_64/topology.h	2005-02-24 17:39:07.990864853 +1100
@@ -52,12 +52,11 @@
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 2,			\
-	.newidle_idx		= 1, 			\
+	.newidle_idx		= 0, 			\
 	.wake_idx		= 1,			\
 	.forkexec_idx		= 1,			\
 	.per_cpu_gain		= 100,			\
 	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_BALANCE,	\
Index: linux-2.6/include/linux/topology.h
===================================================================
--- linux-2.6.orig/include/linux/topology.h	2005-02-24 17:39:07.616911007 +1100
+++ linux-2.6/include/linux/topology.h	2005-02-24 17:39:07.991864730 +1100
@@ -118,15 +118,14 @@
 	.cache_nice_tries	= 1,			\
 	.per_cpu_gain		= 100,			\
 	.busy_idx		= 2,			\
-	.idle_idx		= 0,			\
-	.newidle_idx		= 1,			\
+	.idle_idx		= 1,			\
+	.newidle_idx		= 2,			\
 	.wake_idx		= 1,			\
-	.forkexec_idx		= 0,			\
+	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
 				| SD_BALANCE_EXEC	\
-				| SD_WAKE_AFFINE	\
-				| SD_WAKE_BALANCE,	\
+				| SD_WAKE_AFFINE,	\
 	.last_balance		= jiffies,		\
 	.balance_interval	= 1,			\
 	.nr_balance_failed	= 0,			\

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/13] timestamp fixes
  2005-02-24  7:16 ` [PATCH 1/13] timestamp fixes Nick Piggin
  2005-02-24  7:16   ` [PATCH 2/13] improve pinned task handling Nick Piggin
@ 2005-02-24  7:46   ` Ingo Molnar
  2005-02-24  7:56     ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  7:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> 1/13
> 

ugh, has this been tested? It needs the patch below.

	Ingo

Signed-off-by: Ingo Molnar <mingo@elte.hu>

--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -2704,11 +2704,11 @@ need_resched_nonpreemptible:
 
 	schedstat_inc(rq, sched_cnt);
 	now = sched_clock();
-	if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG))
+	if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG)) {
 		run_time = now - prev->timestamp;
 		if (unlikely((long long)now - prev->timestamp < 0))
 			run_time = 0;
-	else
+	} else
 		run_time = NS_MAX_SLEEP_AVG;
 
 	/*

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/13] timestamp fixes
  2005-02-24  7:46   ` [PATCH 1/13] timestamp fixes Ingo Molnar
@ 2005-02-24  7:56     ` Nick Piggin
  2005-02-24  8:34       ` Ingo Molnar
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-24  7:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel

On Thu, 2005-02-24 at 08:46 +0100, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> > 1/13
> > 
> 
> ugh, has this been tested? It needs the patch below.
> 

Yes. Which might also explain why I didn't see -ve intervals :(
Thanks Ingo.

In the context of the whole patchset, testing has mainly been
based around multiprocessor behaviour so this doesn't invalidate
that.




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/13] improve pinned task handling
  2005-02-24  7:16   ` [PATCH 2/13] improve pinned task handling Nick Piggin
  2005-02-24  7:18     ` [PATCH 3/13] rework schedstats Nick Piggin
@ 2005-02-24  8:04     ` Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  8:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> 2/13

yeah, we need this. (Eventually someone should explore a different way
to handle affine tasks as this is getting quirky, although it looks
algorithmically impossible in an O(1) way.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/13] rework schedstats
  2005-02-24  7:18     ` [PATCH 3/13] rework schedstats Nick Piggin
  2005-02-24  7:19       ` [PATCH 4/13] find_busiest_group fixlets Nick Piggin
@ 2005-02-24  8:07       ` Ingo Molnar
  2005-02-25 10:50       ` Rick Lindsley
  2 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  8:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> 3/13
> 
> I have an updated userspace parser for this thing, if you
> are still keeping it on your website.

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/13] timestamp fixes
  2005-02-24  7:56     ` Nick Piggin
@ 2005-02-24  8:34       ` Ingo Molnar
  0 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  8:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> On Thu, 2005-02-24 at 08:46 +0100, Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> > > 1/13
> > > 
> > 
> > ugh, has this been tested? It needs the patch below.
> > 
> 
> Yes. Which might also explain why I didn't see -ve intervals :( Thanks
> Ingo.
> 
> In the context of the whole patchset, testing has mainly been based
> around multiprocessor behaviour so this doesn't invalidate that.

nono, by 'this' i only meant that patch. The other ones look mainly OK,
but obviously they need a _ton_ of testing.

these:

 [PATCH 1/13] timestamp fixes
   (+fix)
 [PATCH 2/13] improve pinned task handling
 [PATCH 3/13] rework schedstats

can go into BK right after 2.6.11 is released as they are fixes or
no-risk improvements. [let's call them 'group A'] These three:

 [PATCH 4/13] find_busiest_group fixlets
 [PATCH 5/13] find_busiest_group cleanup

 [PATCH 7/13] better active balancing heuristic

look pretty fine too and i'd suggest early BK integration as well - but in
theory they could impact things negatively, so that's where immediate BK
integration has to stop in the first phase, to get some feedback. [let's
call them 'group B']

these:

 [PATCH 6/13] no aggressive idle balancing

 [PATCH 8/13] generalised CPU load averaging
 [PATCH 9/13] less affine wakups
 [PATCH 10/13] remove aggressive idle balancing
 [PATCH 11/13] sched-domains aware balance-on-fork
  [PATCH 12/13] schedstats additions for sched-balance-fork
 [PATCH 13/13] basic tuning

change things radically, and i'm uneasy about them even in the 2.6.12
timeframe. [let's call them 'group C'] I'd suggest we give them a go in
-mm and see how things go, so all of them get:

  Acked-by: Ingo Molnar <mingo@elte.hu>

If things don't stabilize quickly then we need to do it piecemeal.
The only possible natural split seems to be to go for the running-task
balancing changes first:

 [PATCH 6/13] no aggressive idle balancing

 [PATCH 8/13] generalised CPU load averaging
 [PATCH 9/13] less affine wakups
 [PATCH 10/13] remove aggressive idle balancing

 [PATCH 13/13] basic tuning

perhaps #8 and relevant portions of #13 could be moved from group C into
group B and thus hit BK early, but that would need remerging.

and then for the fork/clone-balancing changes:

 [PATCH 11/13] sched-domains aware balance-on-fork
 [PATCH 12/13] schedstats additions for sched-balance-fork

a more fine-grained splitup doesn't make much sense, as these groups are
pretty compact conceptually.

But i expect fork/clone balancing to almost certainly be a problem. (We
didn't get it right for all workloads in 2.6.7, and i think it cannot be
gotten right currently either, without userspace API help - but i'd be
happy to be proven wrong.)

(if you agree with my generic analysis then when you regenerate your
patches next time please reorder them according to the flow above, and
please try to insert future fixlets not end-of-stream but according to
the conceptual grouping.)

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/13] find_busiest_group fixlets
  2005-02-24  7:19       ` [PATCH 4/13] find_busiest_group fixlets Nick Piggin
  2005-02-24  7:20         ` [PATCH 5/13] find_busiest_group cleanup Nick Piggin
@ 2005-02-24  8:36         ` Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  8:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> 4/13
> 5/13

#insert <previous mail>

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 7/13] better active balancing heuristic
  2005-02-24  7:22             ` [PATCH 7/13] better active balancing heuristic Nick Piggin
  2005-02-24  7:24               ` [PATCH 8/13] generalised CPU load averaging Nick Piggin
@ 2005-02-24  8:39               ` Ingo Molnar
  1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  8:39 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> 7/13

yeah, we need this one too.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-02-24  7:27                   ` [PATCH 10/13] remove aggressive idle balancing Nick Piggin
  2005-02-24  7:28                     ` [PATCH 11/13] sched-domains aware balance-on-fork Nick Piggin
@ 2005-02-24  8:41                     ` Ingo Molnar
  2005-02-24 12:13                       ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  8:41 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>  [PATCH 6/13] no aggressive idle balancing
>
>  [PATCH 8/13] generalised CPU load averaging
>  [PATCH 9/13] less affine wakups
>  [PATCH 10/13] remove aggressive idle balancing

they look fine, but these are the really scary ones :-) Maybe we could
do #8 and #9 first, then #6+#10. But it's probably pointless to look at
these in isolation.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 12/13] schedstats additions for sched-balance-fork
  2005-02-24  7:28                       ` [PATCH 12/13] schedstats additions for sched-balance-fork Nick Piggin
  2005-02-24  7:30                         ` [PATCH 13/13] basic tuning Nick Piggin
@ 2005-02-24  8:46                         ` Ingo Molnar
  2005-02-24 22:13                           ` Nick Piggin
  2005-02-25 11:07                           ` Rick Lindsley
  1 sibling, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24  8:46 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, Andi Kleen


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>  [PATCH 11/13] sched-domains aware balance-on-fork
>   [PATCH 12/13] schedstats additions for sched-balance-fork
>  [PATCH 13/13] basic tuning

STREAM numbers are tricky. It's pretty much the only benchmark that 1)
relies on being able to allocate a lot of RAM in a NUMA-friendly way and 2)
does all of its memory allocation in the first timeslice of cloned
worker threads.

There is little help we get from userspace, and i'm not sure we want to
add scheduler overhead for this single benchmark - when something like a
_tiny_ bit of NUMAlib use within the OpenMP library would probably solve
things equally well!
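
(purely to illustrate the kind of userspace hint meant here: a minimal
sketch, assuming the libnuma calls numa_available(), numa_max_node(),
numa_run_on_node() and numa_alloc_onnode(); nothing in the patchset
depends on this)

#include <stdlib.h>
#include <numa.h>

/* run each worker near the memory it is about to allocate (illustrative) */
static double *alloc_worker_buffer(int worker, size_t bytes)
{
        int node;

        if (numa_available() < 0)
                return malloc(bytes);           /* no NUMA support: plain malloc */

        node = worker % (numa_max_node() + 1);
        numa_run_on_node(node);                 /* pin this worker to that node */
        return numa_alloc_onnode(bytes, node);  /* allocate on the same node */
}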

Anyway, the code itself looks fine and it would be good if it improved
things, so:

 Acked-by: Ingo Molnar <mingo@elte.hu>

but this too needs a lot of testing, and it's the one that has the highest
likelihood of actually not making it upstream.

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-02-24  8:41                     ` [PATCH 10/13] remove aggressive idle balancing Ingo Molnar
@ 2005-02-24 12:13                       ` Nick Piggin
  2005-02-24 12:16                         ` Ingo Molnar
  2005-03-06  5:43                         ` Siddha, Suresh B
  0 siblings, 2 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24 12:13 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>> [PATCH 6/13] no aggressive idle balancing
>>
>> [PATCH 8/13] generalised CPU load averaging
>> [PATCH 9/13] less affine wakups
>> [PATCH 10/13] remove aggressive idle balancing
> 
> 
> they look fine, but these are the really scary ones :-) Maybe we could
> do #8 and #9 first, then #6+#10. But it's probably pointless to look at
> these in isolation.
> 

Oh yes, they are very scary and I guarantee they'll cause
problems :P

I didn't have any plans to get these in for 2.6.12 (2.6.13 at the
very earliest). But it will be nice if Andrew can pick these up
early so we try to get as much regression testing as possible.

I pretty much agree with your earlier breakdown of the patches (ie.
some are fixes, others fairly straightforward improvements that may get
get into 2.6.12, of course). Thanks very much for the review.

I expect to rework the patches, and things will get tuned and
changed around a bit... Any problem with you taking these now
though Andrew?

Nick



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-02-24 12:13                       ` Nick Piggin
@ 2005-02-24 12:16                         ` Ingo Molnar
  2005-03-06  5:43                         ` Siddha, Suresh B
  1 sibling, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2005-02-24 12:16 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Ingo Molnar wrote:
> >* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> >
> >>[PATCH 6/13] no aggressive idle balancing
> >>
> >>[PATCH 8/13] generalised CPU load averaging
> >>[PATCH 9/13] less affine wakups
> >>[PATCH 10/13] remove aggressive idle balancing
> >
> >
> >they look fine, but these are the really scary ones :-) Maybe we could
> >do #8 and #9 first, then #6+#10. But it's probably pointless to look at
> >these in isolation.
> >
> 
> Oh yes, they are very scary and I guarantee they'll cause
> problems :P

:-|

> I didn't have any plans to get these in for 2.6.12 (2.6.13 at the very
> earliest). But it will be nice if Andrew can pick these up early so we
> try to get as much regression testing as possible.
> 
> I pretty much agree with your earlier breakdown of the patches (ie.
> some are fixes, others fairly straightforward improvements that may get
> into 2.6.12, of course). Thanks very much for the review.
> 
> I expect to rework the patches, and things will get tuned and changed
> around a bit... Any problem with you taking these now though Andrew?

sure, fine with me.

	Ingo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 12/13] schedstats additions for sched-balance-fork
  2005-02-24  8:46                         ` [PATCH 12/13] schedstats additions for sched-balance-fork Ingo Molnar
@ 2005-02-24 22:13                           ` Nick Piggin
  2005-02-25 11:07                           ` Rick Lindsley
  1 sibling, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-24 22:13 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, Andi Kleen

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>> [PATCH 11/13] sched-domains aware balance-on-fork
>>  [PATCH 12/13] schedstats additions for sched-balance-fork
>> [PATCH 13/13] basic tuning
> 
> 
> STREAM numbers are tricky. It's pretty much the only benchmark that 1)
> relies on being able to allocate a lot of RAM in a NUMA-friendly way and 2)
> does all of its memory allocation in the first timeslice of cloned
> worker threads.
> 

I know what you mean... but this is not _just_ for STREAM. Firstly,
if we start 4 tasks on one core (of a 4 socket / 8 core system), and
just let them be moved around by the periodic balancer, they will
tend to cluster on 2 or 3 CPUs, and that will be the steady state.

> There is little help we get from userspace, and i'm not sure we want to
> add scheduler overhead for this single benchmark - when something like a
> _tiny_ bit of NUMAlib use within the OpenMP library would probably solve
> things equally well!
> 

True, for OpenMP apps (and this work shouldn't stop that from happening).
But other threaded apps are also important, and fork()ed apps can be
important too.

What I hear from the NUMA guys (POWER5, AMD) is that they really want to
keep memory controllers busy. This seems to be the best way to do it.

There are a few differences between this and when we last tried it. The
main thing is that the balancer is now sched-domains aware. I hope we
can get it to do the right thing more often (at least it is a per domain
flag, so those who don't want it don't turn it on).
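
(for reference, the per-domain control is just the SD_BALANCE_FORK bit in
the domain's flags word; sketching from the basic tuning patch (13/13), a
domain opts in with something like

        .flags                  = SD_LOAD_BALANCE
                                | SD_BALANCE_FORK       /* balance at fork/clone */
                                | SD_BALANCE_EXEC
                                | SD_WAKE_BALANCE,

and a domain that leaves SD_BALANCE_FORK out keeps the old behaviour.)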

> Anyway, the code itself looks fine and it would be good if it improved
> things, so:
> 
>  Acked-by: Ingo Molnar <mingo@elte.hu>
> 
> but this too needs a lot of testing, and it's the one that has the highest
> likelihood of actually not making it upstream.
> 

Thanks for reviewing.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/13] rework schedstats
  2005-02-24  7:18     ` [PATCH 3/13] rework schedstats Nick Piggin
  2005-02-24  7:19       ` [PATCH 4/13] find_busiest_group fixlets Nick Piggin
  2005-02-24  8:07       ` [PATCH 3/13] rework schedstats Ingo Molnar
@ 2005-02-25 10:50       ` Rick Lindsley
  2005-02-25 11:10         ` Nick Piggin
  2 siblings, 1 reply; 38+ messages in thread
From: Rick Lindsley @ 2005-02-25 10:50 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel

    I have an updated userspace parser for this thing, if you are still
    keeping it on your website.

Sure, I'd be happy to include it, thanks!  Send it along.  Is it for version
11 or version 12?

    Move balancing fields into struct sched_domain, so we can get more
    useful results on systems with multiple domains (eg SMT+SMP, CMP+NUMA,
    SMP+NUMA, etc).

It looks like you've also removed the stats for pull_task() and
wake_up_new_task().  Was this intentional, or accidental?

I can't quite get the patch to apply cleanly against several variants
of 2.6.10 or 2.6.11-rc*.  Which version is the patch for?

Rick

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 12/13] schedstats additions for sched-balance-fork
  2005-02-24  8:46                         ` [PATCH 12/13] schedstats additions for sched-balance-fork Ingo Molnar
  2005-02-24 22:13                           ` Nick Piggin
@ 2005-02-25 11:07                           ` Rick Lindsley
  2005-02-25 11:21                             ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Rick Lindsley @ 2005-02-25 11:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nick Piggin, Andrew Morton, linux-kernel, Andi Kleen

    There is little help we get from userspace, and i'm not sure we want to
    add scheduler overhead for this single benchmark - when something like a
    _tiny_ bit of NUMAlib use within the OpenMP library would probably solve
    things equally well!

There has been a general problem with sched domains trying to
meet two goals: "1) spread things around evenly within a domain and
balance across domains infrequently", and "2) load up cores before
loading up siblings, even at the expense of violating 1)".

We've had trouble getting both 1) and 2) implemented correctly in
the past.  If this patch gets us closer to that nirvana, it will be
valuable regardless of the benchmark it also happens to be improving.

Regardless, I agree it will need good testing, and we may need to
pick the wheat from the chaff.

Rick

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/13] rework schedstats
  2005-02-25 10:50       ` Rick Lindsley
@ 2005-02-25 11:10         ` Nick Piggin
  2005-02-25 11:25           ` DHCP on multi homed host! Ravindra Nadgauda
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-02-25 11:10 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: Andrew Morton, linux-kernel

Rick Lindsley wrote:
>     I have an updated userspace parser for this thing, if you are still
>     keeping it on your website.
> 
> Sure, be happy to include it, thanks!  Send it along.  Is it for version
> 11 or version 12?
> 

Version 12. I can send it to you next week. This was actually
directed at Andrew, who was hosting my parser somewhere for a
while. But it would be probably better if it just goes on your
site.

Sorry, no update to your script because I can't write perl to
save my life, let alone read it :|


>     Move balancing fields into struct sched_domain, so we can get more
>     useful results on systems with multiple domains (eg SMT+SMP, CMP+NUMA,
>     SMP+NUMA, etc).
> 
> It looks like you've also removed the stats for pull_task() and
> wake_up_new_task().  Was this intentional, or accidental?
> 

I didn't really think wunt_cnt or wunt_moved were very interesting
with respect to the scheduler. wunt_cnt is more or less just the
total number of forks, and wunt_moved covers a very rare situation where
the new task is moved before being woken up.

The balance-on-fork schedstats code does add some more interesting
stuff here.

pull_task() lost the "lost" counter because that is tricky to put
into the context of sched-domains. It also isn't important in my
current line of analysis because I'm just doing summations over
all CPUs, so in that case your "gained" was always the same as
"lost".

If you have uses for other counters, by all means send patches and
we can discuss (not that I can imagine I'd have any objections).

> I can't quite get the patch to apply cleanly against several variants
> of 2.6.10 or 2.6.11-rc*.  Which version is the patch for?
> 

It was 11-rc4, but it should fit on -rc5 OK too.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 12/13] schedstats additions for sched-balance-fork
  2005-02-25 11:07                           ` Rick Lindsley
@ 2005-02-25 11:21                             ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2005-02-25 11:21 UTC (permalink / raw)
  To: Rick Lindsley; +Cc: Ingo Molnar, Andrew Morton, linux-kernel, Andi Kleen

Rick Lindsley wrote:
>     There is little help we get from userspace, and i'm not sure we want to
>     add scheduler overhead for this single benchmark - when something like a
>     _tiny_ bit of NUMAlib use within the OpenMP library would probably solve
>     things equally well!
> 
> There has been a general problem with sched domains trying to
> meet two goals: "1) spread things around evenly within a domain and
> balance across domains infrequently", and "2) load up cores before
> loading up siblings, even at the expense of violating 1)".
> 

Yes, you hit the nail on the head. Well, the other (potentially
more problematic) part of the equation is "3) keep me close to
my parent and siblings, because we'll be likely to share memory
and/or communicate".

However: I'm hoping that on unloaded or lightly loaded NUMA
systems, it is usually the right choice to spread tasks across
nodes. Especially on the newer breed of low remote latency, high
bandwidth systems like Opterons and POWER5s.

When load ramps up a bit and we start saturating CPUs, the amount
of balance-on-forking should slow down, so we start to fulfil
requirement 3 for workloads that perhaps resemble more general
purpose server stuff.

That's the plan anyway. We'll see...


^ permalink raw reply	[flat|nested] 38+ messages in thread

* DHCP on multi homed host!
  2005-02-25 11:10         ` Nick Piggin
@ 2005-02-25 11:25           ` Ravindra Nadgauda
  0 siblings, 0 replies; 38+ messages in thread
From: Ravindra Nadgauda @ 2005-02-25 11:25 UTC (permalink / raw)
  To: linux-kernel



Hi,
   Just had a question on a multi-homed host.

   I want to have two network cards and two IPs. I want one IP to be
statically configured and the IP for the other card to be obtained by
DHCP. Is this possible? Any references?

Regards,
Ravindra N.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-02-24 12:13                       ` Nick Piggin
  2005-02-24 12:16                         ` Ingo Molnar
@ 2005-03-06  5:43                         ` Siddha, Suresh B
  2005-03-07  5:34                           ` Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Siddha, Suresh B @ 2005-03-06  5:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Andrew Morton, linux-kernel

On Thu, Feb 24, 2005 at 11:13:14PM +1100, Nick Piggin wrote:
> Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> > 
> >> [PATCH 6/13] no aggressive idle balancing
> >>
> >> [PATCH 8/13] generalised CPU load averaging
> >> [PATCH 9/13] less affine wakups
> >> [PATCH 10/13] remove aggressive idle balancing
> > 
> > 
> > they look fine, but these are the really scary ones :-) Maybe we could
> > do #8 and #9 first, then #6+#10. But it's probably pointless to look at
> > these in isolation.
> > 
> 
> Oh yes, they are very scary and I guarantee they'll cause
> problems :P

By code inspection, I see an issue with this patch
	[PATCH 10/13] remove aggressive idle balancing

Why are we removing cpu_and_siblings_are_idle check from active_load_balance?
In case of SMT, we  want to give prioritization to an idle package while
doing active_load_balance (in fact, active_load_balance will be kicked
mainly because there is an idle package) 

Just the re-addition of cpu_and_siblings_are_idle check to 
active_load_balance might not be enough. We somehow need to communicate 
this to move_tasks, otherwise can_migrate_task will fail and we will 
never be able to do active_load_balance.

thanks,
suresh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-03-06  5:43                         ` Siddha, Suresh B
@ 2005-03-07  5:34                           ` Nick Piggin
  2005-03-07  8:04                             ` Siddha, Suresh B
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-03-07  5:34 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, Andrew Morton, linux-kernel

Siddha, Suresh B wrote:

> 
> By code inspection, I see an issue with this patch
> 	[PATCH 10/13] remove aggressive idle balancing
> 
> Why are we removing cpu_and_siblings_are_idle check from active_load_balance?
> In case of SMT, we  want to give prioritization to an idle package while
> doing active_load_balance (in fact, active_load_balance will be kicked
> mainly because there is an idle package) 
> 
> Just the re-addition of cpu_and_siblings_are_idle check to 
> active_load_balance might not be enough. We somehow need to communicate 
> this to move_tasks, otherwise can_migrate_task will fail and we will 
> never be able to do active_load_balance.
> 

Active balancing should only kick in after the prescribed number
of rebalancing failures - can_migrate_task will see this, and
will allow the balancing to take place.
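
(a rough sketch of that gate, simplified and not the literal 2.6.11 code;
task_is_cache_hot() here just stands in for the real cache-hot test)

static int can_migrate_task_sketch(struct task_struct *p, runqueue_t *rq,
                                   int this_cpu, struct sched_domain *sd)
{
        if (task_running(rq, p))                        /* currently executing */
                return 0;
        if (!cpu_isset(this_cpu, p->cpus_allowed))      /* not allowed there */
                return 0;
        /* cache-hot tasks only move after enough failed balance attempts */
        if (sd->nr_balance_failed <= sd->cache_nice_tries &&
                        task_is_cache_hot(p, sd))
                return 0;
        return 1;
}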

That said, we currently aren't doing _really_ well for SMT on
some workloads, however with this patch we are heading in the
right direction I think.

I have been mainly looking at tuning CMP Opterons recently (they
are closer to a "traditional" SMP+NUMA than SMT, when it comes
to the scheduler's point of view). However, in earlier revisions
of the patch I had been looking at SMT performance and was able
to get it much closer to perfect:

I was working on a 4 socket x440 with HT. The problem area is
usually when the load is lower than the number of logical CPUs.
So on tbench, we do, say, 450MB/s with 4 or more threads without
HT, and 550MB/s with 8 or more threads with HT, however we only
do 300MB/s with 4 threads.

Those aren't the exact numbers, but that's basically what they
look like. Now I was able to bring the 4 thread + HT case much
closer to the 4 thread - HT numbers, but with earlier patchsets.
When I get a chance I will do more tests on the HT system, but
the x440 is infuriating for fine tuning performance, because it
is a NUMA system, but it doesn't tell the kernel about it, so
it will randomly schedule things on "far away" CPUs, and results
vary.

PS. Another thing I would like to see tested is a 3 level domain
setup (SMT + SMP + NUMA). I don't have access to one though.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-03-07  5:34                           ` Nick Piggin
@ 2005-03-07  8:04                             ` Siddha, Suresh B
  2005-03-07  8:28                               ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Siddha, Suresh B @ 2005-03-07  8:04 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Siddha, Suresh B, Ingo Molnar, Andrew Morton, linux-kernel

Nick,

On Mon, Mar 07, 2005 at 04:34:18PM +1100, Nick Piggin wrote:
> Siddha, Suresh B wrote:
> 
> > 
> > By code inspection, I see an issue with this patch
> > 	[PATCH 10/13] remove aggressive idle balancing
> > 
> > Why are we removing cpu_and_siblings_are_idle check from active_load_balance?
> > In case of SMT, we  want to give prioritization to an idle package while
> > doing active_load_balance (in fact, active_load_balance will be kicked
> > mainly because there is an idle package) 
> > 
> > Just the re-addition of cpu_and_siblings_are_idle check to 
> > active_load_balance might not be enough. We somehow need to communicate 
> > this to move_tasks, otherwise can_migrate_task will fail and we will 
> > never be able to do active_load_balance.
> > 
> 
> Active balancing should only kick in after the prescribed number
> of rebalancing failures - can_migrate_task will see this, and
> will allow the balancing to take place.

We are resetting the nr_balance_failed to cache_nice_tries after kicking 
active balancing. But can_migrate_task will succeed only if
nr_balance_failed > cache_nice_tries.

> 
> That said, we currently aren't doing _really_ well for SMT on
> some workloads, however with this patch we are heading in the
> right direction I think.

Let's take an example of three packages with two logical threads each.
Assume P0 is loaded with two processes(one in each logical thread), 
P1 contains only one process and P2 is idle.

In this example, active balance will be kicked on one of the threads(assume
thread 0) in P0, which then should find an idle package and move it to 
one of the idle threads in P2.

With your current patch, the idle package check in active_load_balance has
disappeared, and we may end up moving the process from thread 0 to thread 1
in P0.  I can't really make logic out of the active_load_balance code 
after your patch 10/13

> 
> I have been mainly looking at tuning CMP Opterons recently (they
> are closer to a "traditional" SMP+NUMA than SMT, when it comes
> to the scheduler's point of view). However, in earlier revisions
> of the patch I had been looking at SMT performance and was able
> to get it much closer to perfect:
> 

I am reasonably sure that the removal of cpu_and_siblings_are_idle check
from active_load_balance will cause HT performance regressions.

> I was working on a 4 socket x440 with HT. The problem area is
> usually when the load is lower than the number of logical CPUs.
> So on tbench, we do, say, 450MB/s with 4 or more threads without
> HT, and 550MB/s with 8 or more threads with HT, however we only
> do 300MB/s with 4 threads.

Are you saying 2.6.11 has this problem?

> 
> Those aren't the exact numbers, but that's basically what they
> look like. Now I was able to bring the 4 thread + HT case much
> closer to the 4 thread - HT numbers, but with earlier patchsets.
> When I get a chance I will do more tests on the HT system, but
> the x440 is infuriating for fine tuning performance, because it
> is a NUMA system, but it doesn't tell the kernel about it, so
> it will randomly schedule things on "far away" CPUs, and results
> vary.

Why don't you use any other simple HT+SMP system?

I will also do some performance analysis with your other patches
on some of the systems that I have access to.

thanks,
suresh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-03-07  8:04                             ` Siddha, Suresh B
@ 2005-03-07  8:28                               ` Nick Piggin
  2005-03-08  7:22                                 ` Siddha, Suresh B
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-03-07  8:28 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, Andrew Morton, linux-kernel

Siddha, Suresh B wrote:
> Nick,
> 
> On Mon, Mar 07, 2005 at 04:34:18PM +1100, Nick Piggin wrote:
> 
>>
>>Active balancing should only kick in after the prescribed number
>>of rebalancing failures - can_migrate_task will see this, and
>>will allow the balancing to take place.
> 
> 
> We are resetting the nr_balance_failed to cache_nice_tries after kicking 
> active balancing. But can_migrate_task will succeed only if
> nr_balance_failed > cache_nice_tries.
> 

It is indeed, thanks for catching that. We should probably make it
reset the count to the point where it will start moving cache hot
tasks (ie. cache_nice_tries+1).
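
(i.e. roughly, after kicking active balancing, something like

        sd->nr_balance_failed = sd->cache_nice_tries + 1;

so the next can_migrate_task() check is actually allowed to move a
cache-hot task.)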

I'll look at that and send Andrew a patch.

> 
>>That said, we currently aren't doing _really_ well for SMT on
>>some workloads, however with this patch we are heading in the
>>right direction I think.
> 
> 
> Let's take an example of three packages with two logical threads each.
> Assume P0 is loaded with two processes(one in each logical thread), 
> P1 contains only one process and P2 is idle.
> 
> In this example, active balance will be kicked on one of the threads(assume
> thread 0) in P0, which then should find an idle package and move it to 
> one of the idle threads in P2.
> 
> With your current patch, the idle package check in active_load_balance has
> disappeared, and we may end up moving the process from thread 0 to thread 1
> in P0.  I can't really make logic out of the active_load_balance code 
> after your patch 10/13
> 

Ah yep, right you are there, too. I obviously hadn't looked closely
enough at the recent active_load_balance patches that had gone in :(
What we should probably do is heed the "push_cpu" prescription (push_cpu
is now unused).

I think active_load_balance is too complex at the moment, but still
too dumb to really make the right choice here over the full range of
domains. What we need to do is pass in some more info from load_balance,
so active_load_balance doesn't need any "smarts".

Thanks for pointing this out too. I'll make a patch.

> 
>>I have been mainly looking at tuning CMP Opterons recently (they
>>are closer to a "traditional" SMP+NUMA than SMT, when it comes
>>to the scheduler's point of view). However, in earlier revisions
>>of the patch I had been looking at SMT performance and was able
>>to get it much closer to perfect:
>>
> 
> 
> I am reasonably sure that the removal of cpu_and_siblings_are_idle check
> from active_load_balance will cause HT performance regressions.
> 

Yep.

> 
>>I was working on a 4 socket x440 with HT. The problem area is
>>usually when the load is lower than the number of logical CPUs.
>>So on tbench, we do, say, 450MB/s with 4 or more threads without
>>HT, and 550MB/s with 8 or more threads with HT, however we only
>>do 300MB/s with 4 threads.
> 
> 
> Are you saying 2.6.11 has this problem?
> 

I think so. I'll have a look at it again.

> 
>>Those aren't the exact numbers, but that's basically what they
>>look like. Now I was able to bring the 4 thread + HT case much
>>closer to the 4 thread - HT numbers, but with earlier patchsets.
>>When I get a chance I will do more tests on the HT system, but
>>the x440 is infuriating for fine tuning performance, because it
>>is a NUMA system, but it doesn't tell the kernel about it, so
>>it will randomly schedule things on "far away" CPUs, and results
>>vary.
> 
> 
> Why don't you use any other simple HT+SMP system?
> 

Yes I will, of course. Some issues can become more pronounced
with more physical CPUs, but the main reason is that the x440
is the only one with HT at work where I was doing testing.

To be honest I hadn't looked hard enough at the HT issues yet
as you've noticed. So thanks for the review and I'll fix things
up.

> I will also do some performance analysis with your other patches
> on some of the systems that I have access to.
> 

Thanks.

Nick


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-03-07  8:28                               ` Nick Piggin
@ 2005-03-08  7:22                                 ` Siddha, Suresh B
  2005-03-08  8:17                                   ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Siddha, Suresh B @ 2005-03-08  7:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Siddha, Suresh B, Ingo Molnar, Andrew Morton, linux-kernel

Nick,

On Mon, Mar 07, 2005 at 07:28:23PM +1100, Nick Piggin wrote:
> Siddha, Suresh B wrote:
> > We are resetting the nr_balance_failed to cache_nice_tries after kicking 
> > active balancing. But can_migrate_task will succeed only if
> > nr_balance_failed > cache_nice_tries.
> > 
> 
> It is indeed, thanks for catching that. We should probably make it
> reset the count to the point where it will start moving cache hot
> tasks (ie. cache_nice_tries+1).

That still might not be enough. We probably need to pass push_cpu's
sd to move_tasks call in active_load_balance, instead of current busiest_cpu's
sd. Just like push_cpu, we need to add one more field to the runqueue which 
will specify the domain level of the push_cpu at which we have an imbalance.
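
(something along these lines, with the new field name purely illustrative:

        struct runqueue {
                /* existing fields elided */
                int push_cpu;
                struct sched_domain *push_sd;   /* domain of the imbalance */
        };

so active_load_balance() could hand push_sd to move_tasks() rather than
the busiest cpu's own domain.)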

> 
> I'll look at that and send Andrew a patch.
> 
> > 
> >>That said, we currently aren't doing _really_ well for SMT on
> >>some workloads, however with this patch we are heading in the
> >>right direction I think.
> > 
> > 
> > Let's take an example of three packages with two logical threads each.
> > Assume P0 is loaded with two processes(one in each logical thread), 
> > P1 contains only one process and P2 is idle.
> > 
> > In this example, active balance will be kicked on one of the threads(assume
> > thread 0) in P0, which then should find an idle package and move it to 
> > one of the idle threads in P2.
> > 
> > With your current patch, the idle package check in active_load_balance has
> > disappeared, and we may end up moving the process from thread 0 to thread 1
> > in P0.  I can't really make logic out of the active_load_balance code 
> > after your patch 10/13
> > 
> 
> Ah yep, right you are there, too. I obviously hadn't looked closely
> enough at the recent active_load_balance patches that had gone in :(
> What we should probably do is heed the "push_cpu" prescription (push_cpu
> is now unused).

push_cpu might not be the ideal destination in all cases. Take a NUMA domain
above SMT+SMP domains in my above example. Assume P0, P1 is in node-0 and
P2, P3 in node-1. Assume Loads of P0,P1,P2 are same as the above example,with P3
containing one process load. Now any idle thread in P2 or P3 can trigger
active load balance on P0. We should be selecting thread in P2 ideally
(currently this is what we get with idle package check). But with push_cpu,
we might move to the idle thread in P3 and then finally move to P2(it will be a
two step process)

thanks,
suresh

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-03-08  7:22                                 ` Siddha, Suresh B
@ 2005-03-08  8:17                                   ` Nick Piggin
  2005-03-08 19:36                                     ` Siddha, Suresh B
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2005-03-08  8:17 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Ingo Molnar, Andrew Morton, linux-kernel

Siddha, Suresh B wrote:
> Nick,
> 
> On Mon, Mar 07, 2005 at 07:28:23PM +1100, Nick Piggin wrote:
> 
>>Siddha, Suresh B wrote:
>>
>>>We are resetting the nr_balance_failed to cache_nice_tries after kicking 
>>>active balancing. But can_migrate_task will succeed only if
>>>nr_balance_failed > cache_nice_tries.
>>>
>>
>>It is indeed, thanks for catching that. We should probably make it
>>reset the count to the point where it will start moving cache hot
>>tasks (ie. cache_nice_tries+1).
> 
> 
> That still might not be enough. We probably need to pass push_cpu's
> sd to move_tasks call in active_load_balance, instead of current busiest_cpu's
> sd. Just like push_cpu, we need to add one more field to the runqueue which 
> will specify the domain level of the push_cpu at which we have an imbalance.
> 

It should be the lowest domain level that spans both this_cpu and
push_cpu, and has the SD_BALANCE flag set. We could possibly be a bit
more general here, but so long as nobody is coming up with weird and
wonderful sched_domains schemes, push_cpu should give you all the info
needed.
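
(sketch only, assuming SD_LOAD_BALANCE is the flag meant: walk up from
this_cpu until a balancing domain also spans push_cpu, e.g.

        struct sched_domain *sd;

        for_each_domain(this_cpu, sd)
                if ((sd->flags & SD_LOAD_BALANCE) &&
                                cpu_isset(push_cpu, sd->span))
                        break;
        /* sd is now the lowest such domain, or NULL if nothing spans both */

which is all the information active_load_balance() would need.)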

>>Ah yep, right you are there, too. I obviously hadn't looked closely
>>enough at the recent active_load_balance patches that had gone in :(
>>What we should probably do is heed the "push_cpu" prescription (push_cpu
>>is now unused).
> 
> 
> push_cpu might not be the ideal destination in all cases. Take a NUMA domain
> above SMT+SMP domains in my above example. Assume P0, P1 is in node-0 and
> P2, P3 in node-1. Assume Loads of P0,P1,P2 are same as the above example,with P3
> containing one process load. Now any idle thread in P2 or P3 can trigger
> active load balance on P0. We should be selecting thread in P2 ideally
> (currently this is what we get with idle package check). But with push_cpu,
> we might move to the idle thread in P3 and then finally move to P2(it will be a
> two step process)
> 

Hmm yeah. It is a bit tricky. We don't currently do exceptionally well
at this sort of "balancing over multiple domains" in the
periodic balancer either.

But at this stage I prefer to not get overly complex, and allow some
imperfect task movement, because it should rarely be a problem, and is
much better than it was before. The main place where it can go wrong
is multi-level NUMA balancing, where moving a task twice (between
different nodes) can cause more problems.

Nick


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 10/13] remove aggressive idle balancing
  2005-03-08  8:17                                   ` Nick Piggin
@ 2005-03-08 19:36                                     ` Siddha, Suresh B
  0 siblings, 0 replies; 38+ messages in thread
From: Siddha, Suresh B @ 2005-03-08 19:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Siddha, Suresh B, Ingo Molnar, Andrew Morton, linux-kernel

On Tue, Mar 08, 2005 at 07:17:31PM +1100, Nick Piggin wrote:
> Siddha, Suresh B wrote:
> > That still might not be enough. We probably need to pass push_cpu's
> > sd to move_tasks call in active_load_balance, instead of current busiest_cpu's
> > sd. Just like push_cpu, we need to add one more field to the runqueue which 
> > will specify the domain level of the push_cpu at which we have an imbalance.
> > 
> 
> It should be the lowest domain level that spans both this_cpu and
> push_cpu, and has the SD_BALANCE flag set. We could possibly be a bit

I agree.

> more general here, but so long as nobody is coming up with weird and
> wonderful sched_domains schemes, push_cpu should give you all the info
> needed.
> 
> > push_cpu might not be the ideal destination in all cases. Take a NUMA domain
> > above SMT+SMP domains in my above example. Assume P0, P1 is in node-0 and
> > P2, P3 in node-1. Assume Loads of P0,P1,P2 are same as the above example,with P3
> > containing one process load. Now any idle thread in P2 or P3 can trigger
> > active load balance on P0. We should be selecting thread in P2 ideally
> > (currently this is what we get with idle package check). But with push_cpu,
> > we might move to the idle thread in P3 and then finally move to P2(it will be a
> > two step process)
> > 
> 
> Hmm yeah. It is a bit tricky. We don't currently do exceptionally well
> at this sort of "balancing over multiple domains" in the
> periodic balancer either.

With the periodic balancer, moved tasks will not be actively running,
and by the time they get a cpu slot, they will most probably be on the
correct cpu (though "most probably" is the key word here ;-)

> But at this stage I prefer to not get overly complex, and allow some
> imperfect task movement, because it should rarely be a problem, and is
> much better than it was before. The main place where it can go wrong
> is multi-level NUMA balancing, where moving a task twice (between
> different nodes) can cause more problems.

With active_load_balance, we will be moving the currently running
process. So obviously if we can move it in a single step, that will be nice.

I agree with the complexity part though. And it will become more complex
with upcoming dual-core requirements.

thanks,
suresh

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2005-03-08 20:27 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-02-24  7:14 [PATCH 0/13] Multiprocessor CPU scheduler patches Nick Piggin
2005-02-24  7:16 ` [PATCH 1/13] timestamp fixes Nick Piggin
2005-02-24  7:16   ` [PATCH 2/13] improve pinned task handling Nick Piggin
2005-02-24  7:18     ` [PATCH 3/13] rework schedstats Nick Piggin
2005-02-24  7:19       ` [PATCH 4/13] find_busiest_group fixlets Nick Piggin
2005-02-24  7:20         ` [PATCH 5/13] find_busiest_group cleanup Nick Piggin
2005-02-24  7:21           ` [PATCH 6/13] no aggressive idle balancing Nick Piggin
2005-02-24  7:22             ` [PATCH 7/13] better active balancing heuristic Nick Piggin
2005-02-24  7:24               ` [PATCH 8/13] generalised CPU load averaging Nick Piggin
2005-02-24  7:25                 ` [PATCH 9/13] less affine wakups Nick Piggin
2005-02-24  7:27                   ` [PATCH 10/13] remove aggressive idle balancing Nick Piggin
2005-02-24  7:28                     ` [PATCH 11/13] sched-domains aware balance-on-fork Nick Piggin
2005-02-24  7:28                       ` [PATCH 12/13] schedstats additions for sched-balance-fork Nick Piggin
2005-02-24  7:30                         ` [PATCH 13/13] basic tuning Nick Piggin
2005-02-24  8:46                         ` [PATCH 12/13] schedstats additions for sched-balance-fork Ingo Molnar
2005-02-24 22:13                           ` Nick Piggin
2005-02-25 11:07                           ` Rick Lindsley
2005-02-25 11:21                             ` Nick Piggin
2005-02-24  8:41                     ` [PATCH 10/13] remove aggressive idle balancing Ingo Molnar
2005-02-24 12:13                       ` Nick Piggin
2005-02-24 12:16                         ` Ingo Molnar
2005-03-06  5:43                         ` Siddha, Suresh B
2005-03-07  5:34                           ` Nick Piggin
2005-03-07  8:04                             ` Siddha, Suresh B
2005-03-07  8:28                               ` Nick Piggin
2005-03-08  7:22                                 ` Siddha, Suresh B
2005-03-08  8:17                                   ` Nick Piggin
2005-03-08 19:36                                     ` Siddha, Suresh B
2005-02-24  8:39               ` [PATCH 7/13] better active balancing heuristic Ingo Molnar
2005-02-24  8:36         ` [PATCH 4/13] find_busiest_group fixlets Ingo Molnar
2005-02-24  8:07       ` [PATCH 3/13] rework schedstats Ingo Molnar
2005-02-25 10:50       ` Rick Lindsley
2005-02-25 11:10         ` Nick Piggin
2005-02-25 11:25           ` DHCP on multi homed host! Ravindra Nadgauda
2005-02-24  8:04     ` [PATCH 2/13] improve pinned task handling Ingo Molnar
2005-02-24  7:46   ` [PATCH 1/13] timestamp fixes Ingo Molnar
2005-02-24  7:56     ` Nick Piggin
2005-02-24  8:34       ` Ingo Molnar
