* [patch] scheduler: active_load_balance fixes
@ 2004-10-22 23:36 Darren Hart
  2004-10-23  5:30 ` Nick Piggin
  2004-10-23 10:20 ` Ingo Molnar
  0 siblings, 2 replies; 6+ messages in thread
From: Darren Hart @ 2004-10-22 23:36 UTC (permalink / raw)
  To: lkml
  Cc: Matt Dobson, Martin J Bligh, Rick Lindsley, Andrew Morton, Nick Piggin

The following patch against the latest mm fixes several problems with
active_load_balance().

Rather than starting with the highest allowable domain (the highest with
SD_LOAD_BALANCE still set) and depending on the order of the cpu groups,
we start at the lowest domain and work up until we find a suitable CPU
or run out of options (SD_LOAD_BALANCE is no longer set).  This approach
is more robust: it is explicit and not subject to the construction order
of the cpu groups.

We move the test for busiest_rq->nr_running <= 1 into the domain loop so
we don't keep trying to move tasks when there are none left to move.
This new logic (testing nr_running inside the domain loop) should make
the busiest_rq == target_rq condition truly impossible, so we have
replaced the graceful continue-on-failure with a BUG_ON.  (Bjorn
Helgaas, please confirm.)

We no longer exclude the busiest_cpu's group from the pool of groups to
push to: it is the ideal group to push to, even if it is unlikely to be
available.  Note that removing the group == busy_group test and letting
that group be checked for suitability like any other leaves the running
time nearly unchanged.

We no longer force the destination CPU to be in a group of completely
idle CPUs, nor to be the last in that group.
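
In outline, the new walk behaves like the simplified user-space model
below.  This is an illustrative sketch only, not kernel code: domain,
idle[] and try_move_task() here are mock stand-ins for sched_domain,
idle_cpu() and the locked move_tasks() call, and the sched_group
iteration is flattened into a plain span scan.

/*
 * Simplified user-space model of the new active_load_balance() walk.
 * All names here are illustrative stand-ins, not kernel APIs.
 */
#include <stdio.h>

#define SD_LOAD_BALANCE	0x01
#define NCPUS		4

struct domain {
	int flags;
	int span[NCPUS];	/* 1 if the cpu is in this domain */
	struct domain *parent;	/* next higher domain, NULL at the top */
};

static int idle[NCPUS] = { 0, 1, 0, 1 };	/* mock per-cpu idle state */
static int nr_running = 3;			/* tasks on the busiest cpu */

/* stand-in for the locked move_tasks() call in the real code */
static void try_move_task(int from, int to)
{
	printf("push task: cpu%d -> cpu%d\n", from, to);
	nr_running--;
}

static void active_balance_model(struct domain *sd, int busiest_cpu)
{
	int visited[NCPUS] = { 0 };
	int cpu;

	/* walk from the lowest domain upward, as the patch does */
	for (; sd; sd = sd->parent) {
		if (!(sd->flags & SD_LOAD_BALANCE) || nr_running <= 1)
			break;	/* out of domains, or out of tasks to move */
		for (cpu = 0; cpu < NCPUS; cpu++) {
			if (!sd->span[cpu])
				continue;	/* not covered by this domain */
			if (visited[cpu] || cpu == busiest_cpu || !idle[cpu]) {
				visited[cpu] = 1;
				continue;	/* not a suitable target */
			}
			visited[cpu] = 1;
			try_move_task(busiest_cpu, cpu);
			if (nr_running <= 1)
				return;
		}
	}
}

int main(void)
{
	struct domain top  = { SD_LOAD_BALANCE, { 1, 1, 1, 1 }, NULL };
	struct domain base = { SD_LOAD_BALANCE, { 1, 1, 0, 0 }, &top };

	active_balance_model(&base, 0);	/* cpu0 is the busiest */
	return 0;
}

The real code below differs in the locking, the group walk within each
domain, and the SMT sibling test, but the termination conditions are the
same: stop once SD_LOAD_BALANCE is no longer set or once the busiest
runqueue is down to one task.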


Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
---

 sched.c |  123 +++++++++++++++++++++++++++++++++++-----------------------------
 1 files changed, 69 insertions(+), 54 deletions(-)


diff -purN -X /home/mbligh/.diff.exclude /home/linux/views/linux-2.6.9-mm1/kernel/sched.c 2.6.9-mm1-active_balance/kernel/sched.c
--- /home/linux/views/linux-2.6.9-mm1/kernel/sched.c	2004-10-22 11:21:46.000000000 -0700
+++ 2.6.9-mm1-active_balance/kernel/sched.c	2004-10-22 12:08:49.000000000 -0700
@@ -2062,70 +2062,85 @@ static inline void idle_balance(int this
 	}
 }
 
+#ifdef CONFIG_SCHED_SMT
+static int cpu_and_siblings_are_idle(int cpu)
+{
+	int sib;
+	for_each_cpu_mask(sib, cpu_sibling_map[cpu]) {
+		if (idle_cpu(sib))
+			continue;
+		return 0;
+	}
+
+	return 1;
+}
+#else
+#define cpu_and_siblings_are_idle(A) idle_cpu(A)
+#endif
+
+
 /*
- * active_load_balance is run by migration threads. It pushes a running
- * task off the cpu. It can be required to correctly have at least 1 task
- * running on each physical CPU where possible, and not have a physical /
- * logical imbalance.
+ * active_load_balance is run by migration threads. It pushes running tasks
+ * off the busiest CPU onto idle CPUs. It requires at least 1 task to be
+ * running on each physical CPU where possible, and avoids physical /
+ * logical imbalances.
  *
- * Called with busiest locked.
+ * Called with busiest_rq locked.
  */
-static void active_load_balance(runqueue_t *busiest, int busiest_cpu)
+static void active_load_balance(runqueue_t *busiest_rq, int busiest_cpu)
 {
 	struct sched_domain *sd;
-	struct sched_group *group, *busy_group;
-	int i;
-
-	schedstat_inc(busiest, alb_cnt);
-	if (busiest->nr_running <= 1)
-		return;
-
-	for_each_domain(busiest_cpu, sd)
-		if (cpu_isset(busiest->push_cpu, sd->span))
-			break;
-	if (!sd)
-		return;
-
-	group = sd->groups;
-	while (!cpu_isset(busiest_cpu, group->cpumask))
-		group = group->next;
-	busy_group = group;
+	struct sched_group *cpu_group;
+	cpumask_t visited_cpus;
 
-	group = sd->groups;
-	do {
-		runqueue_t *rq;
-		int push_cpu = 0;
-
-		if (group == busy_group)
-			goto next_group;
+	schedstat_inc(busiest_rq, alb_cnt);
+	/*
+	 * Search for suitable CPUs to push tasks to in successively higher
+	 * domains with SD_LOAD_BALANCE set.
+	 */ 
+	visited_cpus = CPU_MASK_NONE;
+	for_each_domain(busiest_cpu, sd) {
+		if (!(sd->flags & SD_LOAD_BALANCE) || busiest_rq->nr_running <= 1) 
+			break; /* no more domains to search or no more tasks to move */
+
+		cpu_group = sd->groups;
+		do { /* sched_groups should either use list_heads or be merged into the domains structure */
+			int cpu, target_cpu = -1;
+			runqueue_t *target_rq;
+
+			for_each_cpu_mask(cpu, cpu_group->cpumask) {
+				if (cpu_isset(cpu, visited_cpus) || cpu == busiest_cpu || 
+				    !cpu_and_siblings_are_idle(cpu)) {
+					cpu_set(cpu, visited_cpus);
+					continue;
+				}
+				target_cpu = cpu;
+				break;
+			}
+			if (target_cpu == -1)
+				goto next_group; /* failed to find a suitable target cpu in this domain */
 
-		for_each_cpu_mask(i, group->cpumask) {
-			if (!idle_cpu(i))
-				goto next_group;
-			push_cpu = i;
-		}
+			target_rq = cpu_rq(target_cpu);
 
-		rq = cpu_rq(push_cpu);
+			/*
+			 * This condition is "impossible"; if it occurs we need to fix it.
+			 * Reported by Bjorn Helgaas on a 128-cpu setup.
+			 */
+			BUG_ON(busiest_rq == target_rq);
 
-		/*
-		 * This condition is "impossible", but since load
-		 * balancing is inherently a bit racy and statistical,
-		 * it can trigger.. Reported by Bjorn Helgaas on a
-		 * 128-cpu setup.
-		 */
-		if (unlikely(busiest == rq))
-			goto next_group;
-		double_lock_balance(busiest, rq);
-		if (move_tasks(rq, push_cpu, busiest, 1, sd, SCHED_IDLE)) {
-			schedstat_inc(busiest, alb_lost);
-			schedstat_inc(rq, alb_gained);
-		} else {
-			schedstat_inc(busiest, alb_failed);
-		}
-		spin_unlock(&rq->lock);
+			/* move a task from busiest_rq to target_rq */
+			double_lock_balance(busiest_rq, target_rq);
+			if (move_tasks(target_rq, target_cpu, busiest_rq, 1, sd, SCHED_IDLE)) {
+				schedstat_inc(busiest_rq, alb_lost);
+				schedstat_inc(target_rq, alb_gained);
+			} else {
+				schedstat_inc(busiest_rq, alb_failed);
+			}
+			spin_unlock(&target_rq->lock);
 next_group:
-		group = group->next;
-	} while (group != sd->groups);
+			cpu_group = cpu_group->next;
+		} while (cpu_group != sd->groups && busiest_rq->nr_running > 1);
+	}
 }
 
 /*



* Re: [patch] scheduler: active_load_balance fixes
  2004-10-22 23:36 [patch] scheduler: active_load_balance fixes Darren Hart
@ 2004-10-23  5:30 ` Nick Piggin
  2004-10-24  9:37   ` Andrew Morton
  2004-10-23 10:20 ` Ingo Molnar
  1 sibling, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2004-10-23  5:30 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, Matt Dobson, Martin J Bligh, Rick Lindsley, Andrew Morton



Darren Hart wrote:

>The following patch against the latest mm fixes several problems with
>active_load_balance().
>
>

This seems much better.  Andrew, can you put this into -mm please?

Thanks
Nick




* Re: [patch] scheduler: active_load_balance fixes
  2004-10-22 23:36 [patch] scheduler: active_load_balance fixes Darren Hart
  2004-10-23  5:30 ` Nick Piggin
@ 2004-10-23 10:20 ` Ingo Molnar
  1 sibling, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2004-10-23 10:20 UTC (permalink / raw)
  To: Darren Hart
  Cc: lkml, Matt Dobson, Martin J Bligh, Rick Lindsley, Andrew Morton,
	Nick Piggin


* Darren Hart <dvhltc@us.ibm.com> wrote:

> The following patch against the latest mm fixes several problems with
> active_load_balance().

i definitely like this patch, i'd vote to give it a test-drive in -mm.

	Ingo


* Re: [patch] scheduler: active_load_balance fixes
  2004-10-23  5:30 ` Nick Piggin
@ 2004-10-24  9:37   ` Andrew Morton
  2004-10-25 16:58     ` Darren Hart
  2004-10-25 22:37     ` Darren Hart
  0 siblings, 2 replies; 6+ messages in thread
From: Andrew Morton @ 2004-10-24  9:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: dvhltc, linux-kernel, colpatch, mbligh, ricklind

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> 
> 
> Darren Hart wrote:
> 
> >The following patch against the latest mm fixes several problems with
> >active_load_balance().
> >
> >
> 
> This seems much better.  Andrew, can you put this into -mm please?
> 

Whenever we touch the load balancing we get sad little reports about
performance regressions two months later.  How do we gain confidence in
this change?


* Re: [patch] scheduler: active_load_balance fixes
  2004-10-24  9:37   ` Andrew Morton
@ 2004-10-25 16:58     ` Darren Hart
  2004-10-25 22:37     ` Darren Hart
  1 sibling, 0 replies; 6+ messages in thread
From: Darren Hart @ 2004-10-25 16:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, lkml, Matt Dobson, Martin J Bligh, Rick Lindsley

On Sun, 2004-10-24 at 02:37 -0700, Andrew Morton wrote:
> Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> > 
> > 
> > Darren Hart wrote:
> > 
> > >The following patch against the latest mm fixes several problems with
> > >active_load_balance().
> > >
> > >
> > 
> > This seems much better.  Andrew, can you put this into -mm please?
> > 
> 
> Whenever we touch the load balancing we get sad little reports about
> performance regressions two months later.  How do we gain confidence in
> this change?
> 

I did run a kernbench test but forgot to include the results; my
apologies.  On a 16-way NUMA machine the new code was slightly faster.
The new code basically does what I believe the original code was
intended to do; it doesn't take a radically new approach to load
balancing.  It closes some corner cases (such as continuing to balance
after the runqueue is empty) and removes some fragile code (such as the
dependency on the order of sched_group construction).  It also removes
some of the artificial limits imposed by the old code, like always
moving tasks to the last CPU of a completely idle group.
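
For reference, these are the guards in the patch that close those two
holes (quoted from the diff above):

	for_each_domain(busiest_cpu, sd) {
		if (!(sd->flags & SD_LOAD_BALANCE) || busiest_rq->nr_running <= 1)
			break; /* no more domains to search or no more tasks to move */
		...
		} while (cpu_group != sd->groups && busiest_rq->nr_running > 1);
	}

Both the domain loop and the inner group loop stop as soon as the
busiest runqueue is down to one task, and the explicit bottom-up domain
walk, rather than the order of sched_group construction, decides where
to look next.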

I will run some more tests on this code today to improve our confidence.
It would help if others could do the same, particularly those who have
experienced balancing problems in the past.

Thanks,

-- 
Darren Hart
IBM, Linux Technology Center
503 578 3185
dvhltc@us.ibm.com


* Re: [patch] scheduler: active_load_balance fixes
  2004-10-24  9:37   ` Andrew Morton
  2004-10-25 16:58     ` Darren Hart
@ 2004-10-25 22:37     ` Darren Hart
  1 sibling, 0 replies; 6+ messages in thread
From: Darren Hart @ 2004-10-25 22:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, lkml, Matt Dobson, Martin J Bligh, Rick Lindsley

On Sun, 2004-10-24 at 02:37 -0700, Andrew Morton wrote:
> Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> > 
> > 
> > Darren Hart wrote:
> > 
> > >The following patch against the latest mm fixes several problems with
> > >active_load_balance().
> > >
> > >
> > 
> > This seems much better.  Andrew, can you put this into -mm please?
> > 
> 
> Whenever we touch the load balancing we get sad little reports about
> performance regressions two months later.  How do we gain confidence in
> this change?
> 

I ran kernbench and SpecJBB on a 16-way Xeon HT NUMA machine (32 sibling
CPUs in total) and an 8-way ppc64 machine, comparing 2.6.9-mm1 with and
without my active_load_balance() patch.  Kernbench was marginally faster
on each machine, and SpecJBB performed better on 64% of the tests.
SpecJBB is a bit erratic anyway, so I feel good about these numbers.


Kernbench results below. (2.6.9-mm1-ab is the run with the
active_load_balance patch).

32-way Xeon
2.6.9-mm1
	Elapsed: 81.444s User: 1044.06s System: 138.008s CPU: 1451.2%
2.6.9-mm1-ab
	Elapsed: 81.372s User: 1037.842s System: 139.134s CPU: 1446%

8-way ppc64
2.6.9-mm1
	Elapsed: 53.336s User: 352.932s System: 45.302s CPU: 746%
2.6.9-mm1-ab
	Elapsed: 53.24s User: 353.096s System: 44.98s CPU: 747%

-- 
Darren Hart
IBM, Linux Technology Center
503 578 3185
dvhltc@us.ibm.com

