* [RFC] Patch: dynticks: idle load balancing
@ 2006-12-11 23:53 Siddha, Suresh B
  2006-12-13 22:43 ` Ingo Molnar
  2006-12-20  0:49 ` Steven Rostedt
  0 siblings, 2 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2006-12-11 23:53 UTC (permalink / raw)
  To: mingo, nickpiggin, vatsa, clameter, tglx, arjan; +Cc: linux-kernel

Appended patch attempts to fix the process idle load balancing in the
presence of dynticks. cpus for which ticks are stopped will sleep
till the next event wakes them up. Potentially these sleeps can be for large
durations, during which today there is no idle load balancing being done.
There was some discussion (last year) on this topic on lkml, where two
main approaches were debated. One is to back off the idle load
balancing to bigger intervals, and the second is a watchdog mechanism where
the busy cpu will trigger the load balance on an idle cpu.  Both of these
mechanisms have their drawbacks.

For the first mechanism, defining the interval will be tricky: if it is too
large, the response time will also be high and we won't be able to respond
to sudden changes in the load; if it is too small, we won't be able to save
power.

The second mechanism requires changes to the busy load balancing (which will
then be doing more load balancing work while the current busy task on that cpu
is eagerly waiting for cpu cycles). Busy load balancing intervals are also
quite different from idle load balancing intervals. As with the first
mechanism, we won't be able to respond quickly to changes in load. And
figuring out that a cpu is heavily loaded and where that extra load needs to
be moved is a somewhat difficult job, especially so in the case of
hierarchical scheduler domains.

Appended patch takes a third route, which nominates an owner among
the idle cpus; that owner does the idle load balancing on behalf of the other
idle cpus. And once all the cpus are completely idle, we can stop
this idle load balancing too. Checks added in the fast path are minimized.
Whenever there are busy cpus in the system, there will be an owner (idle cpu)
doing the system-wide idle load balancing. If we nominate this owner
carefully (like an idle core in a busy package), we can also minimize the
power wasted.

Some of the questions I have: will this single owner become a bottleneck?
Idle load balancing is now serialized among all the idle cpus. This perhaps
will add some delays in load movement to different idle cpus. IMO, these
delays will be small and tolerable. If this turns out to be a concern, we
can offload the actual load movement work to the idle cpu where the load
will finally run.

Any more optimizations we can do to start/stop_sched_tick() routines to track
this info more efficiently?

Comments and review feedback welcome. Minimal testing done on a couple of
i386 platforms. Perf testing yet to be done.

thanks,
suresh
---
Track the cpus for which ticks are stopped; one among these cpus will
be doing the idle load balancing on behalf of all the remaining cpus.
If the ticks are stopped for all the cpus in the system, idle load balancing
will stop at that moment, and it restarts as soon as there is a busy cpu in
the system.

TBD: Select the appropriate idle cpu for doing this idle load balancing,
such as an idle core in a busy package (one which has a busy core). Selecting
an idle thread as the owner when there are busy thread siblings is
not a good idea.

We can also think of offloading the task movements from the idle load balancing
owner to the idle cpu on behalf of which this work is being done.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff -pNru linux-2.6.19-mm1/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.19-mm1/include/linux/sched.h	2006-12-12 06:39:22.000000000 -0800
+++ linux/include/linux/sched.h	2006-12-12 06:51:03.000000000 -0800
@@ -195,6 +195,14 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern cpumask_t nohz_cpu_mask;
+#ifdef CONFIG_SMP
+extern int select_notick_load_balancer(int cpu);
+#else
+static inline int select_notick_load_balancer(int cpu)
+{
+	return 0;
+}
+#endif
 
 /*
  * Only dump TASK_* tasks. (-1 for all tasks)
diff -pNru linux-2.6.19-mm1/kernel/hrtimer.c linux/kernel/hrtimer.c
--- linux-2.6.19-mm1/kernel/hrtimer.c	2006-12-12 06:39:22.000000000 -0800
+++ linux/kernel/hrtimer.c	2006-12-12 06:51:03.000000000 -0800
@@ -600,6 +600,9 @@ void hrtimer_stop_sched_tick(void)
 		 * the scheduler tick in hrtimer_restart_sched_tick.
 		 */
 		if (!cpu_base->tick_stopped) {
+			if (select_notick_load_balancer(1))
+				goto end;
+
 			cpu_base->idle_tick = cpu_base->sched_timer.expires;
 			cpu_base->tick_stopped = 1;
 			cpu_base->idle_jiffies = last_jiffies;
@@ -616,6 +619,7 @@ void hrtimer_stop_sched_tick(void)
 			raise_softirq_irqoff(TIMER_SOFTIRQ);
 	}
 
+end:
 	local_irq_restore(flags);
 }
 
@@ -630,6 +634,8 @@ void hrtimer_restart_sched_tick(void)
 	unsigned long ticks;
 	ktime_t now, delta;
 
+	select_notick_load_balancer(0);
+
 	if (!cpu_base->hres_active || !cpu_base->tick_stopped)
 		return;
 
diff -pNru linux-2.6.19-mm1/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.19-mm1/kernel/sched.c	2006-12-12 06:39:22.000000000 -0800
+++ linux/kernel/sched.c	2006-12-12 07:03:38.000000000 -0800
@@ -1041,6 +1041,15 @@ static void resched_task(struct task_str
 	if (!tsk_is_polling(p))
 		smp_send_reschedule(cpu);
 }
+static void resched_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned int flags;
+
+	spin_lock_irqsave(&rq->lock, flags);
+	resched_task(cpu_curr(cpu));
+	spin_unlock_irqrestore(&rq->lock, flags);
+}
 #else
 static inline void resched_task(struct task_struct *p)
 {
@@ -2641,6 +2650,9 @@ redo:
 		double_rq_unlock(this_rq, busiest);
 		local_irq_restore(flags);
 
+		if (nr_moved && this_cpu != smp_processor_id())
+			resched_cpu(this_cpu);
+
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(all_pinned)) {
 			cpu_clear(cpu_of(busiest), cpus);
@@ -2909,6 +2921,66 @@ static void update_load(struct rq *this_
 	}
 }
 
+#ifdef CONFIG_NO_HZ
+struct {
+	int load_balancer;
+	cpumask_t  cpu_mask;
+} notick ____cacheline_aligned = {
+	.load_balancer = -1,
+	.cpu_mask = CPU_MASK_NONE,
+};
+
+/*
+ * This routine will try to nominate the ilb (idle load balancing)
+ * owner among the cpus whose ticks are stopped. ilb owner will do the idle
+ * load balancing on behalf of all those cpus. If all the cpus in the system
+ * go into this tickless mode, then there will be no ilb owner (as there is
+ * no need for one) and all the cpus will sleep till the next wakeup event
+ * arrives...
+ *
+ * For the ilb owner, tick is not stopped. And this tick will be used
+ * for idle load balancing. ilb owner will still be part of
+ * notick.cpu_mask..
+ *
+ * While stopping the tick, this cpu will become the ilb owner if there
+ * is no other owner. And will be the owner till that cpu becomes busy
+ * or if all cpus in the system stop their ticks at which point
+ * there is no need for ilb owner.
+ *
+ * When the ilb owner becomes busy, it nominates another owner, during the
+ * schedule()
+ */
+int select_notick_load_balancer(int stop_tick)
+{
+	int cpu = smp_processor_id();
+
+	if (stop_tick) {
+		cpu_set(cpu, notick.cpu_mask);
+
+		/* time for ilb owner also to sleep */
+		if (cpus_weight(notick.cpu_mask) == num_online_cpus())
+			return 0;
+
+		if (notick.load_balancer == -1) {
+			/* make me the ilb owner */
+			if (cmpxchg(&notick.load_balancer, -1, cpu) == -1)
+				return 1;
+		} else if (notick.load_balancer == cpu)
+			return 1;
+	} else {
+		if (!cpu_isset(cpu, notick.cpu_mask))
+			return 0;
+
+		cpu_clear(cpu, notick.cpu_mask);
+
+		if (notick.load_balancer == cpu)
+			if (cmpxchg(&notick.load_balancer,  cpu, -1) != cpu)
+				BUG();
+	}
+	return 0;
+}
+#endif
+
 /*
  * run_rebalance_domains is triggered when needed from the scheduler tick.
  *
@@ -2925,18 +2997,47 @@ static void run_rebalance_domains(struct
 	struct rq *this_rq = cpu_rq(this_cpu);
 	unsigned long interval;
 	struct sched_domain *sd;
+	enum idle_type idle;
+	unsigned long next_balance;
+#ifdef CONFIG_NO_HZ
+	cpumask_t cpus = notick.cpu_mask;
+	int local_cpu = this_cpu;
+
+	/*
+	 * Check if it is time for ilb to stop.
+	 */
+	if (idle_cpu(local_cpu) && notick.load_balancer == local_cpu &&
+	    cpus_weight(cpus) == num_online_cpus()) {
+		resched_cpu(local_cpu);
+		return;
+	}
+
+restart:
+	if (idle_cpu(local_cpu) && notick.load_balancer == local_cpu) {
+		this_cpu = first_cpu(cpus);
+		this_rq = cpu_rq(this_cpu);
+		cpu_clear(this_cpu, cpus);
+	}
+#endif
+
 	/*
 	 * We are idle if there are no processes running. This
 	 * is valid even if we are the idle process (SMT).
 	 */
-	enum idle_type idle = !this_rq->nr_running ?
+	idle = !this_rq->nr_running ?
 				SCHED_IDLE : NOT_IDLE;
 	/* Earliest time when we have to call run_rebalance_domains again */
-	unsigned long next_balance = jiffies + 60*HZ;
+	next_balance = jiffies + 60*HZ;
 
 	__count_vm_event(SCHED_BALANCE_SOFTIRQ);
 
 	for_each_domain(this_cpu, sd) {
+#ifdef CONFIG_NO_HZ
+		if (idle_cpu(local_cpu) && notick.load_balancer == local_cpu
+		    && need_resched())
+			return;
+#endif
+
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
@@ -2983,6 +3084,12 @@ out:
 			break;
 	}
 	this_rq->next_balance = next_balance;
+
+#ifdef CONFIG_NO_HZ
+	if (idle_cpu(local_cpu) && notick.load_balancer == local_cpu &&
+	    !cpus_empty(cpus))
+		goto restart;
+#endif
 }
 #else
 /*
@@ -3562,6 +3669,21 @@ switch_tasks:
 		++*switch_count;
 
 		prepare_task_switch(rq, next);
+#if defined(CONFIG_HZ) && defined(CONFIG_SMP)
+		if (prev == rq->idle && notick.load_balancer == -1) {
+			/*
+			 * simple selection for now: Nominate the first cpu in
+			 * the notick list to be the next ilb owner.
+			 *
+			 * TBD: Traverse the sched domains and nominate
+			 * the nearest cpu in the notick.cpu_mask.
+			 */
+			int ilb = first_cpu(notick.cpu_mask);
+
+			if (ilb != NR_CPUS)
+				resched_cpu(ilb);
+		}
+#endif
 		prev = context_switch(rq, prev, next);
 		barrier();
 		/*

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-11 23:53 [RFC] Patch: dynticks: idle load balancing Siddha, Suresh B
@ 2006-12-13 22:43 ` Ingo Molnar
  2006-12-13 23:13   ` Ingo Molnar
  2006-12-20  0:49 ` Steven Rostedt
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2006-12-13 22:43 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel


* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:

> Appended patch attempts to fix the process idle load balancing in the 
> presence of dynticks. cpus for which ticks are stopped will sleep till 
> the next event wakes them up. Potentially these sleeps can be for large 
> durations, during which today there is no idle load balancing 
> being done. There was some discussion (last year) on this 
> topic on lkml, where two main approaches were debated. One is 
> to back off the idle load balancing to bigger intervals, and the 
> second is a watchdog mechanism where the busy cpu will trigger the 
> load balance on an idle cpu.  Both of these mechanisms have their 
> drawbacks.

nice work! I have added your patch to -rt. Btw., it needs the patch 
below to work on 64-bit.

	Ingo

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1170,7 +1170,7 @@ static void resched_task(struct task_str
 static void resched_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned int flags;
+	unsigned long flags;
 
 	spin_lock_irqsave(&rq->lock, flags);
 	resched_task(cpu_curr(cpu));


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-13 23:13   ` Ingo Molnar
@ 2006-12-13 23:03     ` Siddha, Suresh B
  2006-12-13 23:31       ` Ingo Molnar
  2006-12-13 23:48     ` [RFC] Patch: dynticks: idle load balancing Ingo Molnar
  1 sibling, 1 reply; 25+ messages in thread
From: Siddha, Suresh B @ 2006-12-13 23:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Siddha, Suresh B, nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel

On Thu, Dec 14, 2006 at 12:13:16AM +0100, Ingo Molnar wrote:
> there's another bug as well: in schedule() resched_cpu() is called with 
> the current runqueue held in two places, which is deadlock potential. 

resched_cpu() was getting called after prepare_task_switch(), which
releases the current runqueue lock, doesn't it?

> The easiest fix for this is to use trylock - find the patch for that. 
> This is a hint only anyway - and if a CPU is idle its runqueue will be 

Though I don't see a potential deadlock, I like this optimization.

thanks,
suresh

> lockable. (fixing it via double-locking is easy in the first call site, 
> but the second one looks harder)
> 
> 	Ingo
> 
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -1167,12 +1167,14 @@ static void resched_task(struct task_str
>  	if (!tsk_is_polling(p))
>  		smp_send_reschedule(cpu);
>  }
> +
>  static void resched_cpu(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	unsigned int flags;
> +	unsigned long flags;
>  
> -	spin_lock_irqsave(&rq->lock, flags);
> +	if (!spin_trylock_irqsave(&rq->lock, flags))
> +		return;
>  	resched_task(cpu_curr(cpu));
>  	spin_unlock_irqrestore(&rq->lock, flags);
>  }


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-13 22:43 ` Ingo Molnar
@ 2006-12-13 23:13   ` Ingo Molnar
  2006-12-13 23:03     ` Siddha, Suresh B
  2006-12-13 23:48     ` [RFC] Patch: dynticks: idle load balancing Ingo Molnar
  0 siblings, 2 replies; 25+ messages in thread
From: Ingo Molnar @ 2006-12-13 23:13 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> > Appended patch attempts to fix the process idle load balancing in 
> > the presence of dynticks. cpus for which ticks are stopped will 
> > sleep till the next event wakes them up. Potentially these sleeps can 
> > be for large durations, during which today there is no idle load 
> > balancing being done. There was some discussion (last year) 
> > on this topic on lkml, where two main approaches were 
> > debated. One is to back off the idle load balancing to bigger 
> > intervals, and the second is a watchdog mechanism where the busy cpu 
> > will trigger the load balance on an idle cpu.  Both of these 
> > mechanisms have their drawbacks.
> 
> nice work! I have added your patch to -rt. Btw., it needs the patch 
> below to work on 64-bit.

there's another bug as well: in schedule() resched_cpu() is called with 
the current runqueue held in two places, which is deadlock potential. 
The easiest fix for this is to use trylock - find the patch for that. 
This is a hint only anyway - and if a CPU is idle its runqueue will be 
lockable. (fixing it via double-locking is easy in the first call site, 
but the second one looks harder)

	Ingo

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1167,12 +1167,14 @@ static void resched_task(struct task_str
 	if (!tsk_is_polling(p))
 		smp_send_reschedule(cpu);
 }
+
 static void resched_cpu(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned int flags;
+	unsigned long flags;
 
-	spin_lock_irqsave(&rq->lock, flags);
+	if (!spin_trylock_irqsave(&rq->lock, flags))
+		return;
 	resched_task(cpu_curr(cpu));
 	spin_unlock_irqrestore(&rq->lock, flags);
 }


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-13 23:31       ` Ingo Molnar
@ 2006-12-13 23:19         ` Siddha, Suresh B
  2006-12-14  0:34           ` Ingo Molnar
  2006-12-19 20:12           ` Ingo Molnar
  0 siblings, 2 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2006-12-13 23:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Siddha, Suresh B, nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel

On Thu, Dec 14, 2006 at 12:31:57AM +0100, Ingo Molnar wrote:
> 
> * Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:
> 
> > On Thu, Dec 14, 2006 at 12:13:16AM +0100, Ingo Molnar wrote:
> > > there's another bug as well: in schedule() resched_cpu() is called with 
> > > the current runqueue held in two places, which is deadlock potential. 
> > 
> > resched_cpu() was getting called after prepare_task_switch(), which 
> > releases the current runqueue lock, doesn't it?
> 
> no, it doesn't release it. The finish stage is what releases it.

I see.

> the other problem is load_balance(): there this_rq is locked and you 
> call resched_cpu() unconditionally.

But here resched_cpu() was called after double_rq_unlock().

thanks,
suresh


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-13 23:03     ` Siddha, Suresh B
@ 2006-12-13 23:31       ` Ingo Molnar
  2006-12-13 23:19         ` Siddha, Suresh B
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2006-12-13 23:31 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel


* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:

> On Thu, Dec 14, 2006 at 12:13:16AM +0100, Ingo Molnar wrote:
> > there's another bug as well: in schedule() resched_cpu() is called with 
> > the current runqueue held in two places, which is deadlock potential. 
> 
> resched_cpu() was getting called after prepare_task_switch(), which 
> releases the current runqueue lock, doesn't it?

no, it doesn't release it. The finish stage is what releases it.

the other problem is load_balance(): there this_rq is locked and you 
call resched_cpu() unconditionally.

	Ingo


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-13 23:13   ` Ingo Molnar
  2006-12-13 23:03     ` Siddha, Suresh B
@ 2006-12-13 23:48     ` Ingo Molnar
  1 sibling, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2006-12-13 23:48 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> The easiest fix for this is to use trylock - find the patch for that. 

FYI, i have released 2.6.19-rt14 with your patch integrated into it as 
well - it's looking good so far in my testing. (the yum repository will 
be updated in a few minutes.)

	Ingo


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-13 23:19         ` Siddha, Suresh B
@ 2006-12-14  0:34           ` Ingo Molnar
  2006-12-19 20:12           ` Ingo Molnar
  1 sibling, 0 replies; 25+ messages in thread
From: Ingo Molnar @ 2006-12-14  0:34 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel


* Siddha, Suresh B <suresh.b.siddha@intel.com> wrote:

> > the other problem is load_balance(): there this_rq is locked and you 
> > call resched_cpu() unconditionally.
> 
> But here resched_cpu() was called after double_rq_unlock().

yeah, you are right - the schedule() one is/was the only problem.

	Ingo


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-13 23:19         ` Siddha, Suresh B
  2006-12-14  0:34           ` Ingo Molnar
@ 2006-12-19 20:12           ` Ingo Molnar
  2006-12-19 21:12             ` Siddha, Suresh B
  1 sibling, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2006-12-19 20:12 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: vatsa, clameter, tglx, arjan, linux-kernel


find below another bugfix for the dynticks-sched patch. (this bug caused 
crashes under a stress test)

	Ingo

---
 kernel/sched.c |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3015,6 +3015,8 @@ static void run_rebalance_domains(struct
 restart:
 	if (idle_cpu(local_cpu) && notick.load_balancer == local_cpu) {
 		this_cpu = first_cpu(cpus);
+		if (unlikely(this_cpu >= NR_CPUS))
+			return;
 		this_rq = cpu_rq(this_cpu);
 		cpu_clear(this_cpu, cpus);
 	}



* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-19 20:12           ` Ingo Molnar
@ 2006-12-19 21:12             ` Siddha, Suresh B
  2007-01-16 11:35               ` Ingo Molnar
  0 siblings, 1 reply; 25+ messages in thread
From: Siddha, Suresh B @ 2006-12-19 21:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Siddha, Suresh B, vatsa, clameter, tglx, arjan, linux-kernel

On Tue, Dec 19, 2006 at 09:12:48PM +0100, Ingo Molnar wrote:
>  restart:
>  	if (idle_cpu(local_cpu) && notick.load_balancer == local_cpu) {
>  		this_cpu = first_cpu(cpus);
> +		if (unlikely(this_cpu >= NR_CPUS))
> +			return;

oops.

There is a window where local_cpu has been cleared from notick.cpu_mask
but notick.load_balancer still points to local_cpu.
This can also be corrected by resetting notick.load_balancer
before clearing that cpu from notick.cpu_mask in select_notick_load_balancer().

thanks,
suresh


* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-11 23:53 [RFC] Patch: dynticks: idle load balancing Siddha, Suresh B
  2006-12-13 22:43 ` Ingo Molnar
@ 2006-12-20  0:49 ` Steven Rostedt
  1 sibling, 0 replies; 25+ messages in thread
From: Steven Rostedt @ 2006-12-20  0:49 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: mingo, nickpiggin, vatsa, clameter, tglx, arjan, linux-kernel

On Mon, 2006-12-11 at 15:53 -0800, Siddha, Suresh B wrote:

> 
> Comments and review feedback welcome. Minimal testing done on a couple of
> i386 platforms. Perf testing yet to be done.

Nice work!

> 
> thanks,
> suresh
> ---


> diff -pNru linux-2.6.19-mm1/include/linux/sched.h linux/include/linux/sched.h
> --- linux-2.6.19-mm1/include/linux/sched.h	2006-12-12 06:39:22.000000000 -0800
> +++ linux/include/linux/sched.h	2006-12-12 06:51:03.000000000 -0800
> @@ -195,6 +195,14 @@ extern void sched_init_smp(void);
>  extern void init_idle(struct task_struct *idle, int cpu);
>  
>  extern cpumask_t nohz_cpu_mask;
> +#ifdef CONFIG_SMP
> +extern int select_notick_load_balancer(int cpu);
> +#else
> +static inline int select_notick_load_balancer(int cpu)

Later on in the actual code, the parameter is named stop_tick, which
makes sense. You should change the name here too so it's not confusing
when looking at the code later on.

> +{
> +	return 0;
> +}
> +#endif

[...]

> +
> +/*
> + * This routine will try to nominate the ilb (idle load balancing)
> + * owner among the cpus whose ticks are stopped. ilb owner will do the idle
> + * load balancing on behalf of all those cpus. If all the cpus in the system
> + * go into this tickless mode, then there will be no ilb owner (as there is
> + * no need for one) and all the cpus will sleep till the next wakeup event
> + * arrives...
> + *
> + * For the ilb owner, tick is not stopped. And this tick will be used
> + * for idle load balancing. ilb owner will still be part of
> + * notick.cpu_mask..
> + *
> + * While stopping the tick, this cpu will become the ilb owner if there
> + * is no other owner. And will be the owner till that cpu becomes busy
> + * or if all cpus in the system stop their ticks at which point
> + * there is no need for ilb owner.
> + *
> + * When the ilb owner becomes busy, it nominates another owner, during the
> + * schedule()
> + */
> +int select_notick_load_balancer(int stop_tick)
> +{
> +	int cpu = smp_processor_id();
> +

[...]

> +#ifdef CONFIG_NO_HZ
> +	if (idle_cpu(local_cpu) && notick.load_balancer == local_cpu &&
> +	    !cpus_empty(cpus))
> +		goto restart;
> +#endif
>  }
>  #else
>  /*
> @@ -3562,6 +3669,21 @@ switch_tasks:
>  		++*switch_count;
>  
>  		prepare_task_switch(rq, next);
> +#if defined(CONFIG_HZ) && defined(CONFIG_SMP)

Ah! so this is where the CONFIG_NO_HZ mistake came in ;)


> +		if (prev == rq->idle && notick.load_balancer == -1) {
> +			/*
> +			 * simple selection for now: Nominate the first cpu in
> +			 * the notick list to be the next ilb owner.
> +			 *
> +			 * TBD: Traverse the sched domains and nominate
> +			 * the nearest cpu in the notick.cpu_mask.
> +			 */
> +			int ilb = first_cpu(notick.cpu_mask);
> +
> +			if (ilb != NR_CPUS)
> +				resched_cpu(ilb);
> +		}
> +#endif
>  		prev = context_switch(rq, prev, next);


-- Steve



* Re: [RFC] Patch: dynticks: idle load balancing
  2006-12-19 21:12             ` Siddha, Suresh B
@ 2007-01-16 11:35               ` Ingo Molnar
  2007-01-30 21:57                 ` Siddha, Suresh B
  0 siblings, 1 reply; 25+ messages in thread
From: Ingo Molnar @ 2007-01-16 11:35 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: vatsa, clameter, tglx, arjan, linux-kernel


Suresh,

on the latest -rt kernel, when the dynticks load-balancer is enabled, 
then a dual-core Core2 Duo test-system increases its irq rate from the 
normal 15/17 per second to 300-400/sec - on a completely idle system(!). 
Any idea what's going on? I'll disable the load balancer for now.

	Ingo


* Re: [RFC] Patch: dynticks: idle load balancing
  2007-01-16 11:35               ` Ingo Molnar
@ 2007-01-30 21:57                 ` Siddha, Suresh B
  2007-02-07 22:19                   ` Siddha, Suresh B
  2007-02-17  2:03                   ` [patch 1/2] sched: fix idle load balancing in softirqd context Siddha, Suresh B
  0 siblings, 2 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2007-01-30 21:57 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Siddha, Suresh B, vatsa, clameter, tglx, arjan, linux-kernel

Sorry for the delayed response. I was away on vacation.

Please let me know if you still see this issue with the latest -rt kernel.

thanks,
suresh

On Tue, Jan 16, 2007 at 12:35:05PM +0100, Ingo Molnar wrote:
> 
> Suresh,
> 
> on the latest -rt kernel, when the dynticks load-balancer is enabled, 
> then a dual-core Core2 Duo test-system increases its irq rate from the 
> normal 15/17 per second to 300-400/sec - on a completely idle system(!). 
> Any idea what's going on? I'll disable the load balancer for now.
> 
> 	Ingo


* Re: [RFC] Patch: dynticks: idle load balancing
  2007-01-30 21:57                 ` Siddha, Suresh B
@ 2007-02-07 22:19                   ` Siddha, Suresh B
  2007-02-17  2:03                   ` [patch 1/2] sched: fix idle load balancing in softirqd context Siddha, Suresh B
  1 sibling, 0 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2007-02-07 22:19 UTC (permalink / raw)
  To: mingo; +Cc: Ingo Molnar, vatsa, clameter, tglx, arjan, linux-kernel

On Tue, Jan 30, 2007 at 01:57:09PM -0800, Siddha, Suresh B wrote:
> Please let me know if you still see this issue with the latest -rt kernel.
> 
> On Tue, Jan 16, 2007 at 12:35:05PM +0100, Ingo Molnar wrote:
> > on the latest -rt kernel, when the dynticks load-balancer is enabled, 
> > then a dual-core Core2 Duo test-system increases its irq rate from the 
> > normal 15/17 per second to 300-400/sec - on a completely idle system(!). 
> > Any idea what's going on? I'll disable the load balancer for now.

Ok, got time to look into this.

The answer is simple. Load balancing in recent kernels happens
using SCHED_SOFTIRQ, and in the -rt tree that happens not in the idle process
context but in the context of softirqd for SCHED_SOFTIRQ.

This breaks the dynticks load balancer, and the regular idle load balancing
too :(

Am on to fixing the problem now :)

thanks,
suresh


* [patch 1/2] sched: fix idle load balancing in softirqd context
  2007-01-30 21:57                 ` Siddha, Suresh B
  2007-02-07 22:19                   ` Siddha, Suresh B
@ 2007-02-17  2:03                   ` Siddha, Suresh B
  2007-02-17  2:08                     ` [patch 2/2] sched: dynticks idle load balancing - v2 Siddha, Suresh B
                                       ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2007-02-17  2:03 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, npiggin, rostedt

Periodic load balancing in recent kernels happens in the softirq.
In certain -rt configurations, these softirqs are handled in softirqd
context, and hence the check for an idle processor was always returning
busy (as nr_running > 1).

This patch captures the idle information at the tick and passes this info
to softirq context through a new field 'idle_at_tick' in the rq.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff -pNru linux-2.6.20.x86_64/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.20.x86_64/kernel/sched.c	2007-02-10 03:41:27.000000000 -0800
+++ linux/kernel/sched.c	2007-02-16 16:22:44.000000000 -0800
@@ -240,6 +240,7 @@ struct rq {
 	unsigned long raw_weighted_load;
 #ifdef CONFIG_SMP
 	unsigned long cpu_load[3];
+	unsigned char idle_at_tick;
 #endif
 	unsigned long long nr_switches;
 
@@ -3350,12 +3351,7 @@ static void run_rebalance_domains(struct
 	struct rq *this_rq = cpu_rq(this_cpu);
 	unsigned long interval;
 	struct sched_domain *sd;
-	/*
-	 * We are idle if there are no processes running. This
-	 * is valid even if we are the idle process (SMT).
-	 */
-	enum idle_type idle = !this_rq->nr_running ?
-				SCHED_IDLE : NOT_IDLE;
+	enum idle_type idle = this_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
 	/* Earliest time when we have to call run_rebalance_domains again */
 	unsigned long next_balance = jiffies + 60*HZ;
 
@@ -3651,20 +3647,21 @@ void scheduler_tick(void)
 {
 	unsigned long long now = sched_clock();
 	struct task_struct *p = current;
-	int cpu = smp_processor_id();
+	int cpu = smp_processor_id(), idle_at_tick = idle_cpu(cpu);
 	struct rq *rq = cpu_rq(cpu);
 
 	BUG_ON(!irqs_disabled());
 
 	update_cpu_clock(p, rq, now);
 
-	if (p == rq->idle)
+	if (idle_at_tick)
 		/* Task on the idle queue */
 		wake_priority_sleeper(rq);
 	else
 		task_running_tick(rq, p);
 #ifdef CONFIG_SMP
 	update_load(rq);
+	rq->idle_at_tick = idle_at_tick;
 	if (time_after_eq(jiffies, rq->next_balance))
 		raise_softirq(SCHED_SOFTIRQ);
 #endif


* [patch 2/2] sched: dynticks idle load balancing - v2
  2007-02-17  2:03                   ` [patch 1/2] sched: fix idle load balancing in softirqd context Siddha, Suresh B
@ 2007-02-17  2:08                     ` Siddha, Suresh B
  2007-02-21 20:23                       ` Andrew Morton
  2007-02-22  3:26                       ` [patch 2/2] sched: dynticks idle load balancing - v2 Nick Piggin
  2007-02-17 14:42                     ` [patch 1/2] sched: fix idle load balancing in softirqd context Steven Rostedt
  2007-02-21 20:13                     ` Andrew Morton
  2 siblings, 2 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2007-02-17  2:08 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel, npiggin, rostedt

Changes since v1:

  - Move the idle load balancer selection from schedule()
    to the first busy scheduler_tick() after restarting the tick.
    This will avoid the unnecessary ownership changes when
    softirqs like timer and sched (which are run in softirqd context
    in certain -rt configurations) are invoked for every idle tick
    that happens.

  - Misc cleanups.

---
Fix periodic idle load balancing in the presence of dynticks.
cpus whose ticks are stopped will sleep until the next event wakes them up.
These sleeps can potentially be of large durations, during which no periodic
idle load balancing is done today.

This patch nominates an owner among the idle cpus, which does the idle load
balancing on behalf of the other idle cpus. Once all the cpus are completely
idle, this idle load balancing is stopped too. Checks added to the fast path
are minimized. Whenever there are busy cpus in the system, there will be an
owner (idle cpu) doing the system-wide idle load balancing.
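The owner-election logic can be modeled in plain userspace C (an illustrative sketch only: it uses C11 atomics in place of the kernel's atomic_cmpxchg(), omits the cpu-offline and in_nohz_recently handling, and all names are invented):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS 4

static atomic_int load_balancer = -1;  /* -1: no ilb owner (nohz.load_balancer) */
static bool tick_stopped[NCPUS];       /* models nohz.cpu_mask */

static int stopped_count(void)
{
	int n = 0;
	for (int i = 0; i < NCPUS; i++)
		n += tick_stopped[i];
	return n;
}

/*
 * Model of select_nohz_load_balancer(): returns 1 if @cpu must keep its
 * tick running to act as the idle load balancing (ilb) owner.
 */
static int select_ilb(int cpu, int stop_tick)
{
	int expected;

	if (stop_tick) {
		tick_stopped[cpu] = true;

		/* Everyone idle: the owner may finally sleep too. */
		if (stopped_count() == NCPUS) {
			expected = cpu;
			atomic_compare_exchange_strong(&load_balancer,
						       &expected, -1);
			return 0;
		}

		expected = -1;	/* try to become the owner */
		if (atomic_compare_exchange_strong(&load_balancer,
						   &expected, cpu))
			return 1;
		return atomic_load(&load_balancer) == cpu;
	}

	/* Tick restarting: leave the mask, drop ownership if held. */
	tick_stopped[cpu] = false;
	expected = cpu;
	atomic_compare_exchange_strong(&load_balancer, &expected, -1);
	return 0;
}
```

A cpu stopping its tick becomes the owner only if no owner exists; when the owner's tick restarts it releases ownership, and the next cpu going tickless picks it up.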

Open items:
1. Intelligent owner selection (like an idle core in a busy package).
2. Merge with rcu's nohz_cpu_mask?

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff -pNru linux-2.6.20.x86_64/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.6.20.x86_64/include/linux/sched.h	2007-02-10 03:41:27.000000000 -0800
+++ linux/include/linux/sched.h	2007-02-16 17:44:08.000000000 -0800
@@ -237,6 +237,14 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern cpumask_t nohz_cpu_mask;
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
+extern int select_nohz_load_balancer(int cpu);
+#else
+static inline int select_nohz_load_balancer(int cpu)
+{
+	return 0;
+}
+#endif
 
 /*
  * Only dump TASK_* tasks. (-1 for all tasks)
diff -pNru linux-2.6.20.x86_64/kernel/sched.c linux/kernel/sched.c
--- linux-2.6.20.x86_64/kernel/sched.c	2007-02-16 16:40:16.000000000 -0800
+++ linux/kernel/sched.c	2007-02-16 17:56:13.000000000 -0800
@@ -241,6 +241,9 @@ struct rq {
 #ifdef CONFIG_SMP
 	unsigned long cpu_load[3];
 	unsigned char idle_at_tick;
+#ifdef CONFIG_NO_HZ
+	unsigned char in_nohz_recently;
+#endif
 #endif
 	unsigned long long nr_switches;
 
@@ -1169,6 +1172,17 @@ static void resched_task(struct task_str
 	if (!tsk_is_polling(p))
 		smp_send_reschedule(cpu);
 }
+
+static void resched_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+
+	if (!spin_trylock_irqsave(&rq->lock, flags))
+		return;
+	resched_task(cpu_curr(cpu));
+	spin_unlock_irqrestore(&rq->lock, flags);
+}
 #else
 static inline void resched_task(struct task_struct *p)
 {
@@ -3067,6 +3081,9 @@ redo:
 		double_rq_unlock(this_rq, busiest);
 		local_irq_restore(flags);
 
+		if (nr_moved && this_cpu != smp_processor_id())
+			resched_cpu(this_cpu);
+
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(all_pinned)) {
 			cpu_clear(cpu_of(busiest), cpus);
@@ -3335,6 +3352,79 @@ static void update_load(struct rq *this_
 	}
 }
 
+#ifdef CONFIG_NO_HZ
+static struct {
+	int load_balancer;
+	cpumask_t  cpu_mask;
+} nohz ____cacheline_aligned = {
+	.load_balancer = -1,
+	.cpu_mask = CPU_MASK_NONE,
+};
+
+/*
+ * This routine will try to nominate the ilb (idle load balancing)
+ * owner among the cpus whose ticks are stopped. ilb owner will do the idle
+ * load balancing on behalf of all those cpus. If all the cpus in the system
+ * go into this tickless mode, then there will be no ilb owner (as there is
+ * no need for one) and all the cpus will sleep till the next wakeup event
+ * arrives...
+ *
+ * For the ilb owner, tick is not stopped. And this tick will be used
+ * for idle load balancing. ilb owner will still be part of
+ * nohz.cpu_mask..
+ *
+ * While stopping the tick, this cpu will become the ilb owner if there
+ * is no other owner. And will be the owner till that cpu becomes busy
+ * or if all cpus in the system stop their ticks at which point
+ * there is no need for ilb owner.
+ *
+ * When the ilb owner becomes busy, it nominates another owner, during the
+ * next busy scheduler_tick()
+ */
+int select_nohz_load_balancer(int stop_tick)
+{
+	int cpu = smp_processor_id();
+
+	if (stop_tick) {
+		cpu_set(cpu, nohz.cpu_mask);
+		cpu_rq(cpu)->in_nohz_recently = 1;
+
+		/*
+		 * If we are going offline and still the leader, give up!
+		 */
+		if (cpu_is_offline(cpu) && nohz.load_balancer == cpu) {
+			if (cmpxchg(&nohz.load_balancer,  cpu, -1) != cpu)
+				BUG();
+			return 0;
+		}
+
+		/* time for ilb owner also to sleep */
+		if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {
+			if (nohz.load_balancer == cpu)
+				nohz.load_balancer = -1;
+			return 0;
+		}
+
+		if (nohz.load_balancer == -1) {
+			/* make me the ilb owner */
+			if (cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
+				return 1;
+		} else if (nohz.load_balancer == cpu)
+			return 1;
+	} else {
+		if (!cpu_isset(cpu, nohz.cpu_mask))
+			return 0;
+
+		cpu_clear(cpu, nohz.cpu_mask);
+
+		if (nohz.load_balancer == cpu)
+			if (cmpxchg(&nohz.load_balancer,  cpu, -1) != cpu)
+				BUG();
+	}
+	return 0;
+}
+#endif
+
 /*
  * run_rebalance_domains is triggered when needed from the scheduler tick.
  *
@@ -3347,15 +3437,46 @@ static DEFINE_SPINLOCK(balancing);
 
 static void run_rebalance_domains(struct softirq_action *h)
 {
-	int this_cpu = smp_processor_id(), balance = 1;
-	struct rq *this_rq = cpu_rq(this_cpu);
-	unsigned long interval;
+	int balance_cpu = smp_processor_id(), balance;
+	struct rq *balance_rq = cpu_rq(balance_cpu);
+	unsigned long interval, next_balance;
 	struct sched_domain *sd;
-	enum idle_type idle = this_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
+	enum idle_type idle;
+
+#ifdef CONFIG_NO_HZ
+	cpumask_t cpus = nohz.cpu_mask;
+	int local_cpu = balance_cpu;
+	struct rq *local_rq = balance_rq;
+
+restart:
+
+	if (local_rq->idle_at_tick && nohz.load_balancer == local_cpu) {
+		if (need_resched())
+			return;
+
+		balance_cpu = first_cpu(cpus);
+		if (unlikely(balance_cpu >= NR_CPUS))
+			return;
+		balance_rq = cpu_rq(balance_cpu);
+		cpu_clear(balance_cpu, cpus);
+	}
+
+	/*
+	 * We must be doing idle load balancing for some other idle cpu
+	 */
+	if (local_cpu != balance_cpu)
+		idle = SCHED_IDLE;
+	else
+#endif
+		idle = balance_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
+
+	balance = 1;
+
 	/* Earliest time when we have to call run_rebalance_domains again */
-	unsigned long next_balance = jiffies + 60*HZ;
+	next_balance = jiffies + 60*HZ;
+
+	for_each_domain(balance_cpu, sd) {
 
-	for_each_domain(this_cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
@@ -3374,7 +3495,7 @@ static void run_rebalance_domains(struct
 		}
 
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (load_balance(this_cpu, this_rq, sd, idle, &balance)) {
+			if (load_balance(balance_cpu, balance_rq, sd, idle, &balance)) {
 				/*
 				 * We've pulled tasks over so either we're no
 				 * longer idle, or one of our SMT siblings is
@@ -3398,7 +3519,16 @@ out:
 		if (!balance)
 			break;
 	}
-	this_rq->next_balance = next_balance;
+	balance_rq->next_balance = next_balance;
+
+#ifdef CONFIG_NO_HZ
+	if (local_rq->idle_at_tick && nohz.load_balancer == local_cpu) {
+		if (time_after(next_balance, local_rq->next_balance))
+			local_rq->next_balance = next_balance;
+	    	if (!cpus_empty(cpus))
+			goto restart;
+	}
+#endif
 }
 #else
 /*
@@ -3661,7 +3791,49 @@ void scheduler_tick(void)
 		task_running_tick(rq, p);
 #ifdef CONFIG_SMP
 	update_load(rq);
+#ifdef CONFIG_NO_HZ
+	/*
+	 * If we were in the nohz mode recently and busy at the
+	 * current scheduler tick, then check if we need to nominate idle
+	 * load balancer.
+	 */
+	if (rq->in_nohz_recently && !idle_at_tick) {
+		rq->in_nohz_recently = 0;
+
+		if (nohz.load_balancer == cpu) {
+			cpu_clear(cpu, nohz.cpu_mask);
+			nohz.load_balancer = -1;
+		}
+
+		if (nohz.load_balancer == -1) {
+			/*
+			 * simple selection for now: Nominate the
+			 * first cpu in the nohz list to be the next
+			 * ilb owner.
+			 *
+			 * TBD: Traverse the sched domains and nominate
+			 * the nearest cpu in the nohz.cpu_mask.
+			 */
+			int ilb = first_cpu(nohz.cpu_mask);
+
+			if (ilb != NR_CPUS)
+				resched_cpu(ilb);
+		}
+	}
+#endif
 	rq->idle_at_tick = idle_at_tick;
+#ifdef CONFIG_NO_HZ
+	if (rq->idle_at_tick && nohz.load_balancer == cpu &&
+	    cpus_weight(nohz.cpu_mask) == num_online_cpus()) {
+		resched_cpu(cpu);
+		return;
+	}
+
+	if (rq->idle_at_tick && nohz.load_balancer != cpu &&
+	    cpu_isset(cpu, nohz.cpu_mask))
+		return;
+#endif
+
 	if (time_after_eq(jiffies, rq->next_balance))
 		raise_softirq(SCHED_SOFTIRQ);
 #endif
diff -pNru linux-2.6.20.x86_64/kernel/time/tick-sched.c linux/kernel/time/tick-sched.c
--- linux-2.6.20.x86_64/kernel/time/tick-sched.c	2007-02-10 03:41:27.000000000 -0800
+++ linux/kernel/time/tick-sched.c	2007-02-16 16:56:07.000000000 -0800
@@ -241,6 +241,11 @@ void tick_nohz_stop_sched_tick(void)
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
+			if (select_nohz_load_balancer(1)) {
+				cpu_clear(cpu, nohz_cpu_mask);
+				goto out;
+			}
+
 			ts->idle_tick = ts->sched_timer.expires;
 			ts->tick_stopped = 1;
 			ts->idle_jiffies = last_jiffies;
@@ -297,6 +302,7 @@ void tick_nohz_restart_sched_tick(void)
 	now = ktime_get();
 
 	local_irq_disable();
+	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 	cpu_clear(cpu, nohz_cpu_mask);
 


* Re: [patch 1/2] sched: fix idle load balancing in softirqd context
  2007-02-17  2:03                   ` [patch 1/2] sched: fix idle load balancing in softirqd context Siddha, Suresh B
  2007-02-17  2:08                     ` [patch 2/2] sched: dynticks idle load balancing - v2 Siddha, Suresh B
@ 2007-02-17 14:42                     ` Steven Rostedt
  2007-02-21  6:25                       ` Siddha, Suresh B
  2007-02-21 20:13                     ` Andrew Morton
  2 siblings, 1 reply; 25+ messages in thread
From: Steven Rostedt @ 2007-02-17 14:42 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: mingo, linux-kernel, npiggin



On Fri, 16 Feb 2007, Siddha, Suresh B wrote:

> Periodic load balancing in recent kernels happen in the softirq.
> In certain -rt configurations, these softirqs are handled in softirqd
> context. And hence the check for idle processor was always returning
> busy (as nr_running > 1).
>
> This patch captures the idle information at the tick and passes this info
> to softirq context through an element 'idle_at_tick' in rq.

I haven't had time yet to look at this patch in much detail.

>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>

>  {
>  	unsigned long long now = sched_clock();
>  	struct task_struct *p = current;
> -	int cpu = smp_processor_id();
> +	int cpu = smp_processor_id(), idle_at_tick = idle_cpu(cpu);
>  	struct rq *rq = cpu_rq(cpu);
>

But I would recommend that the idle_at_tick should be on a separate line.

I'll try to look deeper at your patches tomorrow. I've also found that I'm
having some latency problems in -rt that I think might be related to
migration.

-- Steve



* Re: [patch 1/2] sched: fix idle load balancing in softirqd context
  2007-02-17 14:42                     ` [patch 1/2] sched: fix idle load balancing in softirqd context Steven Rostedt
@ 2007-02-21  6:25                       ` Siddha, Suresh B
  0 siblings, 0 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2007-02-21  6:25 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Siddha, Suresh B, mingo, linux-kernel, npiggin

On Sat, Feb 17, 2007 at 09:42:16AM -0500, Steven Rostedt wrote:
> On Fri, 16 Feb 2007, Siddha, Suresh B wrote:
> > -	int cpu = smp_processor_id();
> > +	int cpu = smp_processor_id(), idle_at_tick = idle_cpu(cpu);
> >  	struct rq *rq = cpu_rq(cpu);
> >
> 
> But I would recommend that the idle_at_tick should be on a separate line.

Ok.

> I'll try to look deeper at your patches tomorrow. I've also found that I'm
> having some latency problems in -rt that I think might be related to
> migration.

There is one more issue, I have observed in -rt tree.

raise_softirq_irqoff() unconditionally does the wakeup_softirqd().
But unless CONFIG_PREEMPT_SOFTIRQS is configured, do_softirq()
will process softirqs like TIMER_SOFTIRQ and SCHED_SOFTIRQ in
process context.  So it looks like we are unnecessarily waking the
softirqds corresponding to those softirqs.

thanks,
suresh


* Re: [patch 1/2] sched: fix idle load balancing in softirqd context
  2007-02-17  2:03                   ` [patch 1/2] sched: fix idle load balancing in softirqd context Siddha, Suresh B
  2007-02-17  2:08                     ` [patch 2/2] sched: dynticks idle load balancing - v2 Siddha, Suresh B
  2007-02-17 14:42                     ` [patch 1/2] sched: fix idle load balancing in softirqd context Steven Rostedt
@ 2007-02-21 20:13                     ` Andrew Morton
  2 siblings, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2007-02-21 20:13 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: mingo, linux-kernel, npiggin, rostedt

On Fri, 16 Feb 2007 18:03:35 -0800
"Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:

> +	int cpu = smp_processor_id(), idle_at_tick = idle_cpu(cpu);

argh.  Please, do

	int cpu = smp_processor_id();
	int idle_at_tick = idle_cpu(cpu);


* Re: [patch 2/2] sched: dynticks idle load balancing - v2
  2007-02-17  2:08                     ` [patch 2/2] sched: dynticks idle load balancing - v2 Siddha, Suresh B
@ 2007-02-21 20:23                       ` Andrew Morton
  2007-02-22  3:14                         ` Nick Piggin
  2007-02-24  2:01                         ` [patch] sched: dynticks idle load balancing - v3 Siddha, Suresh B
  2007-02-22  3:26                       ` [patch 2/2] sched: dynticks idle load balancing - v2 Nick Piggin
  1 sibling, 2 replies; 25+ messages in thread
From: Andrew Morton @ 2007-02-21 20:23 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: mingo, linux-kernel, npiggin, rostedt

On Fri, 16 Feb 2007 18:08:42 -0800
"Siddha, Suresh B" <suresh.b.siddha@intel.com> wrote:

> Changes since v1:
> 
>   - Move the idle load balancer selection from schedule()
>     to the first busy scheduler_tick() after restarting the tick.
>     This will avoid the unnecessary ownership changes when
>     softirqs (which are run in softirqd context in certain -rt
>     configurations) like timer and sched are invoked for every idle tick
>     that happens.
> 
>   - Misc cleanups.
> 
> ---
> Fix the process idle load balancing in the presence of dynticks.
> cpus for which ticks are stopped will sleep till the next event wakes it up.
> Potentially these sleeps can be for large durations and during which today,
> there is no periodic idle load balancing being done.
> 
> This patch nominates an owner among the idle cpus, which does the idle load
> balancing on behalf of the other idle cpus. And once all the cpus are
> completely idle, then we can stop this idle load balancing too. Checks added
> in fast path are minimized.  Whenever there are busy cpus in the system, there
> will be an owner(idle cpu) doing the system wide idle load balancing.
> 
> Open items:
> 1. Intelligent owner selection (like an idle core in a busy package).
> 2. Merge with rcu's nohz_cpu_mask?
> 

I spose I'll hold my nose and merge this, but it creates too much of a mess
to be mergeable into the CPU scheduler, IMO.

Can we please do something to reduce the ifdef density?  And if possible,
all the newly-added returns-from-the-middle-of-a-function?


> +#ifdef CONFIG_NO_HZ
> +static struct {
> +	int load_balancer;
> +	cpumask_t  cpu_mask;
> +} nohz ____cacheline_aligned = {
> +	.load_balancer = -1,
> +	.cpu_mask = CPU_MASK_NONE,
> +};
> +
> +/*
> + * This routine will try to nominate the ilb (idle load balancing)
> + * owner among the cpus whose ticks are stopped. ilb owner will do the idle
> + * load balancing on behalf of all those cpus. If all the cpus in the system
> + * go into this tickless mode, then there will be no ilb owner (as there is
> + * no need for one) and all the cpus will sleep till the next wakeup event
> + * arrives...
> + *
> + * For the ilb owner, tick is not stopped. And this tick will be used
> + * for idle load balancing. ilb owner will still be part of
> + * nohz.cpu_mask..
> + *
> + * While stopping the tick, this cpu will become the ilb owner if there
> + * is no other owner. And will be the owner till that cpu becomes busy
> + * or if all cpus in the system stop their ticks at which point
> + * there is no need for ilb owner.
> + *
> + * When the ilb owner becomes busy, it nominates another owner, during the
> + * next busy scheduler_tick()
> + */
> +int select_nohz_load_balancer(int stop_tick)
> +{
> +	int cpu = smp_processor_id();
> +
> +	if (stop_tick) {
> +		cpu_set(cpu, nohz.cpu_mask);
> +		cpu_rq(cpu)->in_nohz_recently = 1;
> +
> +		/*
> +		 * If we are going offline and still the leader, give up!
> +		 */
> +		if (cpu_is_offline(cpu) && nohz.load_balancer == cpu) {
> +			if (cmpxchg(&nohz.load_balancer,  cpu, -1) != cpu)

So we require that architectures which implement CONFIG_NO_HZ also
implement cmpxchg.

> +				BUG();
> +			return 0;
> +		}
> +
> +		/* time for ilb owner also to sleep */
> +		if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {
> +			if (nohz.load_balancer == cpu)
> +				nohz.load_balancer = -1;
> +			return 0;
> +		}
> +
> +		if (nohz.load_balancer == -1) {
> +			/* make me the ilb owner */
> +			if (cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
> +				return 1;
> +		} else if (nohz.load_balancer == cpu)
> +			return 1;
> +	} else {
> +		if (!cpu_isset(cpu, nohz.cpu_mask))
> +			return 0;
> +
> +		cpu_clear(cpu, nohz.cpu_mask);
> +
> +		if (nohz.load_balancer == cpu)
> +			if (cmpxchg(&nohz.load_balancer,  cpu, -1) != cpu)
> +				BUG();
> +	}
> +	return 0;
> +}
> +#endif
> +
>  /*
>   * run_rebalance_domains is triggered when needed from the scheduler tick.
>   *
> @@ -3347,15 +3437,46 @@ static DEFINE_SPINLOCK(balancing);
>  
>  static void run_rebalance_domains(struct softirq_action *h)
>  {
> -	int this_cpu = smp_processor_id(), balance = 1;
> -	struct rq *this_rq = cpu_rq(this_cpu);
> -	unsigned long interval;
> +	int balance_cpu = smp_processor_id(), balance;
> +	struct rq *balance_rq = cpu_rq(balance_cpu);
> +	unsigned long interval, next_balance;

One definition per line is preferred.

>  	struct sched_domain *sd;
> -	enum idle_type idle = this_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
> +	enum idle_type idle;
> +
> +#ifdef CONFIG_NO_HZ
> +	cpumask_t cpus = nohz.cpu_mask;
> +	int local_cpu = balance_cpu;
> +	struct rq *local_rq = balance_rq;
> +
> +restart:
> +
> +	if (local_rq->idle_at_tick && nohz.load_balancer == local_cpu) {
> +		if (need_resched())
> +			return;
> +
> +		balance_cpu = first_cpu(cpus);
> +		if (unlikely(balance_cpu >= NR_CPUS))
> +			return;
> +		balance_rq = cpu_rq(balance_cpu);
> +		cpu_clear(balance_cpu, cpus);
> +	}
> +
> +	/*
> +	 * We must be doing idle load balancing for some other idle cpu
> +	 */
> +	if (local_cpu != balance_cpu)
> +		idle = SCHED_IDLE;
> +	else
> +#endif
> +		idle = balance_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
> +

It's not just that this is an eyesore (it is) - it introduces some ongoing
maintenance problems.  I expect this can be pulled out into a separate
function with no loss in quality of generated code.

And that function would be a nice site for a nice comment explaining the
design.  Why do we return if need_resched()?  What is the significance of
(balance_cpu >= NR_CPUS)?
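For what it's worth, the hoisting being asked for could take roughly this shape (a userspace sketch with invented names, a bitmap standing in for nohz.cpu_mask, and a stub recording what would be balanced; not code from any posted patch):

```c
#include <assert.h>

#define NR_CPUS 4
enum idle_type { NOT_IDLE, SCHED_IDLE };

struct rq { int idle_at_tick; };
static struct rq runqueues[NR_CPUS];
static int ilb_owner = -1;           /* nohz.load_balancer */
static unsigned long nohz_mask;      /* nohz.cpu_mask as a bitmap */
static int balanced[NR_CPUS];        /* records rebalance_domains() calls */

/* Stand-in for the per-cpu sched-domain walk. */
static void rebalance_domains(int cpu, enum idle_type idle)
{
	balanced[cpu] = 1 + (idle == SCHED_IDLE);
}

/*
 * The hoisted #ifdef CONFIG_NO_HZ block: if this cpu is the ilb owner,
 * walk the other tickless cpus and balance on their behalf.
 */
static void nohz_idle_balance(int local_cpu)
{
	if (!(runqueues[local_cpu].idle_at_tick && ilb_owner == local_cpu))
		return;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != local_cpu && (nohz_mask & (1UL << cpu)))
			rebalance_domains(cpu, SCHED_IDLE);
}

static void run_rebalance(int local_cpu)
{
	enum idle_type idle = runqueues[local_cpu].idle_at_tick
				  ? SCHED_IDLE : NOT_IDLE;

	rebalance_domains(local_cpu, idle);
	nohz_idle_balance(local_cpu);  /* no-op unless we own the ilb duty */
}
```

Splitting the nohz walk into its own function keeps the common softirq path free of #ifdefs and mid-function returns.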


> +
>  	/* Earliest time when we have to call run_rebalance_domains again */
> -	unsigned long next_balance = jiffies + 60*HZ;
> +	next_balance = jiffies + 60*HZ;
> +
> +	for_each_domain(balance_cpu, sd) {
>  
> -	for_each_domain(this_cpu, sd) {
>  		if (!(sd->flags & SD_LOAD_BALANCE))
>  			continue;
>  
> @@ -3374,7 +3495,7 @@ static void run_rebalance_domains(struct
>  		}
>  
>  		if (time_after_eq(jiffies, sd->last_balance + interval)) {
> -			if (load_balance(this_cpu, this_rq, sd, idle, &balance)) {
> +			if (load_balance(balance_cpu, balance_rq, sd, idle, &balance)) {
>  				/*
>  				 * We've pulled tasks over so either we're no
>  				 * longer idle, or one of our SMT siblings is
> @@ -3398,7 +3519,16 @@ out:
>  		if (!balance)
>  			break;
>  	}
> -	this_rq->next_balance = next_balance;
> +	balance_rq->next_balance = next_balance;
> +
> +#ifdef CONFIG_NO_HZ
> +	if (local_rq->idle_at_tick && nohz.load_balancer == local_cpu) {
> +		if (time_after(next_balance, local_rq->next_balance))
> +			local_rq->next_balance = next_balance;
> +	    	if (!cpus_empty(cpus))
> +			goto restart;
> +	}
> +#endif
>  }

A goto in an ifdef :(

Perhaps it can be removed by teaching the caller to call this function again?




* Re: [patch 2/2] sched: dynticks idle load balancing - v2
  2007-02-21 20:23                       ` Andrew Morton
@ 2007-02-22  3:14                         ` Nick Piggin
  2007-02-24  2:01                         ` [patch] sched: dynticks idle load balancing - v3 Siddha, Suresh B
  1 sibling, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2007-02-22  3:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Siddha, Suresh B, mingo, linux-kernel, rostedt

On Wed, Feb 21, 2007 at 12:23:44PM -0800, Andrew Morton wrote:
> On Fri, 16 Feb 2007 18:08:42 -0800

> > +int select_nohz_load_balancer(int stop_tick)
> > +{
> > +	int cpu = smp_processor_id();
> > +
> > +	if (stop_tick) {
> > +		cpu_set(cpu, nohz.cpu_mask);
> > +		cpu_rq(cpu)->in_nohz_recently = 1;
> > +
> > +		/*
> > +		 * If we are going offline and still the leader, give up!
> > +		 */
> > +		if (cpu_is_offline(cpu) && nohz.load_balancer == cpu) {
> > +			if (cmpxchg(&nohz.load_balancer,  cpu, -1) != cpu)
> 
> So we require that architectures which implement CONFIG_NO_HZ also
> implement cmpxchg.

Just use atomic_cmpxchg, please.


* Re: [patch 2/2] sched: dynticks idle load balancing - v2
  2007-02-17  2:08                     ` [patch 2/2] sched: dynticks idle load balancing - v2 Siddha, Suresh B
  2007-02-21 20:23                       ` Andrew Morton
@ 2007-02-22  3:26                       ` Nick Piggin
  2007-02-22 22:33                         ` Siddha, Suresh B
  1 sibling, 1 reply; 25+ messages in thread
From: Nick Piggin @ 2007-02-22  3:26 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: mingo, linux-kernel, rostedt

On Fri, Feb 16, 2007 at 06:08:42PM -0800, Suresh B wrote:
> Changes since v1:
> 
>   - Move the idle load balancer selection from schedule()
>     to the first busy scheduler_tick() after restarting the tick.
>     This will avoid the unnecessary ownership changes when
>     softirqs (which are run in softirqd context in certain -rt
>     configurations) like timer and sched are invoked for every idle tick
>     that happens.
> 
>   - Misc cleanups.
> 
> ---
> Fix the process idle load balancing in the presence of dynticks.
> cpus for which ticks are stopped will sleep till the next event wakes it up.
> Potentially these sleeps can be for large durations and during which today,
> there is no periodic idle load balancing being done.
> 
> This patch nominates an owner among the idle cpus, which does the idle load
> balancing on behalf of the other idle cpus. And once all the cpus are
> completely idle, then we can stop this idle load balancing too. Checks added
> in fast path are minimized.  Whenever there are busy cpus in the system, there
> will be an owner(idle cpu) doing the system wide idle load balancing.

This is really ugly, sorry :(

My suggestion for handling this was to increase the maximum balance
interval for an idle CPU, and just implement a global shutdown when
the entire system goes idle.

The former should take care of the power savings issues for bare metal
hardware, and the latter should solve performance problems for many idle
SMP guests. It should take very little code to implement.
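The two suggested ideas could be prototyped along these lines (a toy model; the constants and names are invented, not taken from any patch):

```c
#include <assert.h>

#define HZ 1000

/*
 * Idea 1: an idle cpu doubles its balance interval up to a cap,
 * trading responsiveness for fewer wakeups.
 */
static unsigned long next_interval(unsigned long interval, int cpu_idle)
{
	const unsigned long max_idle_interval = 60 * HZ;

	if (!cpu_idle)
		return HZ / 10;          /* busy: back to a short period */
	interval *= 2;
	if (interval > max_idle_interval)
		interval = max_idle_interval;
	return interval;
}

/* Idea 2: skip load balancing entirely once the whole system is idle. */
static int balancing_needed(int nr_running_total)
{
	return nr_running_total > 0;
}
```

The open question, as discussed below, is whether the capped back-off reacts quickly enough to sudden load changes.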

If that approach doesn't cut it, then at least we can have some numbers
to see how much better yours is so we can justify including it.

If you are against my approach, then I can have a try at coding it up
if you like?


* Re: [patch 2/2] sched: dynticks idle load balancing - v2
  2007-02-22  3:26                       ` [patch 2/2] sched: dynticks idle load balancing - v2 Nick Piggin
@ 2007-02-22 22:33                         ` Siddha, Suresh B
  2007-02-23  3:43                           ` Nick Piggin
  0 siblings, 1 reply; 25+ messages in thread
From: Siddha, Suresh B @ 2007-02-22 22:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Siddha, Suresh B, mingo, linux-kernel, rostedt

On Thu, Feb 22, 2007 at 04:26:54AM +0100, Nick Piggin wrote:
> This is really ugly, sorry :(

hm. Myself and others thought it was a simple and nice idea.

> My suggestion for handling this was to increase the maximum balance
> interval for an idle CPU, and just implement a global shutdown when
> the entire system goes idle.
> 
> The former should take care of the power savings issues for bare metal
> hardware, and the latter should solve performance problems for many idle
> SMP guests. It should take very little code to implement.

Coming up with the right max balance interval will be challenging. It needs
to save power and at the same time respond to load changes fast enough.

> If that approach doesn't cut it, then at least we can have some numbers
> to see how much better yours is so we can justify including it.
> 
> If you are against my approach, then I can have a try at coding it up
> if you like?

Sure. If you can provide a patch, I will be glad to provide power and
performance comparison numbers with both approaches.

thanks,
suresh


* Re: [patch 2/2] sched: dynticks idle load balancing - v2
  2007-02-22 22:33                         ` Siddha, Suresh B
@ 2007-02-23  3:43                           ` Nick Piggin
  0 siblings, 0 replies; 25+ messages in thread
From: Nick Piggin @ 2007-02-23  3:43 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: mingo, linux-kernel, rostedt

On Thu, Feb 22, 2007 at 02:33:00PM -0800, Suresh B wrote:
> On Thu, Feb 22, 2007 at 04:26:54AM +0100, Nick Piggin wrote:
> > This is really ugly, sorry :(
> 
> hm. myself and others too thought it was a simple and nice idea.

The idea is not bad. I won't guarantee mine will be as good or better,
but I think it is sensible to try implementing the simplest approach
first, so we can get a baseline to justify more complexity against...

Your code just needs work, but if it really produces good results then
it should be able to be made into a mergeable patch.

> > My suggestion for handling this was to increase the maximum balance
> > interval for an idle CPU, and just implement a global shutdown when
> > the entire system goes idle.
> > 
> > The former should take care of the power savings issues for bare metal
> > hardware, and the latter should solve performance problems for many idle
> > SMP guests. It should take very little code to implement.
> 
> coming to max balance interval will be challenging. It needs to save
> power and at the same time respond to load changes fast enough.

Yep.

> > If that approach doesn't cut it, then at least we can have some numbers
> > to see how much better yours is so we can justify including it.
> > 
> > If you are against my approach, then I can have a try at coding it up
> > if you like?
> 
> Sure. If you can provide a patch, I will be glad to provide power and
> performance comparison numbers with both approaches.

OK that would be good. I'll see if I can code something up by next week.

Thanks,
Nick


* [patch] sched: dynticks idle load balancing - v3
  2007-02-21 20:23                       ` Andrew Morton
  2007-02-22  3:14                         ` Nick Piggin
@ 2007-02-24  2:01                         ` Siddha, Suresh B
  1 sibling, 0 replies; 25+ messages in thread
From: Siddha, Suresh B @ 2007-02-24  2:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Siddha, Suresh B, mingo, linux-kernel, npiggin, rostedt

On Wed, Feb 21, 2007 at 12:23:44PM -0800, Andrew Morton wrote:
> I spose I'll hold my nose and merge this, but it creates too much of a mess
> to be mergeable into the CPU scheduler, IMO.

Cleaned up patch appended. Please drop sched-dynticks-idle-load-balancing-v2.patch
and include this instead.

thanks,
suresh
---

Fix periodic idle load balancing in the presence of dynticks.
cpus whose ticks are stopped will sleep until the next event wakes them up.
These sleeps can potentially be of large durations, during which no periodic
idle load balancing is done today.

This patch nominates an owner among the idle cpus, which does the idle load
balancing on behalf of the other idle cpus. Once all the cpus are completely
idle, this idle load balancing is stopped too. Checks added to the fast path
are minimized. Whenever there are busy cpus in the system, there will be an
owner (idle cpu) doing the system-wide idle load balancing.

Open items:
1. Intelligent owner selection (like an idle core in a busy package).
2. Merge with rcu's nohz_cpu_mask?

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff -pNru linus-2.6/include/linux/sched.h linux/include/linux/sched.h
--- linus-2.6/include/linux/sched.h	2007-02-22 15:25:16.000000000 -0800
+++ linux/include/linux/sched.h	2007-02-22 15:27:01.000000000 -0800
@@ -194,6 +194,14 @@ extern void sched_init_smp(void);
 extern void init_idle(struct task_struct *idle, int cpu);
 
 extern cpumask_t nohz_cpu_mask;
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ)
+extern int select_nohz_load_balancer(int cpu);
+#else
+static inline int select_nohz_load_balancer(int cpu)
+{
+	return 0;
+}
+#endif
 
 /*
  * Only dump TASK_* tasks. (-1 for all tasks)
diff -pNru linus-2.6/kernel/sched.c linux/kernel/sched.c
--- linus-2.6/kernel/sched.c	2007-02-23 16:58:48.000000000 -0800
+++ linux/kernel/sched.c	2007-02-23 18:32:23.000000000 -0800
@@ -224,6 +224,9 @@ struct rq {
 #ifdef CONFIG_SMP
 	unsigned long cpu_load[3];
 	unsigned char idle_at_tick;
+#ifdef CONFIG_NO_HZ
+	unsigned char in_nohz_recently;
+#endif
 #endif
 	unsigned long long nr_switches;
 
@@ -1050,6 +1053,17 @@ static void resched_task(struct task_str
 	if (!tsk_is_polling(p))
 		smp_send_reschedule(cpu);
 }
+
+static void resched_cpu(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+
+	if (!spin_trylock_irqsave(&rq->lock, flags))
+		return;
+	resched_task(cpu_curr(cpu));
+	spin_unlock_irqrestore(&rq->lock, flags);
+}
 #else
 static inline void resched_task(struct task_struct *p)
 {
@@ -2658,6 +2672,12 @@ redo:
 		double_rq_unlock(this_rq, busiest);
 		local_irq_restore(flags);
 
+		/*
+		 * some other cpu did the load balance for us.
+		 */
+		if (nr_moved && this_cpu != smp_processor_id())
+			resched_cpu(this_cpu);
+
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(all_pinned)) {
 			cpu_clear(cpu_of(busiest), cpus);
@@ -2928,27 +2948,98 @@ static void update_load(struct rq *this_
 	}
 }
 
+#ifdef CONFIG_NO_HZ
+static struct {
+	atomic_t load_balancer;
+	cpumask_t  cpu_mask;
+} nohz ____cacheline_aligned = {
+	.load_balancer = ATOMIC_INIT(-1),
+	.cpu_mask = CPU_MASK_NONE,
+};
+
 /*
- * run_rebalance_domains is triggered when needed from the scheduler tick.
+ * This routine tries to nominate an ilb (idle load balancing)
+ * owner among the cpus whose ticks are stopped. The ilb owner does the
+ * idle load balancing on behalf of all those cpus. If all the cpus in
+ * the system go into this tickless mode, then there is no ilb owner
+ * (as there is no need for one) and all the cpus sleep till the next
+ * wakeup event arrives...
+ *
+ * The ilb owner's tick is not stopped, and that tick is used for the
+ * idle load balancing. The ilb owner still remains part of
+ * nohz.cpu_mask.
+ *
+ * While stopping its tick, a cpu becomes the ilb owner if there is
+ * no other owner, and it remains the owner until it becomes busy or
+ * until all cpus in the system stop their ticks, at which point
+ * there is no need for an ilb owner.
  *
+ * When the ilb owner becomes busy, it nominates another owner during
+ * the next busy scheduler_tick().
+ */
+int select_nohz_load_balancer(int stop_tick)
+{
+	int cpu = smp_processor_id();
+
+	if (stop_tick) {
+		cpu_set(cpu, nohz.cpu_mask);
+		cpu_rq(cpu)->in_nohz_recently = 1;
+
+		/*
+		 * If we are going offline and still the leader, give up!
+		 */
+		if (cpu_is_offline(cpu) &&
+		    atomic_read(&nohz.load_balancer) == cpu) {
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+				BUG();
+			return 0;
+		}
+
+		/* time for the ilb owner to sleep as well */
+		if (cpus_weight(nohz.cpu_mask) == num_online_cpus()) {
+			if (atomic_read(&nohz.load_balancer) == cpu)
+				atomic_set(&nohz.load_balancer, -1);
+			return 0;
+		}
+
+		if (atomic_read(&nohz.load_balancer) == -1) {
+			/* make me the ilb owner */
+			if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
+				return 1;
+		} else if (atomic_read(&nohz.load_balancer) == cpu)
+			return 1;
+	} else {
+		if (!cpu_isset(cpu, nohz.cpu_mask))
+			return 0;
+
+		cpu_clear(cpu, nohz.cpu_mask);
+
+		if (atomic_read(&nohz.load_balancer) == cpu)
+			if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
+				BUG();
+	}
+	return 0;
+}
+#endif
+
+static DEFINE_SPINLOCK(balancing);
+
+/*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
  *
  * Balancing parameters are set up in arch_init_sched_domains.
  */
-static DEFINE_SPINLOCK(balancing);
-
-static void run_rebalance_domains(struct softirq_action *h)
+static inline void rebalance_domains(int cpu, enum idle_type idle)
 {
-	int this_cpu = smp_processor_id(), balance = 1;
-	struct rq *this_rq = cpu_rq(this_cpu);
+	int balance = 1;
+	struct rq *rq = cpu_rq(cpu);
 	unsigned long interval;
 	struct sched_domain *sd;
-	enum idle_type idle = this_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
-	/* Earliest time when we have to call run_rebalance_domains again */
+	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 
-	for_each_domain(this_cpu, sd) {
+	for_each_domain(cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
@@ -2967,7 +3058,7 @@ static void run_rebalance_domains(struct
 		}
 
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
-			if (load_balance(this_cpu, this_rq, sd, idle, &balance)) {
+			if (load_balance(cpu, rq, sd, idle, &balance)) {
 				/*
 				 * We've pulled tasks over so either we're no
 				 * longer idle, or one of our SMT siblings is
@@ -2991,7 +3082,52 @@ out:
 		if (!balance)
 			break;
 	}
-	this_rq->next_balance = next_balance;
+	rq->next_balance = next_balance;
+}
+
+/*
+ * run_rebalance_domains is triggered when needed from the scheduler tick.
+ * In the CONFIG_NO_HZ case, the idle load balance owner also does the
+ * rebalancing for all the cpus whose scheduler ticks are stopped.
+ */
+static void run_rebalance_domains(struct softirq_action *h)
+{
+	int local_cpu = smp_processor_id();
+	struct rq *local_rq = cpu_rq(local_cpu);
+	enum idle_type idle = local_rq->idle_at_tick ? SCHED_IDLE : NOT_IDLE;
+
+	rebalance_domains(local_cpu, idle);
+
+#ifdef CONFIG_NO_HZ
+	/*
+	 * If this cpu is the owner for idle load balancing, then do the
+	 * balancing on behalf of the other idle cpus whose ticks are
+	 * stopped.
+	 */
+	if (local_rq->idle_at_tick &&
+	    atomic_read(&nohz.load_balancer) == local_cpu) {
+		cpumask_t cpus = nohz.cpu_mask;
+		struct rq *rq;
+		int balance_cpu;
+
+		cpu_clear(local_cpu, cpus);
+		for_each_cpu_mask(balance_cpu, cpus) {
+			/*
+			 * If this cpu gets work to do, stop the load balancing
+			 * work being done for other cpus. Next load
+			 * balancing owner will pick it up.
+			 */
+			if (need_resched())
+				break;
+
+			rebalance_domains(balance_cpu, SCHED_IDLE);
+
+			rq = cpu_rq(balance_cpu);
+			if (time_after(local_rq->next_balance, rq->next_balance))
+				local_rq->next_balance = rq->next_balance;
+		}
+	}
+#endif
 }
 #else
 /*
@@ -3220,6 +3356,68 @@ out_unlock:
 }
 
 /*
+ * Trigger the SCHED_SOFTIRQ if it is time to do periodic load balancing.
+ *
+ * In the CONFIG_NO_HZ case, this is also where we nominate a new idle
+ * load balancing owner, or decide to stop the periodic load balancing
+ * if the whole system is idle.
+ */
+static inline void trigger_load_balance(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+#ifdef CONFIG_NO_HZ
+	/*
+	 * If we were in nohz mode recently and are busy at the current
+	 * scheduler tick, then check if we need to nominate a new idle
+	 * load balancer.
+	 */
+	if (rq->in_nohz_recently && !rq->idle_at_tick) {
+		rq->in_nohz_recently = 0;
+
+		if (atomic_read(&nohz.load_balancer) == cpu) {
+			cpu_clear(cpu, nohz.cpu_mask);
+			atomic_set(&nohz.load_balancer, -1);
+		}
+
+		if (atomic_read(&nohz.load_balancer) == -1) {
+			/*
+			 * simple selection for now: Nominate the
+			 * first cpu in the nohz list to be the next
+			 * ilb owner.
+			 *
+			 * TBD: Traverse the sched domains and nominate
+			 * the nearest cpu in the nohz.cpu_mask.
+			 */
+			int ilb = first_cpu(nohz.cpu_mask);
+
+			if (ilb != NR_CPUS)
+				resched_cpu(ilb);
+		}
+	}
+
+	/*
+	 * If this cpu is idle and doing idle load balancing for all the
+	 * cpus with ticks stopped, is it time for that to stop?
+	 */
+	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) == cpu &&
+	    cpus_weight(nohz.cpu_mask) == num_online_cpus()) {
+		resched_cpu(cpu);
+		return;
+	}
+
+	/*
+	 * If this cpu is idle and the idle load balancing is done by
+	 * someone else, then there is no need to raise the SCHED_SOFTIRQ.
+	 */
+	if (rq->idle_at_tick && atomic_read(&nohz.load_balancer) != cpu &&
+	    cpu_isset(cpu, nohz.cpu_mask))
+		return;
+#endif
+	if (time_after_eq(jiffies, rq->next_balance))
+		raise_softirq(SCHED_SOFTIRQ);
+}
+
+/*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
  *
@@ -3244,8 +3442,8 @@ void scheduler_tick(void)
 #ifdef CONFIG_SMP
 	rq->idle_at_tick = idle_at_tick;
 	update_load(rq);
-	if (time_after_eq(jiffies, rq->next_balance))
-		raise_softirq(SCHED_SOFTIRQ);
+
+	trigger_load_balance(cpu);
 #endif
 }
 
diff -pNru linus-2.6/kernel/time/tick-sched.c linux/kernel/time/tick-sched.c
--- linus-2.6/kernel/time/tick-sched.c	2007-02-22 15:25:17.000000000 -0800
+++ linux/kernel/time/tick-sched.c	2007-02-23 17:01:14.000000000 -0800
@@ -215,6 +215,14 @@ void tick_nohz_stop_sched_tick(void)
 		 * the scheduler tick in nohz_restart_sched_tick.
 		 */
 		if (!ts->tick_stopped) {
+			if (select_nohz_load_balancer(1)) {
+				/*
+				 * sched tick not stopped!
+				 */
+				cpu_clear(cpu, nohz_cpu_mask);
+				goto out;
+			}
+
 			ts->idle_tick = ts->sched_timer.expires;
 			ts->tick_stopped = 1;
 			ts->idle_jiffies = last_jiffies;
@@ -271,6 +279,7 @@ void tick_nohz_restart_sched_tick(void)
 	now = ktime_get();
 
 	local_irq_disable();
+	select_nohz_load_balancer(0);
 	tick_do_update_jiffies64(now);
 	cpu_clear(cpu, nohz_cpu_mask);
 

Thread overview: 25+ messages
2006-12-11 23:53 [RFC] Patch: dynticks: idle load balancing Siddha, Suresh B
2006-12-13 22:43 ` Ingo Molnar
2006-12-13 23:13   ` Ingo Molnar
2006-12-13 23:03     ` Siddha, Suresh B
2006-12-13 23:31       ` Ingo Molnar
2006-12-13 23:19         ` Siddha, Suresh B
2006-12-14  0:34           ` Ingo Molnar
2006-12-19 20:12           ` Ingo Molnar
2006-12-19 21:12             ` Siddha, Suresh B
2007-01-16 11:35               ` Ingo Molnar
2007-01-30 21:57                 ` Siddha, Suresh B
2007-02-07 22:19                   ` Siddha, Suresh B
2007-02-17  2:03                   ` [patch 1/2] sched: fix idle load balancing in softirqd context Siddha, Suresh B
2007-02-17  2:08                     ` [patch 2/2] sched: dynticks idle load balancing - v2 Siddha, Suresh B
2007-02-21 20:23                       ` Andrew Morton
2007-02-22  3:14                         ` Nick Piggin
2007-02-24  2:01                         ` [patch] sched: dynticks idle load balancing - v3 Siddha, Suresh B
2007-02-22  3:26                       ` [patch 2/2] sched: dynticks idle load balancing - v2 Nick Piggin
2007-02-22 22:33                         ` Siddha, Suresh B
2007-02-23  3:43                           ` Nick Piggin
2007-02-17 14:42                     ` [patch 1/2] sched: fix idle load balancing in softirqd context Steven Rostedt
2007-02-21  6:25                       ` Siddha, Suresh B
2007-02-21 20:13                     ` Andrew Morton
2006-12-13 23:48     ` [RFC] Patch: dynticks: idle load balancing Ingo Molnar
2006-12-20  0:49 ` Steven Rostedt
