All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
@ 2011-10-03 22:09 Suresh Siddha
  2011-10-03 22:09 ` [patch v2 2/2] sched: Request for idle balance during nohz idle load balance Suresh Siddha
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Suresh Siddha @ 2011-10-03 22:09 UTC (permalink / raw)
  To: Peter Zijlstra, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava
  Cc: linux-kernel, Suresh Siddha, stable

[-- Attachment #1: sched-use_resched_ipi_to_kick_off_the_nohz_idle.patch --]
[-- Type: text/plain, Size: 4976 bytes --]

Current use of smp call function to kick the nohz idle balance can deadlock
in this scenario.

1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single
data (csd) to the call single queue, cpu-A took a timer interrupt.  Actual IPI
to cpu-B to process the call single queue is not yet sent.

2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
and __smp_call_function_single() with nowait will queue the csd to the
cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
as the call single queue was not empty.

3. cpu-A is busy with lot of interrupts

4. Meanwhile cpu-B is entering and exiting idle and noticed that it has
it's rq->nohz_balance_kick set to '1'. So it will go ahead and do the
idle load balancer and clear its rq->nohz_balance_kick.

5. At this point, csd queued as part of the step-2 above is still locked
and waiting to be serviced on cpu-B.

6. cpu-A is still busy with interrupt load and now it got another timer
interrupt and as part of it decided to kick cpu-B for another idle load
balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
above) and does __smp_call_function_single() with the same csd that is
still locked.

7. And we get a deadlock waiting for the csd_lock() in the
__smp_call_function_single().

Main issue here is that cpu-B can service the idle load balancer kick
request from cpu-A even with out receiving the IPI and this lead to
doing multiple __smp_call_function_single() on the same csd leading to
deadlock.

To kick a cpu, scheduler already has the reschedule vector reserved. Use
that mechanism (kick_process()) instead of using the generic smp call function
mechanism to kick off the nohz idle load balancing and avoid the deadlock.

   [ This issue is present from 2.6.35+ kernels, but marking it -stable
     only from v3.0+ as the proposed fix depends on the scheduler_ipi()
     that is introduced recently. ]

Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org # v3.0+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |   21 +++++++++++++++++++--
 kernel/sched_fair.c |   29 +++++++++--------------------
 2 files changed, 28 insertions(+), 22 deletions(-)
Index: linux-2.6-tip/kernel/sched.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched.c
+++ linux-2.6-tip/kernel/sched.c
@@ -1404,6 +1404,18 @@ void wake_up_idle_cpu(int cpu)
 		smp_send_reschedule(cpu);
 }
 
+static inline bool got_nohz_idle_kick(void)
+{
+	return idle_cpu(smp_processor_id()) && this_rq()->nohz_balance_kick;
+}
+
+#else /* CONFIG_NO_HZ */
+
+static inline bool got_nohz_idle_kick(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_NO_HZ */
 
 static u64 sched_avg_period(void)
@@ -2717,7 +2729,7 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
-	if (llist_empty(&this_rq()->wake_list))
+	if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
 		return;
 
 	/*
@@ -2735,6 +2747,12 @@ void scheduler_ipi(void)
 	 */
 	irq_enter();
 	sched_ttwu_pending();
+
+	/*
+	 * Check if someone kicked us for doing the nohz idle load balance.
+	 */
+	if (unlikely(got_nohz_idle_kick() && !need_resched()))
+		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	irq_exit();
 }
 
@@ -8280,7 +8298,6 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
-		init_sched_softirq_csd(&per_cpu(remote_sched_softirq_cb, i));
 #endif
 #endif
 		init_rq_hrtick(rq);
Index: linux-2.6-tip/kernel/sched_fair.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched_fair.c
+++ linux-2.6-tip/kernel/sched_fair.c
@@ -4269,22 +4269,6 @@ out_unlock:
 }
 
 #ifdef CONFIG_NO_HZ
-
-static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
-
-static void trigger_sched_softirq(void *data)
-{
-	raise_softirq_irqoff(SCHED_SOFTIRQ);
-}
-
-static inline void init_sched_softirq_csd(struct call_single_data *csd)
-{
-	csd->func = trigger_sched_softirq;
-	csd->info = NULL;
-	csd->flags = 0;
-	csd->priv = 0;
-}
-
 /*
  * idle load balancing details
  * - One of the idle CPUs nominates itself as idle load_balancer, while
@@ -4450,11 +4434,16 @@ static void nohz_balancer_kick(int cpu)
 	}
 
 	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
-		struct call_single_data *cp;
-
 		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
-		cp = &per_cpu(remote_sched_softirq_cb, cpu);
-		__smp_call_function_single(ilb_cpu, cp, 0);
+
+		smp_mb();
+		/*
+		 * Use smp_send_reschedule() instead of resched_cpu().
+		 * This way we generate a sched IPI on the target cpu which
+		 * is idle. And the softirq performing nohz idle load balance
+		 * will be run before returning from the IPI.
+		 */
+		smp_send_reschedule(ilb_cpu);
 	}
 	return;
 }



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [patch v2 2/2] sched: Request for idle balance during nohz idle load balance
  2011-10-03 22:09 [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Suresh Siddha
@ 2011-10-03 22:09 ` Suresh Siddha
  2011-10-04  9:05 ` [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Peter Zijlstra
  2011-10-06 17:27 ` Vivek Goyal
  2 siblings, 0 replies; 9+ messages in thread
From: Suresh Siddha @ 2011-10-03 22:09 UTC (permalink / raw)
  To: Peter Zijlstra, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava
  Cc: linux-kernel, Suresh Siddha

[-- Attachment #1: sched-request_for_idle_balance_during_nohz_idle_load.patch --]
[-- Type: text/plain, Size: 2742 bytes --]

rq's idle_at_tick is set to idle/busy during the timer tick
depending on the cpu was idle or not. This will be used later in the load
balance that will be done in the softirq context (which is a process
context in -RT kernels).

For nohz kernels, for the cpu doing nohz idle load balance on behalf of
all the idle cpu's, its rq->idle_at_tick might have a stale value (which is
recorded when it got the timer tick presumably when it is busy).

As the nohz idle load balancing is also being done at the same place
as the regular load balancing, nohz idle load balancing was bailing out
when it sees rq's idle_at_tick not set.

Thus leading to poor system utilization.

Rename rq's idle_at_tick to idle_balance and set it when someone requests
for nohz idle balance on an idle cpu.

Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    8 +++++---
 kernel/sched_fair.c |    4 ++--
 2 files changed, 7 insertions(+), 5 deletions(-)
Index: linux-2.6-tip/kernel/sched.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched.c
+++ linux-2.6-tip/kernel/sched.c
@@ -644,7 +644,7 @@ struct rq {
 
 	unsigned long cpu_power;
 
-	unsigned char idle_at_tick;
+	unsigned char idle_balance;
 	/* For active balancing */
 	int post_schedule;
 	int active_balance;
@@ -2751,8 +2751,10 @@ void scheduler_ipi(void)
 	/*
 	 * Check if someone kicked us for doing the nohz idle load balance.
 	 */
-	if (unlikely(got_nohz_idle_kick() && !need_resched()))
+	if (unlikely(got_nohz_idle_kick() && !need_resched())) {
+		this_rq()->idle_balance = 1;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
+	}
 	irq_exit();
 }
 
@@ -4247,7 +4249,7 @@ void scheduler_tick(void)
 	perf_event_task_tick();
 
 #ifdef CONFIG_SMP
-	rq->idle_at_tick = idle_cpu(cpu);
+	rq->idle_balance = idle_cpu(cpu);
 	trigger_load_balance(rq, cpu);
 #endif
 }
Index: linux-2.6-tip/kernel/sched_fair.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched_fair.c
+++ linux-2.6-tip/kernel/sched_fair.c
@@ -4676,7 +4676,7 @@ static inline int nohz_kick_needed(struc
 	if (time_before(now, nohz.next_balance))
 		return 0;
 
-	if (rq->idle_at_tick)
+	if (idle_cpu(cpu))
 		return 0;
 
 	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
@@ -4712,7 +4712,7 @@ static void run_rebalance_domains(struct
 {
 	int this_cpu = smp_processor_id();
 	struct rq *this_rq = cpu_rq(this_cpu);
-	enum cpu_idle_type idle = this_rq->idle_at_tick ?
+	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
 	rebalance_domains(this_cpu, idle);



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
  2011-10-03 22:09 [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Suresh Siddha
  2011-10-03 22:09 ` [patch v2 2/2] sched: Request for idle balance during nohz idle load balance Suresh Siddha
@ 2011-10-04  9:05 ` Peter Zijlstra
  2011-10-04 18:03   ` Suresh Siddha
  2011-10-06 17:27 ` Vivek Goyal
  2 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2011-10-04  9:05 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable

On Mon, 2011-10-03 at 15:09 -0700, Suresh Siddha wrote:
> +               smp_mb();
> +               /*
> +                * Use smp_send_reschedule() instead of resched_cpu().
> +                * This way we generate a sched IPI on the target cpu which
> +                * is idle. And the softirq performing nohz idle load balance
> +                * will be run before returning from the IPI.
> +                */
> +               smp_send_reschedule(ilb_cpu); 

Now the only thing missing is where the matching barrier is, does
receiving an interrupt imply a mb? If so that wants mentioning
somewhere, otherwise people will wonder how we can ever read the right
value.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
  2011-10-04  9:05 ` [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Peter Zijlstra
@ 2011-10-04 18:03   ` Suresh Siddha
  0 siblings, 0 replies; 9+ messages in thread
From: Suresh Siddha @ 2011-10-04 18:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable

On Tue, 2011-10-04 at 02:05 -0700, Peter Zijlstra wrote:
> On Mon, 2011-10-03 at 15:09 -0700, Suresh Siddha wrote:
> > +               smp_mb();
> > +               /*
> > +                * Use smp_send_reschedule() instead of resched_cpu().
> > +                * This way we generate a sched IPI on the target cpu which
> > +                * is idle. And the softirq performing nohz idle load balance
> > +                * will be run before returning from the IPI.
> > +                */
> > +               smp_send_reschedule(ilb_cpu); 
> 
> Now the only thing missing is where the matching barrier is, does
> receiving an interrupt imply a mb? If so that wants mentioning
> somewhere, otherwise people will wonder how we can ever read the right
> value.

This is similar to resched_task()? Isn't it?

I am expecting the interrupt handlers will be already having the
necessary memory barriers, if the arch requires it.

thanks,
suresh



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
  2011-10-03 22:09 [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Suresh Siddha
  2011-10-03 22:09 ` [patch v2 2/2] sched: Request for idle balance during nohz idle load balance Suresh Siddha
  2011-10-04  9:05 ` [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Peter Zijlstra
@ 2011-10-06 17:27 ` Vivek Goyal
  2011-10-08  0:09   ` Suresh Siddha
  2 siblings, 1 reply; 9+ messages in thread
From: Vivek Goyal @ 2011-10-06 17:27 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, Suresh Siddha, stable

On Mon, Oct 03, 2011 at 03:09:00PM -0700, Suresh Siddha wrote:
[..]
> 
>    [ This issue is present from 2.6.35+ kernels, but marking it -stable
>      only from v3.0+ as the proposed fix depends on the scheduler_ipi()
>      that is introduced recently. ]
> 

Hi Suresh,

Are you planning to fix this issue for older kernels too? I am wondering
how to go about fixing it there.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
  2011-10-06 17:27 ` Vivek Goyal
@ 2011-10-08  0:09   ` Suresh Siddha
  2011-10-10 13:52     ` Vivek Goyal
  0 siblings, 1 reply; 9+ messages in thread
From: Suresh Siddha @ 2011-10-08  0:09 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable

On Thu, 2011-10-06 at 10:27 -0700, Vivek Goyal wrote:
> On Mon, Oct 03, 2011 at 03:09:00PM -0700, Suresh Siddha wrote:
> [..]
> > 
> >    [ This issue is present from 2.6.35+ kernels, but marking it -stable
> >      only from v3.0+ as the proposed fix depends on the scheduler_ipi()
> >      that is introduced recently. ]
> > 
> 
> Hi Suresh,
> 
> Are you planning to fix this issue for older kernels too? I am wondering
> how to go about fixing it there.
> 

Vivek, Initially when Prarit brought this issue to me, I gave an ugly
quick fix (Appended) for a quick try. But I would recommend we back port
the couple of mainline patches that introduced scheduler_ipi() code
along with these two patches. Thoughts?

---
Quick and dirty fix for the deadlock caused by the nohz balance logic.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..f48b950 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3618,6 +3618,7 @@ static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
 static void trigger_sched_softirq(void *data)
 {
 	raise_softirq_irqoff(SCHED_SOFTIRQ);
+	this_rq()->nohz_balance_kick = 2;
 }
 
 static inline void init_sched_softirq_csd(struct call_single_data *csd)
@@ -3977,7 +3978,7 @@ static void nohz_idle_balance(int this_cpu, enum cpu_idle_type idle)
 	struct rq *rq;
 	int balance_cpu;
 
-	if (idle != CPU_IDLE || !this_rq->nohz_balance_kick)
+	if (idle != CPU_IDLE || (this_rq->nohz_balance_kick != 2))
 		return;
 
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
  2011-10-08  0:09   ` Suresh Siddha
@ 2011-10-10 13:52     ` Vivek Goyal
  2011-10-10 14:06       ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Vivek Goyal @ 2011-10-10 13:52 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Peter Zijlstra, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable

On Fri, Oct 07, 2011 at 05:09:46PM -0700, Suresh Siddha wrote:
> On Thu, 2011-10-06 at 10:27 -0700, Vivek Goyal wrote:
> > On Mon, Oct 03, 2011 at 03:09:00PM -0700, Suresh Siddha wrote:
> > [..]
> > > 
> > >    [ This issue is present from 2.6.35+ kernels, but marking it -stable
> > >      only from v3.0+ as the proposed fix depends on the scheduler_ipi()
> > >      that is introduced recently. ]
> > > 
> > 
> > Hi Suresh,
> > 
> > Are you planning to fix this issue for older kernels too? I am wondering
> > how to go about fixing it there.
> > 
> 
> Vivek, Initially when Prarit brought this issue to me, I gave an ugly
> quick fix (Appended) for a quick try. But I would recommend we back port
> the couple of mainline patches that introduced scheduler_ipi() code
> along with these two patches. Thoughts?

I agree that backporting scheduler_ipi() patches makes more sense.
Browsing through the code looks like we need just one commit which
introduced scheduler_ipi().

commit 184748cc50b2dceb8287f9fb657eda48ff8fcfe7
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Tue Apr 5 17:23:39 2011 +0200

    sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()

I did partial backport of above to come up with a fix for rhel6 and put
your first patch on top of it. 

Once your current patches get merged, I can backport above and two of
your patches for older kernel is there is interest.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
  2011-10-10 13:52     ` Vivek Goyal
@ 2011-10-10 14:06       ` Peter Zijlstra
  2011-10-10 15:25         ` Vivek Goyal
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2011-10-10 14:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Suresh Siddha, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable

On Mon, 2011-10-10 at 09:52 -0400, Vivek Goyal wrote:

> I did partial backport of above to come up with a fix for rhel6 and put
> your first patch on top of it. 

You might also want to look at these for PPC64:

23d72bfd8f9f24aa9efafed3586a99f5669c23d7
880102e78547c1db158a17e36cf0cdd98e7ad710

> Once your current patches get merged, I can backport above and two of
> your patches for older kernel is there is interest.

I think they're in current tip/master already.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance
  2011-10-10 14:06       ` Peter Zijlstra
@ 2011-10-10 15:25         ` Vivek Goyal
  0 siblings, 0 replies; 9+ messages in thread
From: Vivek Goyal @ 2011-10-10 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suresh Siddha, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable, Benjamin Herrenschmidt

On Mon, Oct 10, 2011 at 04:06:04PM +0200, Peter Zijlstra wrote:
> On Mon, 2011-10-10 at 09:52 -0400, Vivek Goyal wrote:
> 
> > I did partial backport of above to come up with a fix for rhel6 and put
> > your first patch on top of it. 
> 
> You might also want to look at these for PPC64:
> 
> 23d72bfd8f9f24aa9efafed3586a99f5669c23d7

This seems to be just cleanup of the way IPI are handled in ppc64
(mux/demux). I don't understand the details but looks like not a
necessary component of the problem at hand.

> 880102e78547c1db158a17e36cf0cdd98e7ad710

Here Benjamin talks about following.

    Manual merge of arch/powerpc/kernel/smp.c and add missing scheduler_ipi()
    call to arch/powerpc/platforms/cell/interrupt.c

And I can't find any scheduler_ipi() call in current cell/interrupt.c. So
either it got moved or something else happend. CCing him to find out. 

Thanks
Vivek

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2011-10-10 15:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-03 22:09 [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Suresh Siddha
2011-10-03 22:09 ` [patch v2 2/2] sched: Request for idle balance during nohz idle load balance Suresh Siddha
2011-10-04  9:05 ` [patch v2 1/2] sched: Use resched IPI to kick off the nohz idle balance Peter Zijlstra
2011-10-04 18:03   ` Suresh Siddha
2011-10-06 17:27 ` Vivek Goyal
2011-10-08  0:09   ` Suresh Siddha
2011-10-10 13:52     ` Vivek Goyal
2011-10-10 14:06       ` Peter Zijlstra
2011-10-10 15:25         ` Vivek Goyal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.