All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch 1/2] sched: use resched IPI to kick off the nohz idle balance
@ 2011-09-29 22:30 Suresh Siddha
  2011-09-29 22:30 ` [patch 2/2] sched: request for idle balance during nohz idle load balance Suresh Siddha
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Suresh Siddha @ 2011-09-29 22:30 UTC (permalink / raw)
  To: Peter Zijlstra, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava
  Cc: linux-kernel, Suresh Siddha, stable

[-- Attachment #1: kick.patch --]
[-- Type: text/plain, Size: 4643 bytes --]

Current use of smp call function to kick the nohz idle balance can deadlock
in this scenario.

1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single
data (csd) to the call single queue, cpu-A took a timer interrupt.  Actual IPI
to cpu-B to process the call single queue is not yet sent.

2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
and __smp_call_function_single() with nowait will queue the csd to the
cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
as the call single queue was not empty.

3. cpu-A is busy with lot of interrupts

4. Meanwhile cpu-B is entering and exiting idle and noticed that it has
it's rq->nohz_balance_kick set to '1'. So it will go ahead and do the
idle load balancer and clear its rq->nohz_balance_kick.

5. At this point, csd queued as part of the step-2 above is still locked
and waiting to be serviced on cpu-B.

6. cpu-A is still busy with interrupt load and now it got another timer
interrupt and as part of it decided to kick cpu-B for another idle load
balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
above) and does __smp_call_function_single() with the same csd that is
still locked.

7. And we get a deadlock waiting for the csd_lock() in the
__smp_call_function_single().

Main issue here is that cpu-B can service the idle load balancer kick
request from cpu-A even with out receiving the IPI and this lead to
doing multiple __smp_call_function_single() on the same csd leading to
deadlock.

To kick a cpu, scheduler already has the reschedule vector reserved. Use
that mechanism (kick_process()) instead of using the generic smp call function
mechanism to kick off the nohz idle load balancing and avoid the deadlock.

   [ This issue is present from 2.6.35+ kernels, but marking it -stable
     only from v3.0+ as the proposed fix depends on the scheduler_ipi()
     that is introduced recently. ]

Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org	# v3.0+
---
 kernel/sched.c      |   14 +++++++++++---
 kernel/sched_fair.c |   27 +++++++--------------------
 2 files changed, 18 insertions(+), 23 deletions(-)

Index: linux-2.6-tip/kernel/sched.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched.c
+++ linux-2.6-tip/kernel/sched.c
@@ -2733,7 +2733,7 @@ void scheduler_ipi(void)
 	struct rq *rq = this_rq();
 	struct task_struct *list = xchg(&rq->wake_list, NULL);
 
-	if (!list)
+	if (!list && !idle_cpu(cpu_of(rq)))
 		return;
 
 	/*
@@ -2750,7 +2750,16 @@ void scheduler_ipi(void)
 	 * somewhat pessimize the simple resched case.
 	 */
 	irq_enter();
-	sched_ttwu_do_pending(list);
+
+	if (list)
+		sched_ttwu_do_pending(list);
+
+	/*
+ 	 * Check if someone kicked us for doing the nohz idle load balance.
+ 	 */
+	if (unlikely((rq->idle == current) && rq->nohz_balance_kick &&
+		     !need_resched()))
+		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	irq_exit();
 }
 
@@ -8303,7 +8312,6 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
-		init_sched_softirq_csd(&per_cpu(remote_sched_softirq_cb, i));
 #endif
 #endif
 		init_rq_hrtick(rq);
Index: linux-2.6-tip/kernel/sched_fair.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched_fair.c
+++ linux-2.6-tip/kernel/sched_fair.c
@@ -4269,22 +4269,6 @@ out_unlock:
 }
 
 #ifdef CONFIG_NO_HZ
-
-static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
-
-static void trigger_sched_softirq(void *data)
-{
-	raise_softirq_irqoff(SCHED_SOFTIRQ);
-}
-
-static inline void init_sched_softirq_csd(struct call_single_data *csd)
-{
-	csd->func = trigger_sched_softirq;
-	csd->info = NULL;
-	csd->flags = 0;
-	csd->priv = 0;
-}
-
 /*
  * idle load balancing details
  * - One of the idle CPUs nominates itself as idle load_balancer, while
@@ -4450,11 +4434,14 @@ static void nohz_balancer_kick(int cpu)
 	}
 
 	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
-		struct call_single_data *cp;
-
 		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
-		cp = &per_cpu(remote_sched_softirq_cb, cpu);
-		__smp_call_function_single(ilb_cpu, cp, 0);
+		/*
+		 * Use kick_process instead of resched_cpu.
+		 * This way we generate a sched IPI on the target cpu which
+		 * is idle. And the softirq performing nohz idle load balance
+		 * will be run before returning from the IPI.
+		 */
+		kick_process(idle_task(ilb_cpu));
 	}
 	return;
 }



^ permalink raw reply	[flat|nested] 5+ messages in thread

* [patch 2/2] sched: request for idle balance during nohz idle load balance
  2011-09-29 22:30 [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Suresh Siddha
@ 2011-09-29 22:30 ` Suresh Siddha
  2011-10-03 19:36 ` [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Peter Zijlstra
  2011-10-03 19:40 ` Peter Zijlstra
  2 siblings, 0 replies; 5+ messages in thread
From: Suresh Siddha @ 2011-09-29 22:30 UTC (permalink / raw)
  To: Peter Zijlstra, Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava
  Cc: linux-kernel, Suresh Siddha

[-- Attachment #1: request_idle_balance.patch --]
[-- Type: text/plain, Size: 2681 bytes --]

rq's idle_at_tick is set to idle/busy during the timer tick
depending on the cpu was idle or not. This will be used later in the load
balance that will be done in the softirq context (which is a process
context in -RT kernels).

For nohz kernels, for the cpu doing nohz idle load balance on behalf of
all the idle cpu's, its rq->idle_at_tick might have a stale value (which is
recorded when it got the timer tick presumably when it is busy).

As the nohz idle load balancing is also being done at the same place
as the regular load balancing, nohz idle load balancing was bailing out
when it sees rq's idle_at_tick not set.

Thus leading to poor system utilization.

Rename rq's idle_at_tick to idle_balance and set it when someone requests
for nohz idle balance on an idle cpu.

Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 kernel/sched.c      |    8 +++++---
 kernel/sched_fair.c |    4 ++--
 2 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6-tip/kernel/sched.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched.c
+++ linux-2.6-tip/kernel/sched.c
@@ -644,7 +644,7 @@ struct rq {
 
 	unsigned long cpu_power;
 
-	unsigned char idle_at_tick;
+	unsigned char idle_balance;
 	/* For active balancing */
 	int post_schedule;
 	int active_balance;
@@ -2758,8 +2758,10 @@ void scheduler_ipi(void)
  	 * Check if someone kicked us for doing the nohz idle load balance.
  	 */
 	if (unlikely((rq->idle == current) && rq->nohz_balance_kick &&
-		     !need_resched()))
+		     !need_resched())) {
+		rq->idle_balance = 1;
 		raise_softirq_irqoff(SCHED_SOFTIRQ);
+	}
 	irq_exit();
 }
 
@@ -4266,7 +4268,7 @@ void scheduler_tick(void)
 	perf_event_task_tick();
 
 #ifdef CONFIG_SMP
-	rq->idle_at_tick = idle_cpu(cpu);
+	rq->idle_balance = idle_cpu(cpu);
 	trigger_load_balance(rq, cpu);
 #endif
 }
Index: linux-2.6-tip/kernel/sched_fair.c
===================================================================
--- linux-2.6-tip.orig/kernel/sched_fair.c
+++ linux-2.6-tip/kernel/sched_fair.c
@@ -4674,7 +4674,7 @@ static inline int nohz_kick_needed(struc
 	if (time_before(now, nohz.next_balance))
 		return 0;
 
-	if (rq->idle_at_tick)
+	if (idle_cpu(cpu))
 		return 0;
 
 	first_pick_cpu = atomic_read(&nohz.first_pick_cpu);
@@ -4710,7 +4710,7 @@ static void run_rebalance_domains(struct
 {
 	int this_cpu = smp_processor_id();
 	struct rq *this_rq = cpu_rq(this_cpu);
-	enum cpu_idle_type idle = this_rq->idle_at_tick ?
+	enum cpu_idle_type idle = this_rq->idle_balance ?
 						CPU_IDLE : CPU_NOT_IDLE;
 
 	rebalance_domains(this_cpu, idle);



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch 1/2] sched: use resched IPI to kick off the nohz idle balance
  2011-09-29 22:30 [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Suresh Siddha
  2011-09-29 22:30 ` [patch 2/2] sched: request for idle balance during nohz idle load balance Suresh Siddha
@ 2011-10-03 19:36 ` Peter Zijlstra
  2011-10-03 21:13   ` Suresh Siddha
  2011-10-03 19:40 ` Peter Zijlstra
  2 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2011-10-03 19:36 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable

On Thu, 2011-09-29 at 15:30 -0700, Suresh Siddha wrote:

> ---
>  kernel/sched.c      |   14 +++++++++++---
>  kernel/sched_fair.c |   27 +++++++--------------------
>  2 files changed, 18 insertions(+), 23 deletions(-)
> 
> Index: linux-2.6-tip/kernel/sched.c
> ===================================================================
> --- linux-2.6-tip.orig/kernel/sched.c
> +++ linux-2.6-tip/kernel/sched.c
> @@ -2733,7 +2733,7 @@ void scheduler_ipi(void)
>  	struct rq *rq = this_rq();
>  	struct task_struct *list = xchg(&rq->wake_list, NULL);
>  
> -	if (!list)
> +	if (!list && !idle_cpu(cpu_of(rq)))
>  		return;

Why not make that !rq->nohz_balance_kick? (wrapped in a helper for !
CONFIG_NO_HZ)

>  	/*
> @@ -2750,7 +2750,16 @@ void scheduler_ipi(void)
>  	 * somewhat pessimize the simple resched case.
>  	 */
>  	irq_enter();
> -	sched_ttwu_do_pending(list);
> +
> +	if (list)
> +		sched_ttwu_do_pending(list);
> +
> +	/*
> + 	 * Check if someone kicked us for doing the nohz idle load balance.
> + 	 */
> +	if (unlikely((rq->idle == current) && rq->nohz_balance_kick &&
> +		     !need_resched()))
> +		raise_softirq_irqoff(SCHED_SOFTIRQ);

And make that: idle_cpu() && rq->nohz_balance_kick && !need_resched()

All wrapped in #ifdef CONFIG_NO_HZ goo?
> tself as idle load_balancer, while
> @@ -4450,11 +4434,14 @@ static void nohz_balancer_kick(int cpu)
>  	}
>  
>  	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
> -		struct call_single_data *cp;
> -
>  		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
> -		cp = &per_cpu(remote_sched_softirq_cb, cpu);
> -		__smp_call_function_single(ilb_cpu, cp, 0);
> +		/*
> +		 * Use kick_process instead of resched_cpu.
> +		 * This way we generate a sched IPI on the target cpu which
> +		 * is idle. And the softirq performing nohz idle load balance
> +		 * will be run before returning from the IPI.
> +		 */

Shouldn't we have a memory barrier of sorts before sending the IPI?

> +		kick_process(idle_task(ilb_cpu));
>  	}
>  	return;
>  }
> 
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch 1/2] sched: use resched IPI to kick off the nohz idle balance
  2011-09-29 22:30 [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Suresh Siddha
  2011-09-29 22:30 ` [patch 2/2] sched: request for idle balance during nohz idle load balance Suresh Siddha
  2011-10-03 19:36 ` [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Peter Zijlstra
@ 2011-10-03 19:40 ` Peter Zijlstra
  2 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2011-10-03 19:40 UTC (permalink / raw)
  To: Suresh Siddha
  Cc: Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable


Something like so in fact,.. except for the MB bits..

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1404,6 +1404,18 @@ void wake_up_idle_cpu(int cpu)
 		smp_send_reschedule(cpu);
 }
 
+static inline bool got_nohz_kick(void)
+{
+	return !!this_rq()->nohz_balance_kick;
+}
+
+#else /* CONFIG_NO_HZ */
+
+static inline bool got_nohz_kick(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_NO_HZ */
 
 static u64 sched_avg_period(void)
@@ -2717,7 +2729,7 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
-	if (llist_empty(&this_rq()->wake_list))
+	if (llist_empty(&this_rq()->wake_list) && got_nohz_kick())
 		return;
 
 	/*
@@ -2734,7 +2746,14 @@ void scheduler_ipi(void)
 	 * somewhat pessimize the simple resched case.
 	 */
 	irq_enter();
-	sched_ttwu_pending();
+	if (list)
+		sched_ttwu_do_pending(list);
+
+	/*
+ 	 * Check if someone kicked us for doing the nohz idle load balance.
+ 	 */
+	if (unlikely(got_nohz_kick() && idle_cpu(smp_processor_id()) && !need_resched()))
+		raise_softirq_irqoff(SCHED_SOFTIRQ);
 	irq_exit();
 }
 
@@ -8288,7 +8307,6 @@ void __init sched_init(void)
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ
 		rq->nohz_balance_kick = 0;
-		init_sched_softirq_csd(&per_cpu(remote_sched_softirq_cb, i));
 #endif
 #endif
 		init_rq_hrtick(rq);
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -4269,22 +4269,6 @@ static int active_load_balance_cpu_stop(
 }
 
 #ifdef CONFIG_NO_HZ
-
-static DEFINE_PER_CPU(struct call_single_data, remote_sched_softirq_cb);
-
-static void trigger_sched_softirq(void *data)
-{
-	raise_softirq_irqoff(SCHED_SOFTIRQ);
-}
-
-static inline void init_sched_softirq_csd(struct call_single_data *csd)
-{
-	csd->func = trigger_sched_softirq;
-	csd->info = NULL;
-	csd->flags = 0;
-	csd->priv = 0;
-}
-
 /*
  * idle load balancing details
  * - One of the idle CPUs nominates itself as idle load_balancer, while
@@ -4450,11 +4434,14 @@ static void nohz_balancer_kick(int cpu)
 	}
 
 	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
-		struct call_single_data *cp;
-
 		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
-		cp = &per_cpu(remote_sched_softirq_cb, cpu);
-		__smp_call_function_single(ilb_cpu, cp, 0);
+		/*
+		 * Use kick_process instead of resched_cpu.
+		 * This way we generate a sched IPI on the target cpu which
+		 * is idle. And the softirq performing nohz idle load balance
+		 * will be run before returning from the IPI.
+		 */
+		kick_process(idle_task(ilb_cpu));
 	}
 	return;
 }


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch 1/2] sched: use resched IPI to kick off the nohz idle balance
  2011-10-03 19:36 ` [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Peter Zijlstra
@ 2011-10-03 21:13   ` Suresh Siddha
  0 siblings, 0 replies; 5+ messages in thread
From: Suresh Siddha @ 2011-10-03 21:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Venki Pallipadi, Ingo Molnar,
	Prarit Bhargava, linux-kernel, stable

On Mon, 2011-10-03 at 12:36 -0700, Peter Zijlstra wrote:
> On Thu, 2011-09-29 at 15:30 -0700, Suresh Siddha wrote:
> 
> > ---
> >  kernel/sched.c      |   14 +++++++++++---
> >  kernel/sched_fair.c |   27 +++++++--------------------
> >  2 files changed, 18 insertions(+), 23 deletions(-)
> > 
> > Index: linux-2.6-tip/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-tip.orig/kernel/sched.c
> > +++ linux-2.6-tip/kernel/sched.c
> > @@ -2733,7 +2733,7 @@ void scheduler_ipi(void)
> >  	struct rq *rq = this_rq();
> >  	struct task_struct *list = xchg(&rq->wake_list, NULL);
> >  
> > -	if (!list)
> > +	if (!list && !idle_cpu(cpu_of(rq)))
> >  		return;
> 
> Why not make that !rq->nohz_balance_kick? (wrapped in a helper for !
> CONFIG_NO_HZ)

If a rq gets busy before we do nohz_idle_balance() which does the
nohz_balance_kick reset, we will have a busy rq with nohz_balance_kick
set. And wanted to bail out sooner by checking for idle cpu and minimize
the impact for a busy rq having the nohz_idle_balance set.

I can probably rename your got_nohz_kick() as got_nohz_idle_kick() and
fix it.

> > tself as idle load_balancer, while
> > @@ -4450,11 +4434,14 @@ static void nohz_balancer_kick(int cpu)
> >  	}
> >  
> >  	if (!cpu_rq(ilb_cpu)->nohz_balance_kick) {
> > -		struct call_single_data *cp;
> > -
> >  		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;
> > -		cp = &per_cpu(remote_sched_softirq_cb, cpu);
> > -		__smp_call_function_single(ilb_cpu, cp, 0);
> > +		/*
> > +		 * Use kick_process instead of resched_cpu.
> > +		 * This way we generate a sched IPI on the target cpu which
> > +		 * is idle. And the softirq performing nohz idle load balance
> > +		 * will be run before returning from the IPI.
> > +		 */
> 
> Shouldn't we have a memory barrier of sorts before sending the IPI?
> 
> > +		kick_process(idle_task(ilb_cpu));

Correct and also I think we can use smp_send_reschedule() directly
instead of kick process. Will fix it.

thanks,
suresh


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-10-03 21:12 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-29 22:30 [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Suresh Siddha
2011-09-29 22:30 ` [patch 2/2] sched: request for idle balance during nohz idle load balance Suresh Siddha
2011-10-03 19:36 ` [patch 1/2] sched: use resched IPI to kick off the nohz idle balance Peter Zijlstra
2011-10-03 21:13   ` Suresh Siddha
2011-10-03 19:40 ` Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.