* CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs
       [not found] <AM0PR03MB480425D5999E0D08DAB30204BB8E0@AM0PR03MB4804.eurprd03.prod.outlook.com>
@ 2019-01-04 12:42 ` Tom Putzeys
  2019-01-07 10:26   ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Tom Putzeys @ 2019-01-04 12:42 UTC (permalink / raw)
  To: mingo, peterz; +Cc: linux-kernel

Dear Ingo and Peter,

I would like to report a possible bug in the CFS scheduler causing a deadlock.

We suspect this bug has caused intermittent yet highly persistent system freezes on our quad-core SMP systems.

We noticed the problem on 4.1.17 preempt-rt, but we suspect the problematic code is not linked to the preempt-rt patch and is also present in the latest 4.20 kernel.

The problem concerns the use of spin_lock to lock cfs_b in a situation where the same spin lock is also used in a hard interrupt handler:
- __run_hrtimer (in kernel/time/hrtimer.c) calls fn(timer) with IRQs enabled. This can call sched_cfs_period_timer() (in kernel/sched/fair.c), which locks cfs_b.
- the hard IRQ smp_apic_timer_interrupt can then occur on the same CPU. It can call ttwu_queue(), which grabs the spin lock for its CPU's run queue and can then try to enqueue a task via the CFS scheduler.
- this can call check_enqueue_throttle(), which can call assign_cfs_rq_runtime(), which tries to take the cfs_b lock already held by the interrupted timer callback. The CPU is now deadlocked.

The cfs_b lock is taken with spin_lock and so was not intended for use inside a hard IRQ, but the CFS scheduler does just that when it uses an hrtimer interrupt to wake up and enqueue work. Our initial impression is that cfs_b needs to be locked using spin_lock_irqsave.
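
To illustrate the pattern we have in mind (a minimal sketch with made-up
names, not the actual fair.c code), a lock that is shared with hard-IRQ
context would normally be taken like this:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);	/* stands in for cfs_b->lock */

/* timer callback: in our scenario this runs with IRQs enabled */
static void example_period_handler(void)
{
	unsigned long flags;

	/*
	 * A plain spin_lock() here would leave IRQs on, so a hard IRQ on
	 * this CPU that also takes example_lock would spin forever against
	 * the interrupted holder. Disabling IRQs for the critical section
	 * closes that window.
	 */
	spin_lock_irqsave(&example_lock, flags);
	/* ... refill and distribute runtime ... */
	spin_unlock_irqrestore(&example_lock, flags);
}

/* wakeup path reached from the hard IRQ, taking the same lock */
static void example_irq_path(void)
{
	spin_lock(&example_lock);	/* IRQs are already off in hard-IRQ context */
	/* ... enqueue the woken task ... */
	spin_unlock(&example_lock);
}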

My colleague Mike Pearce submitted a bug report on Bugzilla three weeks ago: https://bugzilla.kernel.org/show_bug.cgi?id=201993

We would appreciate any feedback.

Kind regards,

Tom

     


* Re: CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs
  2019-01-04 12:42 ` CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs Tom Putzeys
@ 2019-01-07 10:26   ` Peter Zijlstra
  2019-01-07 12:28     ` Mike Galbraith
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2019-01-07 10:26 UTC (permalink / raw)
  To: Tom Putzeys
  Cc: mingo, linux-kernel, Sebastian Andrzej Siewior, Thomas Gleixner

On Fri, Jan 04, 2019 at 12:42:27PM +0000, Tom Putzeys wrote:
> Dear Ingo and Peter,
> 
> I would like to report a possible bug in the CFS scheduler causing a
> deadlock.
> 
> We suspect this bug has caused intermittent yet highly persistent
> system freezes on our quad-core SMP systems.
> 
> We noticed the problem on 4.1.17 preempt-rt but we suspect the
> problematic code is not linked to the preempt-rt patch and is also
> present in the latest 4.20 kernel.
> 
> The problem concerns the use of spin_lock to lock cfs_b in a situation
> where the spin lock is used in an interrupt handler:

> - __run_hrtimer (in kernel/time/hrtimer.c) calls fn(timer) with IRQs
> enabled. This can call sched_cfs_period_timer() (in
> kernel/sched/fair.c) which locks cfs_b. 

Hurmph, that's the softirq timer handling. And that is a nasty subtle
difference in context between softirq and hardirq timers.

Also, upstream doesn't use HRTIMER_MODE_SOFT here, but I suppose -rt
forces everything !HARD into SOFT.
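
To spell out that difference (illustrative sketch only, made-up names; the
*_HARD mode variants are the ones -rt provides):

#include <linux/hrtimer.h>

static struct hrtimer example_timer;

static enum hrtimer_restart example_fn(struct hrtimer *t)
{
	/*
	 * With a plain (non-_HARD) mode, -rt runs this from the timer
	 * softirq with IRQs enabled; with a *_HARD mode it runs from
	 * hard-IRQ context with IRQs disabled -- which is what the
	 * cfs_b locking silently assumed.
	 */
	return HRTIMER_NORESTART;
}

static void example_init(void)
{
	/* on -rt this becomes SOFT, so example_fn() sees IRQs enabled */
	hrtimer_init(&example_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	example_timer.function = example_fn;
}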

> - the hard IRQ smp_apic_timer_interrupt can then occur. It can call
> ttwu_queue() which grabs the spin lock for its CPU run queue and can
> then try to enqueue a task via the CFS scheduler.

> - this can call check_enqueue_throttle() which can call
> assign_cfs_rq_runtime() which tries to obtain the cfs_b lock. It is
> now blocked.
> 
> The cfs_b lock is taken with spin_lock and so was not intended for use
> inside a hard IRQ, but the CFS scheduler does just that when it uses an
> hrtimer interrupt to wake up and enqueue work. Our initial impression
> is that cfs_b needs to be locked using spin_lock_irqsave.

I would expect lockdep to also complain about this, but yes, something
like that. I was very much expecting this to run with IRQs disabled (and
it does on mainline afaict).
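
(Aside, purely an illustrative sketch with a hypothetical helper: that
expectation can also be made explicit, so lockdep complains at the first
offending caller instead of only via an inversion report.)

#include <linux/irqflags.h>
#include <linux/spinlock.h>

/* hypothetical helper for a critical section that assumes hard-IRQ context */
static void example_critical_section(raw_spinlock_t *lock)
{
	/*
	 * Documents the "IRQs are already off here" assumption; lockdep
	 * warns immediately if an IRQs-on path (e.g. a SOFT-mode timer
	 * callback) reaches this.
	 */
	lockdep_assert_irqs_disabled();

	raw_spin_lock(lock);
	/* ... work that relies on IRQs being off ... */
	raw_spin_unlock(lock);
}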


* Re: CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs
  2019-01-07 10:26   ` Peter Zijlstra
@ 2019-01-07 12:28     ` Mike Galbraith
  2019-01-07 12:52       ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Galbraith @ 2019-01-07 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, Tom Putzeys
  Cc: mingo, linux-kernel, Sebastian Andrzej Siewior, Thomas Gleixner

On Mon, 2019-01-07 at 11:26 +0100, Peter Zijlstra wrote:
> 
> I would expect lockdep to also complain about this...

And grumble it did.

commit df7e8acc0c9a84979a448d215b8ef889efe4ac5a
Author: Mike Galbraith <efault@gmx.de>
Date:   Fri May 4 08:14:38 2018 +0200

    sched/fair: Fix CFS bandwidth control lockdep DEADLOCK report
    
    CFS bandwidth control yields the inversion gripe below; moving the
    timer handling to hard-IRQ context quells it.
    
    |========================================================
    |WARNING: possible irq lock inversion dependency detected
    |4.16.7-rt1-rt #2 Tainted: G            E
    |--------------------------------------------------------
    |sirq-hrtimer/0/15 just changed the state of lock:
    | (&cfs_b->lock){+...}, at: [<000000009adb5cf7>] sched_cfs_period_timer+0x28/0x140
    |but this lock was taken by another, HARDIRQ-safe lock in the past: (&rq->lock){-...}
    |and interrupts could create inverse lock ordering between them.
    |other info that might help us debug this:
    | Possible interrupt unsafe locking scenario:
    |       CPU0                    CPU1
    |       ----                    ----
    |  lock(&cfs_b->lock);
    |                               local_irq_disable();
    |                               lock(&rq->lock);
    |                               lock(&cfs_b->lock);
    |  <Interrupt>
    |    lock(&rq->lock);
    
    Cc: stable-rt@vger.kernel.org
    Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 960ad0ce77d7..420624c49f38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5007,9 +5007,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
-	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 }
 


* Re: CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs
  2019-01-07 12:28     ` Mike Galbraith
@ 2019-01-07 12:52       ` Peter Zijlstra
  2019-01-08  5:30         ` Mike Galbraith
                           ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Peter Zijlstra @ 2019-01-07 12:52 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tom Putzeys, mingo, linux-kernel, Sebastian Andrzej Siewior,
	Thomas Gleixner

On Mon, Jan 07, 2019 at 01:28:34PM +0100, Mike Galbraith wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 960ad0ce77d7..420624c49f38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5007,9 +5007,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  	cfs_b->period = ns_to_ktime(default_cfs_period());
>  
>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> -	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> +	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
>  	cfs_b->period_timer.function = sched_cfs_period_timer;
> -	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
>  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
>  }

Right, that should sort it. But I'm not sure this is the best solution,
though. That cfs-runtime crud can (IIRC) iterate lists etc., so running
it from the softirq isn't a bad idea. We just need to fix that locking
up a bit.

Something a wee bit like so perhaps..

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 13776fac7b74..3cfe26aa098a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4566,7 +4566,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		struct rq *rq = rq_of(cfs_rq);
 		struct rq_flags rf;
 
-		rq_lock(rq, &rf);
+		rq_lock_irqsave(rq, &rf);
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
@@ -4583,7 +4583,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 			unthrottle_cfs_rq(cfs_rq);
 
 next:
-		rq_unlock(rq, &rf);
+		rq_unlock_irqrestore(rq, &rf);
 
 		if (!remaining)
 			break;
@@ -4599,7 +4599,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
  * period the timer is deactivated until scheduling resumes; cfs_b->idle is
  * used to track this state.
  */
-static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
 	u64 runtime, runtime_expires;
 	int throttled;
@@ -4641,11 +4641,11 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
 		runtime = cfs_b->runtime;
 		cfs_b->distribute_running = 1;
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
 		runtime = distribute_cfs_runtime(cfs_b, runtime,
 						 runtime_expires);
-		raw_spin_lock(&cfs_b->lock);
+		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
 		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
@@ -4754,17 +4754,18 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	unsigned long flags;
 	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (cfs_b->distribute_running) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
 	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
@@ -4775,18 +4776,18 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	if (!runtime)
 		return;
 
 	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (expires == cfs_b->runtime_expires)
 		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
 
 /*
@@ -4864,20 +4865,21 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	unsigned long flags;
 	int overrun;
 	int idle = 0;
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
-		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
 	}
 	if (idle)
 		cfs_b->period_active = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }


* Re: CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs
  2019-01-07 12:52       ` Peter Zijlstra
@ 2019-01-08  5:30         ` Mike Galbraith
  2019-01-08  9:06           ` Peter Zijlstra
  2019-01-21 11:37         ` [tip:sched/core] sched/fair: Robustify CFS-bandwidth timer locking tip-bot for Peter Zijlstra
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Mike Galbraith @ 2019-01-08  5:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Putzeys, mingo, linux-kernel, Sebastian Andrzej Siewior,
	Thomas Gleixner

On Mon, 2019-01-07 at 13:52 +0100, Peter Zijlstra wrote:
> On Mon, Jan 07, 2019 at 01:28:34PM +0100, Mike Galbraith wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 960ad0ce77d7..420624c49f38 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5007,9 +5007,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >  	cfs_b->period = ns_to_ktime(default_cfs_period());
> >  
> >  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> > -	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> > +	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
> >  	cfs_b->period_timer.function = sched_cfs_period_timer;
> > -	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
> >  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
> >  }
> 
> Right, that should sort it. But I'm not sure this is the best solution,
> though. That cfs-runtime crud can (IIRC) iterate lists etc., so running
> it from the softirq isn't a bad idea. We just need to fix that locking
> up a bit.
> 
> Something a wee bit like so perhaps..

I plugged that into 4.19-rt along with a revert of the hard irq context
change, and it (as expected) does the trick.

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 13776fac7b74..3cfe26aa098a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4566,7 +4566,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  		struct rq *rq = rq_of(cfs_rq);
>  		struct rq_flags rf;
>  
> -		rq_lock(rq, &rf);
> +		rq_lock_irqsave(rq, &rf);
>  		if (!cfs_rq_throttled(cfs_rq))
>  			goto next;
>  
> @@ -4583,7 +4583,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  			unthrottle_cfs_rq(cfs_rq);
>  
>  next:
> -		rq_unlock(rq, &rf);
> +		rq_unlock_irqrestore(rq, &rf);
>  
>  		if (!remaining)
>  			break;
> @@ -4599,7 +4599,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>   * period the timer is deactivated until scheduling resumes; cfs_b->idle is
>   * used to track this state.
>   */
> -static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
>  {
>  	u64 runtime, runtime_expires;
>  	int throttled;
> @@ -4641,11 +4641,11 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
>  		runtime = cfs_b->runtime;
>  		cfs_b->distribute_running = 1;
> -		raw_spin_unlock(&cfs_b->lock);
> +		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
>  		runtime = distribute_cfs_runtime(cfs_b, runtime,
>  						 runtime_expires);
> -		raw_spin_lock(&cfs_b->lock);
> +		raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  
>  		cfs_b->distribute_running = 0;
>  		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> @@ -4754,17 +4754,18 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> +	unsigned long flags;
>  	u64 expires;
>  
>  	/* confirm we're still not at a refresh boundary */
> -	raw_spin_lock(&cfs_b->lock);
> +	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  	if (cfs_b->distribute_running) {
> -		raw_spin_unlock(&cfs_b->lock);
> +		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		return;
>  	}
>  
>  	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
> -		raw_spin_unlock(&cfs_b->lock);
> +		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		return;
>  	}
>  
> @@ -4775,18 +4776,18 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (runtime)
>  		cfs_b->distribute_running = 1;
>  
> -	raw_spin_unlock(&cfs_b->lock);
> +	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  
>  	if (!runtime)
>  		return;
>  
>  	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
>  
> -	raw_spin_lock(&cfs_b->lock);
> +	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  	if (expires == cfs_b->runtime_expires)
>  		lsub_positive(&cfs_b->runtime, runtime);
>  	cfs_b->distribute_running = 0;
> -	raw_spin_unlock(&cfs_b->lock);
> +	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }
>  
>  /*
> @@ -4864,20 +4865,21 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  {
>  	struct cfs_bandwidth *cfs_b =
>  		container_of(timer, struct cfs_bandwidth, period_timer);
> +	unsigned long flags;
>  	int overrun;
>  	int idle = 0;
>  
> -	raw_spin_lock(&cfs_b->lock);
> +	raw_spin_lock_irqsave(&cfs_b->lock, flags);
>  	for (;;) {
>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
>  		if (!overrun)
>  			break;
>  
> -		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> +		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
>  	}
>  	if (idle)
>  		cfs_b->period_active = 0;
> -	raw_spin_unlock(&cfs_b->lock);
> +	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  
>  	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
>  }


* Re: CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs
  2019-01-08  5:30         ` Mike Galbraith
@ 2019-01-08  9:06           ` Peter Zijlstra
  2019-01-08 11:05             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2019-01-08  9:06 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tom Putzeys, mingo, linux-kernel, Sebastian Andrzej Siewior,
	Thomas Gleixner

On Tue, Jan 08, 2019 at 06:30:59AM +0100, Mike Galbraith wrote:
> On Mon, 2019-01-07 at 13:52 +0100, Peter Zijlstra wrote:
> > On Mon, Jan 07, 2019 at 01:28:34PM +0100, Mike Galbraith wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 960ad0ce77d7..420624c49f38 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5007,9 +5007,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > >  	cfs_b->period = ns_to_ktime(default_cfs_period());
> > >  
> > >  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> > > -	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> > > +	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
> > >  	cfs_b->period_timer.function = sched_cfs_period_timer;
> > > -	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > > +	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
> > >  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
> > >  }
> > 
> > Right, that should sort it. But I'm not sure this is the best solution,
> > though. That cfs-runtime crud can (IIRC) iterate lists etc., so running
> > it from the softirq isn't a bad idea. We just need to fix that locking
> > up a bit.
> > 
> > Something a wee bit like so perhaps..
> 
> I plugged that into 4.19-rt along with a revert of the hard irq context
> change, and it (as expected) does the trick.

Much appreciated; I queued it as the below.

---
Subject: sched/fair: Robustify CFS-bandwidth timer locking
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 7 Jan 2019 13:52:31 +0100

Traditionally hrtimer callbacks were run with IRQs disabled, but with
the introduction of HRTIMER_MODE_SOFT it is possible they run from
SoftIRQ context, which does _NOT_ have IRQs disabled.

Allow for the CFS bandwidth timers (period_timer and slack_timer) to
be run from SoftIRQ context; this entails removing the assumption that
IRQs are already disabled from the locking.

While mainline doesn't strictly need this, -RT forces all timers not
explicitly marked with MODE_HARD into MODE_SOFT and trips over this.
And marking these timers as MODE_HARD doesn't make sense as they're
not required for RT operation and can potentially be quite expensive.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reported-by: Tom Putzeys <tom.putzeys@be.atlascopco.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190107125231.GE14122@hirez.programming.kicks-ass.net
---
 kernel/sched/fair.c |   30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4566,7 +4566,7 @@ static u64 distribute_cfs_runtime(struct
 		struct rq *rq = rq_of(cfs_rq);
 		struct rq_flags rf;
 
-		rq_lock(rq, &rf);
+		rq_lock_irqsave(rq, &rf);
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
@@ -4583,7 +4583,7 @@ static u64 distribute_cfs_runtime(struct
 			unthrottle_cfs_rq(cfs_rq);
 
 next:
-		rq_unlock(rq, &rf);
+		rq_unlock_irqrestore(rq, &rf);
 
 		if (!remaining)
 			break;
@@ -4599,7 +4599,7 @@ static u64 distribute_cfs_runtime(struct
  * period the timer is deactivated until scheduling resumes; cfs_b->idle is
  * used to track this state.
  */
-static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
 	u64 runtime, runtime_expires;
 	int throttled;
@@ -4641,11 +4641,11 @@ static int do_sched_cfs_period_timer(str
 	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
 		runtime = cfs_b->runtime;
 		cfs_b->distribute_running = 1;
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
 		runtime = distribute_cfs_runtime(cfs_b, runtime,
 						 runtime_expires);
-		raw_spin_lock(&cfs_b->lock);
+		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
 		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
@@ -4754,17 +4754,18 @@ static __always_inline void return_cfs_r
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	unsigned long flags;
 	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (cfs_b->distribute_running) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
 	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
@@ -4775,18 +4776,18 @@ static void do_sched_cfs_slack_timer(str
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	if (!runtime)
 		return;
 
 	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (expires == cfs_b->runtime_expires)
 		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
 
 /*
@@ -4864,20 +4865,21 @@ static enum hrtimer_restart sched_cfs_pe
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	unsigned long flags;
 	int overrun;
 	int idle = 0;
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
-		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
 	}
 	if (idle)
 		cfs_b->period_active = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }


* Re: CFS scheduler: spin_lock usage causes deadlock when smp_apic_timer_interrupt occurs
  2019-01-08  9:06           ` Peter Zijlstra
@ 2019-01-08 11:05             ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 10+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-01-08 11:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Tom Putzeys, mingo, linux-kernel, Thomas Gleixner

On 2019-01-08 10:06:25 [+0100], Peter Zijlstra wrote:
> Much appreciated; I queued it as the below.
thank you, queued, too.

Sebastian


* [tip:sched/core] sched/fair: Robustify CFS-bandwidth timer locking
  2019-01-07 12:52       ` Peter Zijlstra
  2019-01-08  5:30         ` Mike Galbraith
@ 2019-01-21 11:37         ` tip-bot for Peter Zijlstra
  2019-01-21 13:53         ` tip-bot for Peter Zijlstra
  2019-01-27 11:36         ` tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-01-21 11:37 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, tglx, efault, bigeasy, tom.putzeys, peterz, mingo,
	linux-kernel

Commit-ID:  b733c2d2f2810ec8556d2d711d1b95f491bd7697
Gitweb:     https://git.kernel.org/tip/b733c2d2f2810ec8556d2d711d1b95f491bd7697
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Jan 2019 13:52:31 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 21 Jan 2019 11:27:55 +0100

sched/fair: Robustify CFS-bandwidth timer locking

Traditionally hrtimer callbacks were run with IRQs disabled, but with
the introduction of HRTIMER_MODE_SOFT it is possible they run from
SoftIRQ context, which does _NOT_ have IRQs disabled.

Allow for the CFS bandwidth timers (period_timer and slack_timer) to
be run from SoftIRQ context; this entails removing the assumption that
IRQs are already disabled from the locking.

While mainline doesn't strictly need this, -RT forces all timers not
explicitly marked with MODE_HARD into MODE_SOFT and trips over this.
And marking these timers as MODE_HARD doesn't make sense as they're
not required for RT operation and can potentially be quite expensive.

Reported-by: Tom Putzeys <tom.putzeys@be.atlascopco.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190107125231.GE14122@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 385725eb3bd6..90c7a7bf45d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4565,7 +4565,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		struct rq *rq = rq_of(cfs_rq);
 		struct rq_flags rf;
 
-		rq_lock(rq, &rf);
+		rq_lock_irqsave(rq, &rf);
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
@@ -4582,7 +4582,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 			unthrottle_cfs_rq(cfs_rq);
 
 next:
-		rq_unlock(rq, &rf);
+		rq_unlock_irqrestore(rq, &rf);
 
 		if (!remaining)
 			break;
@@ -4598,7 +4598,7 @@ next:
  * period the timer is deactivated until scheduling resumes; cfs_b->idle is
  * used to track this state.
  */
-static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
 	u64 runtime, runtime_expires;
 	int throttled;
@@ -4640,11 +4640,11 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
 		runtime = cfs_b->runtime;
 		cfs_b->distribute_running = 1;
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
 		runtime = distribute_cfs_runtime(cfs_b, runtime,
 						 runtime_expires);
-		raw_spin_lock(&cfs_b->lock);
+		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
 		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
@@ -4753,17 +4753,18 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	unsigned long flags;
 	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (cfs_b->distribute_running) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
 	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
@@ -4774,18 +4775,18 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	if (!runtime)
 		return;
 
 	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (expires == cfs_b->runtime_expires)
 		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
 
 /*
@@ -4863,20 +4864,21 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	unsigned long flags;
 	int overrun;
 	int idle = 0;
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
-		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
 	}
 	if (idle)
 		cfs_b->period_active = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }


* [tip:sched/core] sched/fair: Robustify CFS-bandwidth timer locking
  2019-01-07 12:52       ` Peter Zijlstra
  2019-01-08  5:30         ` Mike Galbraith
  2019-01-21 11:37         ` [tip:sched/core] sched/fair: Robustify CFS-bandwidth timer locking tip-bot for Peter Zijlstra
@ 2019-01-21 13:53         ` tip-bot for Peter Zijlstra
  2019-01-27 11:36         ` tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-01-21 13:53 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, peterz, tom.putzeys, linux-kernel, efault, hpa, mingo,
	bigeasy, tglx

Commit-ID:  3cd126af79ed5a4d6b06eba63d3349e143a3bd3b
Gitweb:     https://git.kernel.org/tip/3cd126af79ed5a4d6b06eba63d3349e143a3bd3b
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Jan 2019 13:52:31 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 21 Jan 2019 14:40:28 +0100

sched/fair: Robustify CFS-bandwidth timer locking

Traditionally hrtimer callbacks were run with IRQs disabled, but with
the introduction of HRTIMER_MODE_SOFT it is possible they run from
SoftIRQ context, which does _NOT_ have IRQs disabled.

Allow for the CFS bandwidth timers (period_timer and slack_timer) to
be run from SoftIRQ context; this entails removing the assumption that
IRQs are already disabled from the locking.

While mainline doesn't strictly need this, -RT forces all timers not
explicitly marked with MODE_HARD into MODE_SOFT and trips over this.
And marking these timers as MODE_HARD doesn't make sense as they're
not required for RT operation and can potentially be quite expensive.

Reported-by: Tom Putzeys <tom.putzeys@be.atlascopco.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190107125231.GE14122@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1374fbddd0d..3b61e19b504a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4565,7 +4565,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		struct rq *rq = rq_of(cfs_rq);
 		struct rq_flags rf;
 
-		rq_lock(rq, &rf);
+		rq_lock_irqsave(rq, &rf);
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
@@ -4582,7 +4582,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 			unthrottle_cfs_rq(cfs_rq);
 
 next:
-		rq_unlock(rq, &rf);
+		rq_unlock_irqrestore(rq, &rf);
 
 		if (!remaining)
 			break;
@@ -4598,7 +4598,7 @@ next:
  * period the timer is deactivated until scheduling resumes; cfs_b->idle is
  * used to track this state.
  */
-static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
 	u64 runtime, runtime_expires;
 	int throttled;
@@ -4640,11 +4640,11 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
 		runtime = cfs_b->runtime;
 		cfs_b->distribute_running = 1;
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
 		runtime = distribute_cfs_runtime(cfs_b, runtime,
 						 runtime_expires);
-		raw_spin_lock(&cfs_b->lock);
+		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
 		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
@@ -4753,17 +4753,18 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	unsigned long flags;
 	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (cfs_b->distribute_running) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
 	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
@@ -4774,18 +4775,18 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	if (!runtime)
 		return;
 
 	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (expires == cfs_b->runtime_expires)
 		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
 
 /*
@@ -4863,20 +4864,21 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	unsigned long flags;
 	int overrun;
 	int idle = 0;
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
-		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
 	}
 	if (idle)
 		cfs_b->period_active = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }


* [tip:sched/core] sched/fair: Robustify CFS-bandwidth timer locking
  2019-01-07 12:52       ` Peter Zijlstra
                           ` (2 preceding siblings ...)
  2019-01-21 13:53         ` tip-bot for Peter Zijlstra
@ 2019-01-27 11:36         ` tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 10+ messages in thread
From: tip-bot for Peter Zijlstra @ 2019-01-27 11:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tom.putzeys, mingo, efault, bigeasy, torvalds, tglx, peterz,
	linux-kernel, hpa

Commit-ID:  c0ad4aa4d8416a39ad262a2bd68b30acd951bf0e
Gitweb:     https://git.kernel.org/tip/c0ad4aa4d8416a39ad262a2bd68b30acd951bf0e
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 7 Jan 2019 13:52:31 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 27 Jan 2019 12:29:37 +0100

sched/fair: Robustify CFS-bandwidth timer locking

Traditionally hrtimer callbacks were run with IRQs disabled, but with
the introduction of HRTIMER_MODE_SOFT it is possible they run from
SoftIRQ context, which does _NOT_ have IRQs disabled.

Allow for the CFS bandwidth timers (period_timer and slack_timer) to
be run from SoftIRQ context; this entails removing the assumption that
IRQs are already disabled from the locking.

While mainline doesn't strictly need this, -RT forces all timers not
explicitly marked with MODE_HARD into MODE_SOFT and trips over this.
And marking these timers as MODE_HARD doesn't make sense as they're
not required for RT operation and can potentially be quite expensive.

Reported-by: Tom Putzeys <tom.putzeys@be.atlascopco.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190107125231.GE14122@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1374fbddd0d..3b61e19b504a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4565,7 +4565,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 		struct rq *rq = rq_of(cfs_rq);
 		struct rq_flags rf;
 
-		rq_lock(rq, &rf);
+		rq_lock_irqsave(rq, &rf);
 		if (!cfs_rq_throttled(cfs_rq))
 			goto next;
 
@@ -4582,7 +4582,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
 			unthrottle_cfs_rq(cfs_rq);
 
 next:
-		rq_unlock(rq, &rf);
+		rq_unlock_irqrestore(rq, &rf);
 
 		if (!remaining)
 			break;
@@ -4598,7 +4598,7 @@ next:
  * period the timer is deactivated until scheduling resumes; cfs_b->idle is
  * used to track this state.
  */
-static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
 {
 	u64 runtime, runtime_expires;
 	int throttled;
@@ -4640,11 +4640,11 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
 		runtime = cfs_b->runtime;
 		cfs_b->distribute_running = 1;
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		/* we can't nest cfs_b->lock while distributing bandwidth */
 		runtime = distribute_cfs_runtime(cfs_b, runtime,
 						 runtime_expires);
-		raw_spin_lock(&cfs_b->lock);
+		raw_spin_lock_irqsave(&cfs_b->lock, flags);
 
 		cfs_b->distribute_running = 0;
 		throttled = !list_empty(&cfs_b->throttled_cfs_rq);
@@ -4753,17 +4753,18 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	unsigned long flags;
 	u64 expires;
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (cfs_b->distribute_running) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
 	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
-		raw_spin_unlock(&cfs_b->lock);
+		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 		return;
 	}
 
@@ -4774,18 +4775,18 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	if (runtime)
 		cfs_b->distribute_running = 1;
 
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	if (!runtime)
 		return;
 
 	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	if (expires == cfs_b->runtime_expires)
 		lsub_positive(&cfs_b->runtime, runtime);
 	cfs_b->distribute_running = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
 
 /*
@@ -4863,20 +4864,21 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
+	unsigned long flags;
 	int overrun;
 	int idle = 0;
 
-	raw_spin_lock(&cfs_b->lock);
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
-		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
 	}
 	if (idle)
 		cfs_b->period_active = 0;
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }

