linux-kernel.vger.kernel.org archive mirror
* CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
@ 2013-10-31 14:07 Mike Galbraith
  2013-11-06 17:49 ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-10-31 14:07 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Peter Zijlstra, LKML, RT

Hi Frederic,

The tick wakes ksoftirqd, ensuring nr_running test ain't gonna happen
when an otherwise lonely task takes the timer interrupt.  Deferring to
softirq processing time..... works.

---
 kernel/sched/core.c      |    2 +-
 kernel/softirq.c         |   10 ++++++++++
 kernel/time/tick-sched.c |    2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -704,7 +704,7 @@ bool sched_can_stop_tick(void)
        smp_rmb();
 
        /* More than one running task need preemption */
-       if (rq->nr_running > 1)
+       if (rq->nr_running - (rq->curr == this_cpu_ksoftirqd()) > 1)
                return false;
 
        return true;
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -529,6 +529,11 @@ static void do_current_softirqs(int need
 		softirq_clr_runner(i);
 		unlock_softirq(i);
 		WARN_ON(current->softirq_nestcnt != 1);
+
+#ifdef CONFIG_NO_HZ_COMMON
+		if (i == TIMER_SOFTIRQ && tick_nohz_full_cpu(raw_smp_processor_id()))
+			 tick_nohz_irq_exit();
+#endif
 	}
 }
 
@@ -728,6 +733,11 @@ static inline void tick_irq_exit(void)
 #ifdef CONFIG_NO_HZ_COMMON
 	int cpu = smp_processor_id();
 
+#ifdef CONFIG_PREEMPT_RT_FULL
+	if (tick_nohz_full_cpu(cpu))
+		return;
+#endif
+
 	/* Make sure that timer wheel updates are propagated */
 	if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
 		if (!in_interrupt())
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -157,7 +157,9 @@ bool have_nohz_full_mask;
 
 static bool can_stop_full_tick(void)
 {
+#ifndef CONFIG_PREEMPT_RT_FULL
 	WARN_ON_ONCE(!irqs_disabled());
+#endif
 
 	if (!sched_can_stop_tick()) {
 		trace_tick_stop(0, "more than 1 task in runqueue\n");
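
As an aside on the sched_can_stop_tick() hunk above: a relational
expression evaluates to 0 or 1 in C, so the patched test simply
discounts ksoftirqd from the runnable count when it is the one
currently on the CPU. A stand-alone toy model of just that check (a
sketch, not kernel code; the two-field struct rq is a stand-in for the
real runqueue):

#include <stdio.h>
#include <stdbool.h>

struct rq { unsigned long nr_running; const void *curr; };

static bool can_stop_tick(const struct rq *rq, const void *ksoftirqd)
{
        /* nr_running, minus one if the current task is ksoftirqd */
        return rq->nr_running - (rq->curr == ksoftirqd) <= 1;
}

int main(void)
{
        int task, softirqd;
        struct rq rq = { .nr_running = 2, .curr = &softirqd };

        printf("%d\n", can_stop_tick(&rq, &softirqd)); /* 1: the only extra runner is ksoftirqd itself */
        rq.curr = &task;
        printf("%d\n", can_stop_tick(&rq, &softirqd)); /* 0: two runners and current is not ksoftirqd */
        return 0;
}

That is also why the diagnostic defers the check to softirq processing
time: the discount only helps while ksoftirqd is the current task.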




* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-10-31 14:07 CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo Mike Galbraith
@ 2013-11-06 17:49 ` Thomas Gleixner
  2013-11-07  3:26   ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-06 17:49 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Thu, 31 Oct 2013, Mike Galbraith wrote:

> Hi Frederic,
> 
> The tick wakes ksoftirqd, ensuring nr_running test ain't gonna happen
> when an otherwise lonely task takes the timer interrupt.  Deferring to
> softirq processing time..... works.

-ENOPARSE

What the heck is this patch doing aside of slapping #ifdeffery all
over the place?

I bet you are trying to work around some of the side effects of the
occasional tick which is still necessary despite full nohz, right?

Bah, we want to solve the underlying problems and get rid of the tick
completely instead of hiding the issues by magic hackery.
 
Thanks,

	tglx


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-06 17:49 ` Thomas Gleixner
@ 2013-11-07  3:26   ` Mike Galbraith
  2013-11-07  4:31     ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-11-07  3:26 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote: 
> On Thu, 31 Oct 2013, Mike Galbraith wrote:
> 
> > Hi Frederic,
> > 
> > The tick wakes ksoftirqd, ensuring nr_running test ain't gonna happen
> > when an otherwise lonely task takes the timer interrupt.  Deferring to
> > softirq processing time..... works.
> 
> -ENOPARSE
> 
> What the heck is this patch doing aside of slapping #ifdeffery all
> over the place?

It ain't a patch, it's a diagnostic.

> I bet you are trying to work around some of the side effects of the
> occasional tick which is still necessary despite full nohz, right?

Nope, I wanted to check out cost of nohz_full for rt, and found that it
doesn't work at all instead, looked, and found that the sole running
task has just awakened ksoftirqd when it wants to shut the tick down, so
that shutdown never happens. 

> Bah, we want to solve the underlying problems and get rid of the tick
> completely instead of hiding the issues by magic hackery.

That's why I mentioned it.  That ".... works" has a whooooole different
meaning than "this is the fix" :)

-Mike



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07  3:26   ` Mike Galbraith
@ 2013-11-07  4:31     ` Mike Galbraith
  2013-11-07 11:21       ` Thomas Gleixner
  2013-11-08  3:23       ` Paul E. McKenney
  0 siblings, 2 replies; 39+ messages in thread
From: Mike Galbraith @ 2013-11-07  4:31 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Thu, 2013-11-07 at 04:26 +0100, Mike Galbraith wrote: 
> On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote: 

> > I bet you are trying to work around some of the side effects of the
> > occasional tick which is still necessary despite full nohz, right?
> 
> Nope, I wanted to check out cost of nohz_full for rt, and found that it
> doesn't work at all instead, looked, and found that the sole running
> task has just awakened ksoftirqd when it wants to shut the tick down, so
> that shutdown never happens. 

Like so in virgin 3.10-rt.  Box is x3550 M3 booted nowatchdog
rcu_nocbs=1-3 nohz_full=1-3, and CPUs 1-3 are completely isolated via
cpusets as well.
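
(For reference, a cpuset setup of the kind described might look
roughly like the following sketch, which assumes the cgroup-v1 cpuset
controller, a single memory node, and the group name rtcpus used by
the cgexec invocations later in the thread:

  mkdir /sys/fs/cgroup/cpuset/rtcpus
  echo 1-3 > /sys/fs/cgroup/cpuset/rtcpus/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/rtcpus/cpuset.mems
  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Turning off load balancing in the root set is one common way to keep
the scheduler from pulling work onto the isolated CPUs.)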

... 
            pert-5229  [003] d..h2..   684.481615: hrtimer_cancel: hrtimer=ffff88017fccbac0
            pert-5229  [003] d..h1..   684.481616: hrtimer_expire_entry: hrtimer=ffff88017fccbac0 function=tick_sched_timer now=683820187877
            pert-5229  [003] d..h2..   684.481616: sched_stat_runtime: comm=pert pid=5229 runtime=993917 [ns] vruntime=179048871558 [ns]
            pert-5229  [003] d..h1..   684.481616: softirq_raise: vec=1 [action=TIMER]
            pert-5229  [003] d..h1..   684.481616: hrtimer_expire_exit: hrtimer=ffff88017fccbac0
            pert-5229  [003] d..h2..   684.481617: hrtimer_start: hrtimer=ffff88017fccbac0 function=tick_sched_timer expires=683821187500 softexpires=683821187500
            pert-5229  [003] d...3..   684.481618: sched_stat_runtime: comm=pert pid=5229 runtime=1634 [ns] vruntime=179048873192 [ns]
            pert-5229  [003] d.L.3..   684.481618: sched_wakeup: comm=ksoftirqd/3 pid=49 prio=120 success=1 target_cpu=003
            pert-5229  [003] d.L.1..   684.481618: tick_stop: success=no msg=more than 1 task in runqueue

            pert-5229  [003] d.L.1..   684.481619: tick_stop: success=no msg=more than 1 task in runqueue

            pert-5229  [003] d.L.3..   684.481620: sched_stat_runtime: comm=pert pid=5229 runtime=2096 [ns] vruntime=179048875288 [ns]
            pert-5229  [003] d...3..   684.481620: sched_switch: prev_comm=pert prev_pid=5229 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/3 next_pid=49 next_prio=120
     ksoftirqd/3-49    [003] ....111   684.481621: softirq_entry: vec=1 [action=TIMER]
     ksoftirqd/3-49    [003] ....111   684.481621: softirq_exit: vec=1 [action=TIMER]
     ksoftirqd/3-49    [003] d...3..   684.481622: sched_stat_runtime: comm=ksoftirqd/3 pid=49 runtime=1934 [ns] vruntime=179039875126 [ns]
     ksoftirqd/3-49    [003] d...3..   684.481622: sched_switch: prev_comm=ksoftirqd/3 prev_pid=49 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=5229 next_prio=120
            pert-5229  [003] d...3..   684.481623: sched_stat_runtime: comm=pert pid=5229 runtime=1289 [ns] vruntime=179048876577 [ns]
            pert-5229  [003] d..h2..   684.482616: hrtimer_cancel: hrtimer=ffff88017fccbac0
            pert-5229  [003] d..h1..   684.482617: hrtimer_expire_entry: hrtimer=ffff88017fccbac0 function=tick_sched_timer now=683821188024
            pert-5229  [003] d..h2..   684.482617: sched_stat_runtime: comm=pert pid=5229 runtime=994281 [ns] vruntime=179049870858 [ns]
            pert-5229  [003] d..h1..   684.482617: softirq_raise: vec=1 [action=TIMER]
            pert-5229  [003] d..h1..   684.482618: softirq_raise: vec=9 [action=RCU]
            pert-5229  [003] d..h1..   684.482618: hrtimer_expire_exit: hrtimer=ffff88017fccbac0
            pert-5229  [003] d..h2..   684.482618: hrtimer_start: hrtimer=ffff88017fccbac0 function=tick_sched_timer expires=683822187500 softexpires=683822187500
            pert-5229  [003] d...3..   684.482619: sched_stat_runtime: comm=pert pid=5229 runtime=1719 [ns] vruntime=179049872577 [ns]
            pert-5229  [003] d.L.3..   684.482619: sched_wakeup: comm=ksoftirqd/3 pid=49 prio=120 success=1 target_cpu=003
            pert-5229  [003] d.L.1..   684.482619: tick_stop: success=no msg=more than 1 task in runqueue

            pert-5229  [003] d.L.1..   684.482620: tick_stop: success=no msg=more than 1 task in runqueue

            pert-5229  [003] d.L.3..   684.482621: sched_stat_runtime: comm=pert pid=5229 runtime=2204 [ns] vruntime=179049874781 [ns]
            pert-5229  [003] d...3..   684.482621: sched_switch: prev_comm=pert prev_pid=5229 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/3 next_pid=49 next_prio=120
     ksoftirqd/3-49    [003] ....111   684.482622: softirq_entry: vec=1 [action=TIMER]
     ksoftirqd/3-49    [003] ....111   684.482623: softirq_exit: vec=1 [action=TIMER]
     ksoftirqd/3-49    [003] ....111   684.482623: softirq_entry: vec=9 [action=RCU]
     ksoftirqd/3-49    [003] ....111   684.482624: softirq_exit: vec=9 [action=RCU]
     ksoftirqd/3-49    [003] d...3..   684.482624: sched_stat_runtime: comm=ksoftirqd/3 pid=49 runtime=3049 [ns] vruntime=179040875626 [ns]
     ksoftirqd/3-49    [003] d...3..   684.482624: sched_switch: prev_comm=ksoftirqd/3 prev_pid=49 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=5229 next_prio=120
            pert-5229  [003] d...3..   684.482626: sched_stat_runtime: comm=pert pid=5229 runtime=1422 [ns] vruntime=179049876203 [ns]
            pert-5229  [003] d..h2..   684.483617: hrtimer_cancel: hrtimer=ffff88017fccbac0
            pert-5229  [003] d..h1..   684.483617: hrtimer_expire_entry: hrtimer=ffff88017fccbac0 function=tick_sched_timer now=683822187923
            pert-5229  [003] d..h2..   684.483618: sched_stat_runtime: comm=pert pid=5229 runtime=992021 [ns] vruntime=179050868224 [ns]
            pert-5229  [003] d..h1..   684.483618: softirq_raise: vec=1 [action=TIMER]
            pert-5229  [003] d..h1..   684.483618: softirq_raise: vec=9 [action=RCU]
            pert-5229  [003] d..h1..   684.483618: hrtimer_expire_exit: hrtimer=ffff88017fccbac0
            pert-5229  [003] d..h2..   684.483618: hrtimer_start: hrtimer=ffff88017fccbac0 function=tick_sched_timer expires=683823187500 softexpires=683823187500
            pert-5229  [003] d...3..   684.483619: sched_stat_runtime: comm=pert pid=5229 runtime=1778 [ns] vruntime=179050870002 [ns]
            pert-5229  [003] d.L.3..   684.483620: sched_wakeup: comm=ksoftirqd/3 pid=49 prio=120 success=1 target_cpu=003
            pert-5229  [003] d.L.1..   684.483620: tick_stop: success=no msg=more than 1 task in runqueue

            pert-5229  [003] d.L.1..   684.483621: tick_stop: success=no msg=more than 1 task in runqueue

            pert-5229  [003] d.L.3..   684.483621: sched_stat_runtime: comm=pert pid=5229 runtime=2096 [ns] vruntime=179050872098 [ns]
            pert-5229  [003] d...3..   684.483622: sched_switch: prev_comm=pert prev_pid=5229 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/3 next_pid=49 next_prio=120
     ksoftirqd/3-49    [003] ....111   684.483623: softirq_entry: vec=1 [action=TIMER]
     ksoftirqd/3-49    [003] ....111   684.483623: softirq_exit: vec=1 [action=TIMER]
     ksoftirqd/3-49    [003] ....111   684.483623: softirq_entry: vec=9 [action=RCU]
     ksoftirqd/3-49    [003] ....111   684.483624: softirq_exit: vec=9 [action=RCU]
     ksoftirqd/3-49    [003] d...3..   684.483625: sched_stat_runtime: comm=ksoftirqd/3 pid=49 runtime=3187 [ns] vruntime=179041873189 [ns]
     ksoftirqd/3-49    [003] d...3..   684.483625: sched_switch: prev_comm=ksoftirqd/3 prev_pid=49 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=5229 next_prio=120
            pert-5229  [003] d...3..   684.483626: sched_stat_runtime: comm=pert pid=5229 runtime=1331 [ns] vruntime=179050873429 [ns]




* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07  4:31     ` Mike Galbraith
@ 2013-11-07 11:21       ` Thomas Gleixner
  2013-11-07 12:59         ` Frederic Weisbecker
                           ` (2 more replies)
  2013-11-08  3:23       ` Paul E. McKenney
  1 sibling, 3 replies; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-07 11:21 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Peter Zijlstra, LKML, RT, Paul E. McKenney

Mike,

On Thu, 7 Nov 2013, Mike Galbraith wrote:

> On Thu, 2013-11-07 at 04:26 +0100, Mike Galbraith wrote: 
> > On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote: 
> 
> > > I bet you are trying to work around some of the side effects of the
> > > occasional tick which is still necessary despite full nohz, right?
> > 
> > Nope, I wanted to check out cost of nohz_full for rt, and found that it
> > doesn't work at all instead, looked, and found that the sole running
> > task has just awakened ksoftirqd when it wants to shut the tick down, so
> > that shutdown never happens. 
> 
> Like so in virgin 3.10-rt.  Box is x3550 M3 booted nowatchdog
> rcu_nocbs=1-3 nohz_full=1-3, and CPUs 1-3 are completely isolated via
> cpusets as well.

well, that very same problem is in mainline if you add "threadirqs" to
the command line. But we can be smart about this. The untested patch
below should address that issue. If that works on mainline we can
adapt it for RT (needs a trylock(&base->lock) there).

Though it's not a full solution. It needs some thought versus the
softirq code of timers. Assume we have only one timer queued 1000
ticks into the future. So this change will cause the timer softirq not
to be called until that timer expires and then the timer softirq is
going to do 1000 loops until it catches up with jiffies. That's
anything but pretty ...
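
For reference, the loop in question, heavily trimmed (a sketch of
kernel/timer.c:__run_timers() as of 3.x, with the cascade and expiry
details reduced to comments):

        spin_lock_irq(&base->lock);
        while (time_after_eq(jiffies, base->timer_jiffies)) {
                int index = base->timer_jiffies & TVR_MASK;

                /* on wrap, cascade tv2..tv5 down one level */
                ++base->timer_jiffies;  /* one trip per missed tick */
                /* expire everything queued in tv1.vec[index] */
        }
        spin_unlock_irq(&base->lock);

A single timer 1000 ticks out, with nothing raising the softirq in
between, thus means 1000 trips around that loop in one go.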

What worries me more is this one:

  pert-5229  [003] d..h1..   684.482618: softirq_raise: vec=9 [action=RCU]

The CPU has no callbacks as you shoved them over to cpu 0, so why is
the RCU softirq raised?

Thanks,

	tglx
------------------

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index d19a5c2..0816a50 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -445,9 +445,8 @@ extern int schedule_hrtimeout_range_clock(ktime_t *expires,
 		unsigned long delta, const enum hrtimer_mode mode, int clock);
 extern int schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
 
-/* Soft interrupt function to run the hrtimer queues: */
+/* Called from the periodic timer tick */
 extern void hrtimer_run_queues(void);
-extern void hrtimer_run_pending(void);
 
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 383319b..7359eaa 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1450,30 +1450,6 @@ static inline void __hrtimer_peek_ahead_timers(void) { }
 #endif	/* !CONFIG_HIGH_RES_TIMERS */
 
 /*
- * Called from timer softirq every jiffy, expire hrtimers:
- *
- * For HRT its the fall back code to run the softirq in the timer
- * softirq context in case the hrtimer initialization failed or has
- * not been done yet.
- */
-void hrtimer_run_pending(void)
-{
-	if (hrtimer_hres_active())
-		return;
-
-	/*
-	 * This _is_ ugly: We have to check in the softirq context,
-	 * whether we can switch to highres and / or nohz mode. The
-	 * clocksource switch happens in the timer interrupt with
-	 * xtime_lock held. Notification from there only sets the
-	 * check bit in the tick_oneshot code, otherwise we might
-	 * deadlock vs. xtime_lock.
-	 */
-	if (tick_check_oneshot_change(!hrtimer_is_hres_enabled()))
-		hrtimer_switch_to_hres();
-}
-
-/*
  * Called from hardirq context every jiffy
  */
 void hrtimer_run_queues(void)
@@ -1486,6 +1462,13 @@ void hrtimer_run_queues(void)
 	if (hrtimer_hres_active())
 		return;
 
+	/*
+	 * Check whether we can switch to highres mode.
+	 */
+	if (tick_check_oneshot_change(!hrtimer_is_hres_enabled())
+	    && hrtimer_switch_to_hres())
+		return;
+
 	for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
 		base = &cpu_base->clock_base[index];
 		if (!timerqueue_getnext(&base->active))
diff --git a/kernel/timer.c b/kernel/timer.c
index 4296d13..ba8ee94 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1370,8 +1370,6 @@ static void run_timer_softirq(struct softirq_action *h)
 {
 	struct tvec_base *base = __this_cpu_read(tvec_bases);
 
-	hrtimer_run_pending();
-
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
 }
@@ -1381,8 +1379,20 @@ static void run_timer_softirq(struct softirq_action *h)
  */
 void run_local_timers(void)
 {
+	struct tvec_base *base = __this_cpu_read(tvec_bases);
+
 	hrtimer_run_queues();
-	raise_softirq(TIMER_SOFTIRQ);
+	/*
+	 * We can access this lockless as we are in the timer
+	 * interrupt. If there are no timers queued, nothing to do in
+	 * the timer softirq.
+	 */
+	if (!base->active_timers)
+		return;
+
+	/* Check whether the next pending timer has expired */
+	if (time_before_eq(base->next_timer, jiffies))
+		raise_softirq(TIMER_SOFTIRQ);
 }
 
 #ifdef __ARCH_WANT_SYS_ALARM
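
(The RT adaptation mentioned above would presumably guard the lockless
peek with the trylock, since base->lock is a sleeping lock on RT and
run_local_timers() runs in hardirq context. A rough sketch, not a
tested patch:

        hrtimer_run_queues();
        if (!spin_trylock(&base->lock)) {
                /* base busy: be pessimistic, let the softirq decide */
                raise_softirq(TIMER_SOFTIRQ);
                return;
        }
        if (base->active_timers && time_before_eq(base->next_timer, jiffies))
                raise_softirq(TIMER_SOFTIRQ);
        spin_unlock(&base->lock);

Taking the lock outright is not an option here, as the hardirq may
have interrupted the current lock holder.)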





* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07 11:21       ` Thomas Gleixner
@ 2013-11-07 12:59         ` Frederic Weisbecker
  2013-11-07 13:13           ` Thomas Gleixner
  2013-11-07 13:07         ` CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo Mike Galbraith
  2013-12-20 15:41         ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2013-11-07 12:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Peter Zijlstra, LKML, RT, Paul E. McKenney

On Thu, Nov 07, 2013 at 12:21:11PM +0100, Thomas Gleixner wrote:
> Mike,
> 
> On Thu, 7 Nov 2013, Mike Galbraith wrote:
> 
> > On Thu, 2013-11-07 at 04:26 +0100, Mike Galbraith wrote: 
> > > On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote: 
> > 
> > > > I bet you are trying to work around some of the side effects of the
> > > > occasional tick which is still necessary despite full nohz, right?
> > > 
> > > Nope, I wanted to check out cost of nohz_full for rt, and found that it
> > > doesn't work at all instead, looked, and found that the sole running
> > > task has just awakened ksoftirqd when it wants to shut the tick down, so
> > > that shutdown never happens. 
> > 
> > Like so in virgin 3.10-rt.  Box is x3550 M3 booted nowatchdog
> > rcu_nocbs=1-3 nohz_full=1-3, and CPUs 1-3 are completely isolated via
> > cpusets as well.
> 
> well, that very same problem is in mainline if you add "threadirqs" to
> the command line. But we can be smart about this. The untested patch
> below should address that issue. If that works on mainline we can
> adapt it for RT (needs a trylock(&base->lock) there).
> 
> Though it's not a full solution. It needs some thought versus the
> softirq code of timers. Assume we have only one timer queued 1000
> ticks into the future. So this change will cause the timer softirq not
> to be called until that timer expires and then the timer softirq is
> going to do 1000 loops until it catches up with jiffies. That's
> anything but pretty ...
> 
> What worries me more is this one:
> 
>   pert-5229  [003] d..h1..   684.482618: softirq_raise: vec=9 [action=RCU]
> 
> The CPU has no callbacks as you shoved them over to cpu 0, so why is
> the RCU softirq raised?

I see, so the problem is that we raise the timer softirq unconditionally
from the tick?

Ok we definitely don't want to keep that behaviour, even if softirqs are not
threaded, that's an overhead. So I'm looking at that loop in __run_timers()
and I guess you mean the "base->timer_jiffies" incrementation?

That's indeed not pretty. How do we handle exit from long dynticks idle periods? Are we
doing that loop until we catch up with the new jiffies?

Then it relies on the timer cascade stuff which is very obscure code to me...
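
For reference, the cascade being referred to, lightly trimmed from 3.x
kernel/timer.c: when a slot index in one wheel level wraps to zero,
the matching slot of the next-coarser level is emptied and its timers
are re-filed into finer-grained slots:

        static int cascade(struct tvec_base *base, struct tvec *tv, int index)
        {
                struct timer_list *timer, *tmp;
                struct list_head tv_list;

                /* detach the whole slot in one go */
                list_replace_init(tv->vec + index, &tv_list);

                /* re-file each timer into the lower-level slot that
                 * matches its expiry time */
                list_for_each_entry_safe(timer, tmp, &tv_list, entry)
                        __internal_add_timer(base, timer);

                return index;
        }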


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07 11:21       ` Thomas Gleixner
  2013-11-07 12:59         ` Frederic Weisbecker
@ 2013-11-07 13:07         ` Mike Galbraith
  2013-12-20 15:41         ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2013-11-07 13:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Peter Zijlstra, LKML, RT, Paul E. McKenney

On Thu, 2013-11-07 at 12:21 +0100, Thomas Gleixner wrote: 
> Mike,
> 
> On Thu, 7 Nov 2013, Mike Galbraith wrote:
> 
> > On Thu, 2013-11-07 at 04:26 +0100, Mike Galbraith wrote: 
> > > On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote: 
> > 
> > > > I bet you are trying to work around some of the side effects of the
> > > > occasional tick which is still necessary despite full nohz, right?
> > > 
> > > Nope, I wanted to check out cost of nohz_full for rt, and found that it
> > > doesn't work at all instead, looked, and found that the sole running
> > > task has just awakened ksoftirqd when it wants to shut the tick down, so
> > > that shutdown never happens. 
> > 
> > Like so in virgin 3.10-rt.  Box is x3550 M3 booted nowatchdog
> > rcu_nocbs=1-3 nohz_full=1-3, and CPUs 1-3 are completely isolated via
> > cpusets as well.
> 
> well, that very same problem is in mainline if you add "threadirqs" to
> the command line. But we can be smart about this. The untested patch
> below should address that issue. If that works on mainline we can
> adapt it for RT (needs a trylock(&base->lock) there).

Oops, in haste I wedged it straight into 3.10-rt as is.  First pert
attempt was a bit weird, but it eventually worked.

rtbox:/sys/kernel/debug/tracing # !cgexec
cgexec -g cpuset:rtcpus taskset -c 3 pert 5
2400.01 MHZ CPU
perturbation threshold 0.018 usecs.
pert/s:      807 >8.52us:        2 min:  0.04 max: 10.80 avg:  5.56 sum/s:  4485us overhead: 0.45%
pert/s:      707 >8.54us:        4 min:  2.85 max: 11.78 avg:  5.63 sum/s:  3981us overhead: 0.40%
pert/s:      807 >8.51us:        2 min:  0.04 max: 10.86 avg:  5.58 sum/s:  4502us overhead: 0.45%
pert/s:      707 >8.48us:        3 min:  0.04 max: 10.82 avg:  5.59 sum/s:  3959us overhead: 0.40%
pert/s:      630 >8.73us:        5 min:  0.04 max: 16.65 avg:  5.29 sum/s:  3335us overhead: 0.33%
pert/s:      152 >9.50us:        4 min:  0.04 max: 32.58 avg:  0.37 sum/s:    56us overhead: 0.01%
pert/s:       28 >9.74us:        3 min:  0.04 max: 22.31 avg:  1.41 sum/s:    40us overhead: 0.00%
pert/s:        8 >10.02us:        4 min:  1.75 max: 20.56 avg:  4.54 sum/s:    36us overhead: 0.00%
pert/s:        7 >10.23us:        3 min:  1.82 max: 19.94 avg:  4.33 sum/s:    34us overhead: 0.00%
pert/s:        9 >10.45us:        5 min:  0.04 max: 20.79 avg:  4.11 sum/s:    38us overhead: 0.00%
pert/s:       31 >10.57us:        5 min:  0.04 max: 22.13 avg:  1.22 sum/s:    38us overhead: 0.00%
pert/s:       10 >10.77us:        5 min:  0.04 max: 21.40 avg:  3.68 sum/s:    38us overhead: 0.00%
^C
rtbox:/sys/kernel/debug/tracing # cgexec -g cpuset:rtcpus taskset -c 3 pert 5
2400.02 MHZ CPU
perturbation threshold 0.018 usecs.
pert/s:        8 >14.06us:        2 min:  1.70 max: 19.66 avg:  4.24 sum/s:    35us overhead: 0.00%
pert/s:        8 >13.97us:        3 min:  1.80 max: 21.81 avg:  4.48 sum/s:    37us overhead: 0.00%
pert/s:        8 >13.77us:        2 min:  1.77 max: 19.64 avg:  4.35 sum/s:    35us overhead: 0.00%
pert/s:        9 >13.72us:        3 min:  0.04 max: 22.03 avg:  4.35 sum/s:    39us overhead: 0.00%
pert/s:        8 >13.55us:        2 min:  1.75 max: 19.88 avg:  4.16 sum/s:    35us overhead: 0.00%
pert/s:        8 >13.43us:        3 min:  0.04 max: 20.55 avg:  4.21 sum/s:    36us overhead: 0.00%
pert/s:        8 >13.28us:        2 min:  1.74 max: 19.53 avg:  4.34 sum/s:    35us overhead: 0.00%
pert/s:        8 >13.22us:        3 min:  1.76 max: 20.96 avg:  4.35 sum/s:    37us overhead: 0.00%
pert/s:        8 >13.10us:        2 min:  1.72 max: 19.64 avg:  4.38 sum/s:    36us overhead: 0.00%
^C
rtbox:/sys/kernel/debug/tracing # cgexec -g cpuset:rtcpus taskset -c 3 pert 5
2400.03 MHZ CPU
perturbation threshold 0.018 usecs.
pert/s:        9 >14.55us:        2 min:  0.04 max: 20.93 avg:  4.11 sum/s:    37us overhead: 0.00%
pert/s:        8 >14.36us:        3 min:  1.72 max: 20.75 avg:  4.42 sum/s:    36us overhead: 0.00%
pert/s:        8 >14.14us:        2 min:  1.74 max: 20.02 avg:  4.28 sum/s:    35us overhead: 0.00%
pert/s:        8 >13.98us:        3 min:  1.77 max: 20.54 avg:  4.51 sum/s:    36us overhead: 0.00%
pert/s:        8 >13.76us:        2 min:  1.72 max: 19.57 avg:  4.17 sum/s:    35us overhead: 0.00%
pert/s:        8 >13.63us:        3 min:  1.79 max: 20.42 avg:  4.38 sum/s:    36us overhead: 0.00%
pert/s:        9 >13.51us:        2 min:  0.04 max: 20.78 avg:  4.09 sum/s:    37us overhead: 0.00%

> What worries me more is this one:
> 
>   pert-5229  [003] d..h1..   684.482618: softirq_raise: vec=9 [action=RCU]
> 
> The CPU has no callbacks as you shoved them over to cpu 0, so why is
> the RCU softirq raised?

Dunno, but it's repeatable.  Workqueues are perturbation sources too,
update_vmstat, drain_caches (or such, didn't save all traces).

-Mike



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07 12:59         ` Frederic Weisbecker
@ 2013-11-07 13:13           ` Thomas Gleixner
  2013-11-12  8:06             ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-07 13:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mike Galbraith, Peter Zijlstra, LKML, RT, Paul E. McKenney

On Thu, 7 Nov 2013, Frederic Weisbecker wrote:
> On Thu, Nov 07, 2013 at 12:21:11PM +0100, Thomas Gleixner wrote:
> > Though it's not a full solution. It needs some thought versus the
> > softirq code of timers. Assume we have only one timer queued 1000
> > ticks into the future. So this change will cause the timer softirq not
> > to be called until that timer expires and then the timer softirq is
> > going to do 1000 loops until it catches up with jiffies. That's
> > anything but pretty ...
> 
> I see, so the problem is that we raise the timer softirq unconditionally
> from the tick?

Right.
 
> Ok we definitely don't want to keep that behaviour, even if softirqs are not
> threaded, that's an overhead. So I'm looking at that loop in __run_timers()
> and I guess you mean the "base->timer_jiffies" incrementation?
>
> That's indeed not pretty. How do we handle exit from long dynticks
> idle periods? Are we doing that loop until we catch up with the new
> jiffies?

Right. I realized that right after I hit send :)
 
> Then it relies on the timer cascade stuff which is very obscure code to me...

It's not that bad, really. I have an idea how to fix that. Needs some
rewriting though.

Thanks,

	tglx



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07  4:31     ` Mike Galbraith
  2013-11-07 11:21       ` Thomas Gleixner
@ 2013-11-08  3:23       ` Paul E. McKenney
  2013-11-08  7:31         ` Mike Galbraith
  1 sibling, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2013-11-08  3:23 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Thu, Nov 07, 2013 at 05:31:08AM +0100, Mike Galbraith wrote:
> On Thu, 2013-11-07 at 04:26 +0100, Mike Galbraith wrote: 
> > On Wed, 2013-11-06 at 18:49 +0100, Thomas Gleixner wrote: 
> 
> > > I bet you are trying to work around some of the side effects of the
> > > occasional tick which is still necessary despite full nohz, right?
> > 
> > Nope, I wanted to check out cost of nohz_full for rt, and found that it
> > doesn't work at all instead, looked, and found that the sole running
> > task has just awakened ksoftirqd when it wants to shut the tick down, so
> > that shutdown never happens. 
> 
> Like so in virgin 3.10-rt.  Box is x3550 M3 booted nowatchdog
> rcu_nocbs=1-3 nohz_full=1-3, and CPUs 1-3 are completely isolated via
> cpusets as well.

Hmmm...  The fact that the CPU is getting scheduling-clock interrupts
is what is causing the RCU_SOFTIRQs.  If you take a scheduling-clock
interrupt on a CPU that RCU is waiting for, it will do raise_softirq()
in order to get the CPU to respond to RCU.

So the RCU_SOFTIRQs are a direct consequence of the scheduling-clock
interrupts.
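
Concretely (a trimmed sketch of the 3.x path): the scheduling-clock
interrupt ends up in rcu_check_callbacks(), whose tail is

        if (rcu_pending(cpu))
                invoke_rcu_core();      /* i.e. raise_softirq(RCU_SOFTIRQ) */

so every tick on a CPU for which rcu_pending() answers yes turns into
an RCU_SOFTIRQ. Short-circuiting that answer on NO_HZ_FULL CPUs,
unless a grace period has been waiting for more than a second, leaves
such CPUs to the grace-period kthread's force_quiescent_state()
processing instead.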

An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
Not sure how to help with the scheduling-clock interrupts.

							Thanx, Paul

> ... 
>             pert-5229  [003] d..h2..   684.481615: hrtimer_cancel: hrtimer=ffff88017fccbac0
>             pert-5229  [003] d..h1..   684.481616: hrtimer_expire_entry: hrtimer=ffff88017fccbac0 function=tick_sched_timer now=683820187877
>             pert-5229  [003] d..h2..   684.481616: sched_stat_runtime: comm=pert pid=5229 runtime=993917 [ns] vruntime=179048871558 [ns]
>             pert-5229  [003] d..h1..   684.481616: softirq_raise: vec=1 [action=TIMER]
>             pert-5229  [003] d..h1..   684.481616: hrtimer_expire_exit: hrtimer=ffff88017fccbac0
>             pert-5229  [003] d..h2..   684.481617: hrtimer_start: hrtimer=ffff88017fccbac0 function=tick_sched_timer expires=683821187500 softexpires=683821187500
>             pert-5229  [003] d...3..   684.481618: sched_stat_runtime: comm=pert pid=5229 runtime=1634 [ns] vruntime=179048873192 [ns]
>             pert-5229  [003] d.L.3..   684.481618: sched_wakeup: comm=ksoftirqd/3 pid=49 prio=120 success=1 target_cpu=003
>             pert-5229  [003] d.L.1..   684.481618: tick_stop: success=no msg=more than 1 task in runqueue
> 
>             pert-5229  [003] d.L.1..   684.481619: tick_stop: success=no msg=more than 1 task in runqueue
> 
>             pert-5229  [003] d.L.3..   684.481620: sched_stat_runtime: comm=pert pid=5229 runtime=2096 [ns] vruntime=179048875288 [ns]
>             pert-5229  [003] d...3..   684.481620: sched_switch: prev_comm=pert prev_pid=5229 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/3 next_pid=49 next_prio=120
>      ksoftirqd/3-49    [003] ....111   684.481621: softirq_entry: vec=1 [action=TIMER]
>      ksoftirqd/3-49    [003] ....111   684.481621: softirq_exit: vec=1 [action=TIMER]
>      ksoftirqd/3-49    [003] d...3..   684.481622: sched_stat_runtime: comm=ksoftirqd/3 pid=49 runtime=1934 [ns] vruntime=179039875126 [ns]
>      ksoftirqd/3-49    [003] d...3..   684.481622: sched_switch: prev_comm=ksoftirqd/3 prev_pid=49 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=5229 next_prio=120
>             pert-5229  [003] d...3..   684.481623: sched_stat_runtime: comm=pert pid=5229 runtime=1289 [ns] vruntime=179048876577 [ns]
>             pert-5229  [003] d..h2..   684.482616: hrtimer_cancel: hrtimer=ffff88017fccbac0
>             pert-5229  [003] d..h1..   684.482617: hrtimer_expire_entry: hrtimer=ffff88017fccbac0 function=tick_sched_timer now=683821188024
>             pert-5229  [003] d..h2..   684.482617: sched_stat_runtime: comm=pert pid=5229 runtime=994281 [ns] vruntime=179049870858 [ns]
>             pert-5229  [003] d..h1..   684.482617: softirq_raise: vec=1 [action=TIMER]
>             pert-5229  [003] d..h1..   684.482618: softirq_raise: vec=9 [action=RCU]
>             pert-5229  [003] d..h1..   684.482618: hrtimer_expire_exit: hrtimer=ffff88017fccbac0
>             pert-5229  [003] d..h2..   684.482618: hrtimer_start: hrtimer=ffff88017fccbac0 function=tick_sched_timer expires=683822187500 softexpires=683822187500
>             pert-5229  [003] d...3..   684.482619: sched_stat_runtime: comm=pert pid=5229 runtime=1719 [ns] vruntime=179049872577 [ns]
>             pert-5229  [003] d.L.3..   684.482619: sched_wakeup: comm=ksoftirqd/3 pid=49 prio=120 success=1 target_cpu=003
>             pert-5229  [003] d.L.1..   684.482619: tick_stop: success=no msg=more than 1 task in runqueue
> 
>             pert-5229  [003] d.L.1..   684.482620: tick_stop: success=no msg=more than 1 task in runqueue
> 
>             pert-5229  [003] d.L.3..   684.482621: sched_stat_runtime: comm=pert pid=5229 runtime=2204 [ns] vruntime=179049874781 [ns]
>             pert-5229  [003] d...3..   684.482621: sched_switch: prev_comm=pert prev_pid=5229 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/3 next_pid=49 next_prio=120
>      ksoftirqd/3-49    [003] ....111   684.482622: softirq_entry: vec=1 [action=TIMER]
>      ksoftirqd/3-49    [003] ....111   684.482623: softirq_exit: vec=1 [action=TIMER]
>      ksoftirqd/3-49    [003] ....111   684.482623: softirq_entry: vec=9 [action=RCU]
>      ksoftirqd/3-49    [003] ....111   684.482624: softirq_exit: vec=9 [action=RCU]
>      ksoftirqd/3-49    [003] d...3..   684.482624: sched_stat_runtime: comm=ksoftirqd/3 pid=49 runtime=3049 [ns] vruntime=179040875626 [ns]
>      ksoftirqd/3-49    [003] d...3..   684.482624: sched_switch: prev_comm=ksoftirqd/3 prev_pid=49 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=5229 next_prio=120
>             pert-5229  [003] d...3..   684.482626: sched_stat_runtime: comm=pert pid=5229 runtime=1422 [ns] vruntime=179049876203 [ns]
>             pert-5229  [003] d..h2..   684.483617: hrtimer_cancel: hrtimer=ffff88017fccbac0
>             pert-5229  [003] d..h1..   684.483617: hrtimer_expire_entry: hrtimer=ffff88017fccbac0 function=tick_sched_timer now=683822187923
>             pert-5229  [003] d..h2..   684.483618: sched_stat_runtime: comm=pert pid=5229 runtime=992021 [ns] vruntime=179050868224 [ns]
>             pert-5229  [003] d..h1..   684.483618: softirq_raise: vec=1 [action=TIMER]
>             pert-5229  [003] d..h1..   684.483618: softirq_raise: vec=9 [action=RCU]
>             pert-5229  [003] d..h1..   684.483618: hrtimer_expire_exit: hrtimer=ffff88017fccbac0
>             pert-5229  [003] d..h2..   684.483618: hrtimer_start: hrtimer=ffff88017fccbac0 function=tick_sched_timer expires=683823187500 softexpires=683823187500
>             pert-5229  [003] d...3..   684.483619: sched_stat_runtime: comm=pert pid=5229 runtime=1778 [ns] vruntime=179050870002 [ns]
>             pert-5229  [003] d.L.3..   684.483620: sched_wakeup: comm=ksoftirqd/3 pid=49 prio=120 success=1 target_cpu=003
>             pert-5229  [003] d.L.1..   684.483620: tick_stop: success=no msg=more than 1 task in runqueue
> 
>             pert-5229  [003] d.L.1..   684.483621: tick_stop: success=no msg=more than 1 task in runqueue
> 
>             pert-5229  [003] d.L.3..   684.483621: sched_stat_runtime: comm=pert pid=5229 runtime=2096 [ns] vruntime=179050872098 [ns]
>             pert-5229  [003] d...3..   684.483622: sched_switch: prev_comm=pert prev_pid=5229 prev_prio=120 prev_state=R+ ==> next_comm=ksoftirqd/3 next_pid=49 next_prio=120
>      ksoftirqd/3-49    [003] ....111   684.483623: softirq_entry: vec=1 [action=TIMER]
>      ksoftirqd/3-49    [003] ....111   684.483623: softirq_exit: vec=1 [action=TIMER]
>      ksoftirqd/3-49    [003] ....111   684.483623: softirq_entry: vec=9 [action=RCU]
>      ksoftirqd/3-49    [003] ....111   684.483624: softirq_exit: vec=9 [action=RCU]
>      ksoftirqd/3-49    [003] d...3..   684.483625: sched_stat_runtime: comm=ksoftirqd/3 pid=49 runtime=3187 [ns] vruntime=179041873189 [ns]
>      ksoftirqd/3-49    [003] d...3..   684.483625: sched_switch: prev_comm=ksoftirqd/3 prev_pid=49 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=5229 next_prio=120
>             pert-5229  [003] d...3..   684.483626: sched_stat_runtime: comm=pert pid=5229 runtime=1331 [ns] vruntime=179050873429 [ns]

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 32618b3fe4e6..ee4de3f6a77e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2658,6 +2658,10 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 	/* Check for CPU stalls, if enabled. */
 	check_cpu_stall(rsp, rdp);
 
+	/* Is this CPU a NO_HZ_FULL CPU that should ignore RCU? */
+	if (rcu_nohz_full_cpu(rsp))
+		return 0;
+
 	/* Is the RCU core waiting for a quiescent state from this CPU? */
 	if (rcu_scheduler_fully_active &&
 	    rdp->qs_pending && !rdp->passed_quiesce) {
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 5f97eab602cd..42cdc77dc7f1 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -562,6 +562,7 @@ static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
 				  unsigned long maxj);
 static void rcu_bind_gp_kthread(void);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
+static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
 
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 130c97b027f2..3e7087d3b9df 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2795,3 +2795,23 @@ static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp)
 }
 
 #endif /* #else #ifdef CONFIG_NO_HZ_FULL_SYSIDLE */
+
+/*
+ * Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
+ * grace-period kthread will do force_quiescent_state() processing?
+ * The idea is to avoid waking up RCU core processing on such a
+ * CPU unless the grace period has extended for too long.
+ *
+ * This code relies on the fact that all NO_HZ_FULL CPUs are also
+ * CONFIG_RCU_NOCB_CPUs.
+ */
+static bool rcu_nohz_full_cpu(struct rcu_state *rsp)
+{
+#ifdef CONFIG_NO_HZ_FULL
+	if (tick_nohz_full_cpu(smp_processor_id()) &&
+	    (!rcu_gp_in_progress(rsp) ||
+	     ULONG_CMP_LT(jiffies, ACCESS_ONCE(rsp->gp_start) + HZ)))
+		return 1;
+#endif /* #ifdef CONFIG_NO_HZ_FULL */
+	return 0;
+}



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08  3:23       ` Paul E. McKenney
@ 2013-11-08  7:31         ` Mike Galbraith
  2013-11-08 12:37           ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-11-08  7:31 UTC (permalink / raw)
  To: paulmck; +Cc: Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:

> An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.

Yup, toasted.

-Mike



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08  7:31         ` Mike Galbraith
@ 2013-11-08 12:37           ` Paul E. McKenney
  2013-11-08 13:26             ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2013-11-08 12:37 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> 
> > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> 
> Yup, toasted.

You lost me on this one.  If my patch broke your system, any chance of
any diagnostic information?

							Thanx, Paul



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 12:37           ` Paul E. McKenney
@ 2013-11-08 13:26             ` Mike Galbraith
  2013-11-08 14:03               ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-11-08 13:26 UTC (permalink / raw)
  To: paulmck; +Cc: Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > 
> > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > 
> > Yup, toasted.
> 
> You lost me on this one.  If my patch broke your system, any chance of
> any diagnostic information?

No no, the RCU_SOFTIRQs are toast, history, gone.

-Mike



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 13:26             ` Mike Galbraith
@ 2013-11-08 14:03               ` Paul E. McKenney
  2013-11-08 14:21                 ` Mike Galbraith
  2013-11-08 14:29                 ` Frederic Weisbecker
  0 siblings, 2 replies; 39+ messages in thread
From: Paul E. McKenney @ 2013-11-08 14:03 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Fri, Nov 08, 2013 at 02:26:28PM +0100, Mike Galbraith wrote:
> On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> > On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > > 
> > > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > > 
> > > Yup, toasted.
> > 
> > You lost me on this one.  If my patch broke your system, any chance of
> > any diagnostic information?
> 
> No no, the RCU_SOFTIRQs are toast, history, gone.

Ah, I do like that outcome much better!  ;-)  The scheduling-clock
interrupts are gone (or at least reduced) as well?

And does this mean that I can have your Tested-by?

							Thanx, Paul



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 14:03               ` Paul E. McKenney
@ 2013-11-08 14:21                 ` Mike Galbraith
  2013-11-08 14:29                 ` Frederic Weisbecker
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2013-11-08 14:21 UTC (permalink / raw)
  To: paulmck; +Cc: Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra, LKML, RT

On Fri, 2013-11-08 at 06:03 -0800, Paul E. McKenney wrote: 
> On Fri, Nov 08, 2013 at 02:26:28PM +0100, Mike Galbraith wrote:
> > On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> > > On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > > > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > > > 
> > > > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > > > 
> > > > Yup, toasted.
> > > 
> > > You lost me on this one.  If my patch broke your system, any chance of
> > > any diagnostic information?
> > 
> > No no, the RCU_SOFTIRQs are toast, history, gone.
> 
> Ah, I do like that outcome much better!  ;-)  The scheduling-clock
> interrupts are gone (or at least reduced) as well?

No, tick shutdown needs encouragement to succeed, but there were no
RCU_SOFTIRQs on rcu_nocbs= CPUs.

> And does this mean that I can have your Tested-by?

Tested-by: Mike Galbraith <bitbucket@online.de>



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 14:03               ` Paul E. McKenney
  2013-11-08 14:21                 ` Mike Galbraith
@ 2013-11-08 14:29                 ` Frederic Weisbecker
  2013-11-08 14:45                   ` Paul E. McKenney
  2013-11-08 14:53                   ` Mike Galbraith
  1 sibling, 2 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2013-11-08 14:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mike Galbraith, Thomas Gleixner, Peter Zijlstra, LKML, RT

On Fri, Nov 08, 2013 at 06:03:31AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 08, 2013 at 02:26:28PM +0100, Mike Galbraith wrote:
> > On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> > > On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > > > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > > > 
> > > > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > > > 
> > > > Yup, toasted.
> > > 
> > > You lost me on this one.  If my patch broke your system, any chance of
> > > any diagnostic information?
> > 
> > No no, the RCU_SOFTIRQs are toast, history, gone.
> 
> Ah, I do like that outcome much better!  ;-)  The scheduling-clock
> interrupts are gone (or at least reduced) as well?
> 
> And does this mean that I can have your Tested-by?

What's wrong with Toasted-by: ?  ;)


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 14:29                 ` Frederic Weisbecker
@ 2013-11-08 14:45                   ` Paul E. McKenney
  2013-11-08 16:03                     ` Frederic Weisbecker
  2013-11-08 14:53                   ` Mike Galbraith
  1 sibling, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2013-11-08 14:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Mike Galbraith, Thomas Gleixner, Peter Zijlstra, LKML, RT

On Fri, Nov 08, 2013 at 03:29:38PM +0100, Frederic Weisbecker wrote:
> On Fri, Nov 08, 2013 at 06:03:31AM -0800, Paul E. McKenney wrote:
> > On Fri, Nov 08, 2013 at 02:26:28PM +0100, Mike Galbraith wrote:
> > > On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> > > > On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > > > > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > > > > 
> > > > > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > > > > 
> > > > > Yup, toasted.
> > > > 
> > > > You lost me on this one.  If my patch broke your system, any chance of
> > > > any diagnostic information?
> > > 
> > > No no, the RCU_SOFTIRQs are toast, history, gone.
> > 
> > Ah, I do like that outcome much better!  ;-)  The scheduling-clock
> > interrupts are gone (or at least reduced) as well?
> > 
> > And does this mean that I can have your Tested-by?
> 
> What's wrong with Toasted-by: ?  ;)

Does this mean that I can have your Toasted by?  ;-)

							Thanx, Paul



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 14:29                 ` Frederic Weisbecker
  2013-11-08 14:45                   ` Paul E. McKenney
@ 2013-11-08 14:53                   ` Mike Galbraith
  2013-11-08 16:04                     ` Frederic Weisbecker
  1 sibling, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-11-08 14:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Thomas Gleixner, Peter Zijlstra, LKML, RT

On Fri, 2013-11-08 at 15:29 +0100, Frederic Weisbecker wrote: 
> On Fri, Nov 08, 2013 at 06:03:31AM -0800, Paul E. McKenney wrote:
> > On Fri, Nov 08, 2013 at 02:26:28PM +0100, Mike Galbraith wrote:
> > > On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> > > > On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > > > > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > > > > 
> > > > > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > > > > 
> > > > > Yup, toasted.
> > > > 
> > > > You lost me on this one.  If my patch broke your system, any chance of
> > > > any diagnostic information?
> > > 
> > > No no, the RCU_SOFTIRQs are toast, history, gone.
> > 
> > Ah, I do like that outcome much better!  ;-)  The scheduling-clock
> > interrupts are gone (or at least reduced) as well?
> > 
> > And does this mean that I can have your Tested-by?
> 
> What's wrong with Toasted-by: ?  ;)

It sprang to mind, but I resisted :)



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 14:45                   ` Paul E. McKenney
@ 2013-11-08 16:03                     ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2013-11-08 16:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mike Galbraith, Thomas Gleixner, Peter Zijlstra, LKML, RT

On Fri, Nov 08, 2013 at 06:45:34AM -0800, Paul E. McKenney wrote:
> On Fri, Nov 08, 2013 at 03:29:38PM +0100, Frederic Weisbecker wrote:
> > On Fri, Nov 08, 2013 at 06:03:31AM -0800, Paul E. McKenney wrote:
> > > On Fri, Nov 08, 2013 at 02:26:28PM +0100, Mike Galbraith wrote:
> > > > On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> > > > > On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > > > > > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > > > > > 
> > > > > > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > > > > > 
> > > > > > Yup, toasted.
> > > > > 
> > > > > You lost me on this one.  If my patch broke your system, any chance of
> > > > > any diagnostic information?
> > > > 
> > > > No no, the RCU_SOFTIRQs are toast, history, gone.
> > > 
> > > Ah, I do like that outcome much better!  ;-)  The scheduling-clock
> > > interrupts are gone (or at least reduced) as well?
> > > 
> > > And does this mean that I can have your Tested-by?
> > 
> > What's wrong with Toasted-by: ?  ;)
> 
> Does this mean that I can have your Toasted by?  ;-)

Sure that was assumed ;)


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-08 14:53                   ` Mike Galbraith
@ 2013-11-08 16:04                     ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2013-11-08 16:04 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paul E. McKenney, Thomas Gleixner, Peter Zijlstra, LKML, RT

On Fri, Nov 08, 2013 at 03:53:47PM +0100, Mike Galbraith wrote:
> On Fri, 2013-11-08 at 15:29 +0100, Frederic Weisbecker wrote: 
> > On Fri, Nov 08, 2013 at 06:03:31AM -0800, Paul E. McKenney wrote:
> > > On Fri, Nov 08, 2013 at 02:26:28PM +0100, Mike Galbraith wrote:
> > > > On Fri, 2013-11-08 at 04:37 -0800, Paul E. McKenney wrote: 
> > > > > On Fri, Nov 08, 2013 at 08:31:20AM +0100, Mike Galbraith wrote:
> > > > > > On Thu, 2013-11-07 at 19:23 -0800, Paul E. McKenney wrote:
> > > > > > 
> > > > > > > An untested patch that gets rid of the RCU_SOFTIRQs in this case is below.
> > > > > > 
> > > > > > Yup, toasted.
> > > > > 
> > > > > You lost me on this one.  If my patch broke your system, any chance of
> > > > > any diagnostic information?
> > > > 
> > > > No no, the RCU_SOFTIRQs are toast, history, gone.
> > > 
> > > Ah, I do like that outcome much better!  ;-)  The scheduling-clock
> > > interrupts are gone (or at least reduced) as well?
> > > 
> > > And does this mean that I can have your Tested-by?
> > 
> > What's wrong with Toasted-by: ?  ;)
> 
> It sprang to mind, but I resisted :)

Yeah I'm too weak with these things :)


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07 13:13           ` Thomas Gleixner
@ 2013-11-12  8:06             ` Mike Galbraith
  2013-11-12  9:28               ` Thomas Gleixner
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-11-12  8:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Frederic Weisbecker, Peter Zijlstra, LKML, RT, Paul E. McKenney

On Thu, 2013-11-07 at 14:13 +0100, Thomas Gleixner wrote: 
> On Thu, 7 Nov 2013, Frederic Weisbecker wrote:
> > On Thu, Nov 07, 2013 at 12:21:11PM +0100, Thomas Gleixner wrote:
> > > Though it's not a full solution. It needs some thought versus the
> > > softirq code of timers. Assume we have only one timer queued 1000
> > > ticks into the future. So this change will cause the timer softirq not
> > > to be called until that timer expires and then the timer softirq is
> > > going to do 1000 loops until it catches up with jiffies. That's
> > > anything but pretty ...
> > 
> > I see, so the problem is that we raise the timer softirq unconditionally
> > from the tick?
> 
> Right.
>  
> > Ok we definitely don't want to keep that behaviour, even if softirqs are not
> > threaded, that's an overhead. So I'm looking at that loop in __run_timers()
> > and I guess you mean the "base->timer_jiffies" incrementation?
> >
> > That's indeed not pretty. How do we handle exit from long dynticks
> > idle periods? Are we doing that loop until we catch up with the new
> > jiffies?
> 
> Right. I realized that right after I hit send :)
>  
> > Then it relies on the timer cascade stuff which is very obscure code to me...
> 
> It's not that bad, really. I have an idea how to fix that. Needs some
> rewriting though.

FYI, shiny new (and virgin) 3.12.0-rt1 nohz_full config is deadlock
prone.  I was measuring fastpath cost yesterday with pinned pipe-test.. 

x3550 M3 E5620 (bloatware config)
CONFIG_NO_HZ_IDLE - CPU3
2.957012 usecs/loop -- avg 2.957012 676.4 KHz     1.000

CONFIG_NO_HZ_FULL - CPU1 != nohz_full
3.735279 usecs/loop -- avg 3.735279 535.4 KHz      .791

CONFIG_NO_HZ_FULL - CPU3 == nohz_full
5.922986 usecs/loop -- avg 5.922986 337.7 KHz      .499

(ow)

..and noticed box eventually deadlocks, if it boots, which the instance
below didn't.

crash> bt
PID: 11     TASK: ffff88017a27d5a0  CPU: 2   COMMAND: "rcu_preempt"
 #0 [ffff88017b245ae0] machine_kexec at ffffffff810392f1
 #1 [ffff88017b245b40] crash_kexec at ffffffff810cd9d5
 #2 [ffff88017b245c10] panic at ffffffff815bea93
 #3 [ffff88017b245c90] watchdog_overflow_callback.part.3 at ffffffff810f4fd2
 #4 [ffff88017b245ca0] __perf_event_overflow at ffffffff8112715c
 #5 [ffff88017b245d10] intel_pmu_handle_irq at ffffffff8101f432
 #6 [ffff88017b245e00] perf_event_nmi_handler at ffffffff815d4732
 #7 [ffff88017b245e20] nmi_handle.isra.4 at ffffffff815d3dad
 #8 [ffff88017b245eb0] default_do_nmi at ffffffff815d4099
 #9 [ffff88017b245ee0] do_nmi at ffffffff815d42b8
#10 [ffff88017b245ef0] end_repeat_nmi at ffffffff815d31b1
    [exception RIP: _raw_spin_lock+38]
    RIP: ffffffff815d2596  RSP: ffff88017b243e90  RFLAGS: 00000093
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000093
    RDX: ffff88017b243e90  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff815d2596   R8: ffffffff815d2596   R9: 0000000000000018
    R10: ffff88017b243e90  R11: 0000000000000093  R12: ffffffffffffffff
    R13: ffff880179ef8000  R14: 0000000000000001  R15: 0000000000000eb6
    ORIG_RAX: 0000000000000eb6  CS: 0010  SS: 0018
--- <RT exception stack> ---
#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
--- <IRQ stack> ---
#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
    [exception RIP: task_blocks_on_rt_mutex+51]
    RIP: ffffffff810c1ed3  RSP: ffff88017a2a9c78  RFLAGS: 00000296
    RAX: 0000000000080000  RBX: 0000000000000001  RCX: 0000000000000000
    RDX: ffff88017a27d5a0  RSI: ffff88017a2a9d00  RDI: ffff880179ef8000
    RBP: ffff880179ef8000   R8: ffff880179cfef50   R9: ffff880179ef8018
    R10: ffff880179cfef51  R11: 0000000000000002  R12: 0000000000000001
    R13: 0000000000000001  R14: 0000000100000000  R15: 0000000100000000
    ORIG_RAX: ffffffffffffff02  CS: 0010  SS: 0018
#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c
crash>




* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-12  8:06             ` Mike Galbraith
@ 2013-11-12  9:28               ` Thomas Gleixner
  2013-11-15 16:30                 ` [PATCH] rtmutex: take the waiter lock with irqs off Sebastian Andrzej Siewior
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Gleixner @ 2013-11-12  9:28 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Frederic Weisbecker, Peter Zijlstra, LKML, RT, Paul E. McKenney,
	Sebastian Siewior

On Tue, 12 Nov 2013, Mike Galbraith wrote:
> FYI, shiny new (and virgin) 3.12.0-rt1 nohz_full config is deadlock
> prone.  I was measuring fastpath cost yesterday with pinned pipe-test.. 
> 
> x3550 M3 E5620 (bloatware config)
> CONFIG_NO_HZ_IDLE - CPU3
> 2.957012 usecs/loop -- avg 2.957012 676.4 KHz     1.000
> 
> CONFIG_NO_HZ_FULL - CPU1 != nohz_full
> 3.735279 usecs/loop -- avg 3.735279 535.4 KHz      .791
> 
> CONFIG_NO_HZ_FULL - CPU3 == nohz_full
> 5.922986 usecs/loop -- avg 5.922986 337.7 KHz      .499
> 
> (ow)
> 
> ..and noticed box eventually deadlocks, if it boots, which the instance
> below didn't.
 
> --- <RT exception stack> ---
> #11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
> #12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
> #13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
> #14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
> #15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
> #16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
> #17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
> --- <IRQ stack> ---
> #18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
>     [exception RIP: task_blocks_on_rt_mutex+51]
>     RIP: ffffffff810c1ed3  RSP: ffff88017a2a9c78  RFLAGS: 00000296
>     RAX: 0000000000080000  RBX: 0000000000000001  RCX: 0000000000000000
>     RDX: ffff88017a27d5a0  RSI: ffff88017a2a9d00  RDI: ffff880179ef8000
>     RBP: ffff880179ef8000   R8: ffff880179cfef50   R9: ffff880179ef8018
>     R10: ffff880179cfef51  R11: 0000000000000002  R12: 0000000000000001
>     R13: 0000000000000001  R14: 0000000100000000  R15: 0000000100000000
>     ORIG_RAX: ffffffffffffff02  CS: 0010  SS: 0018
> #19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c

Grr....
 
     raw_spin_lock(&lock->wait_lock);

Ditto in rt_mutex_slowtrylock()

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH] rtmutex: take the waiter lock with irqs off
  2013-11-12  9:28               ` Thomas Gleixner
@ 2013-11-15 16:30                 ` Sebastian Andrzej Siewior
  2013-11-15 20:14                   ` [PATCH v2] " Sebastian Andrzej Siewior
  0 siblings, 1 reply; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-11-15 16:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Frederic Weisbecker, Peter Zijlstra, LKML, RT,
	Paul E. McKenney

Mike Galbraith captured the following:
| >#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
| >#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
| >#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
| >#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
| >#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
| >#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
| >#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
| >--- <IRQ stack> ---
| >#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
| >    [exception RIP: task_blocks_on_rt_mutex+51]
| >#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
| >#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
| >#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
| >#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
| >#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
| >#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c

lock_timer_base() does a try_lock() which deadlocks on the waiter lock,
not the lock itself.
This patch makes sure all users of the waiter_lock take the lock with
interrupts off so a try_lock from irq context is possible.
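
For reference, a minimal sketch of the deadlock as reconstructed from the
backtrace above (an illustration of the flow, not actual kernel code):

	/* CPU0, task context, hard irqs still enabled */
	lock_timer_base()
	  rt_spin_lock_slowlock(&base->lock)
	    raw_spin_lock(&lock->wait_lock);	/* wait_lock held, irqs on */
	    task_blocks_on_rt_mutex(...);	/* <-- interrupt fires here */

	/* CPU0, exit path of that very interrupt */
	irq_exit()
	  tick_nohz_irq_exit()
	    tick_nohz_stop_sched_tick()
	      get_next_timer_interrupt()
	        rt_mutex_trylock(&base->lock)
	          raw_spin_lock(&lock->wait_lock);	/* spins forever: the
							 * holder is the task
							 * interrupted above */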

rt_spin_lock_slowlock() uses _irqsave() instead of _irq() because during
system boot it seems to be called a few times with irqs disabled.

Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/futex.c   |  16 +++---
 kernel/rtmutex.c | 156 +++++++++++++++++++++++++++++--------------------------
 2 files changed, 89 insertions(+), 83 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0ef419d..404d0bd 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -891,7 +891,7 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 	if (pi_state->owner != current)
 		return -EINVAL;
 
-	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
@@ -917,21 +917,21 @@ static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
 		else if (curval != uval)
 			ret = -EINVAL;
 		if (ret) {
-			raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 			return ret;
 		}
 	}
 
-	raw_spin_lock_irq(&pi_state->owner->pi_lock);
+	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
 	list_del_init(&pi_state->list);
-	raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	raw_spin_unlock(&pi_state->owner->pi_lock);
 
-	raw_spin_lock_irq(&new_owner->pi_lock);
+	raw_spin_lock(&new_owner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &new_owner->pi_state_list);
 	pi_state->owner = new_owner;
-	raw_spin_unlock_irq(&new_owner->pi_lock);
+	raw_spin_unlock(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
 	rt_mutex_unlock(&pi_state->pi_mutex);
@@ -1762,11 +1762,11 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
 		 * we returned due to timeout or signal without taking the
 		 * rt_mutex. Too late.
 		 */
-		raw_spin_lock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
 		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
 		if (!owner)
 			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
 		ret = fixup_pi_state_owner(uaddr, q, owner);
 		goto out;
 	}
diff --git a/kernel/rtmutex.c b/kernel/rtmutex.c
index b4c651e..d2d4a33 100644
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -298,7 +298,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	plist_add(&waiter->list_entry, &lock->wait_list);
 
 	/* Release the task */
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	raw_spin_unlock(&task->pi_lock);
 	if (!rt_mutex_owner(lock)) {
 		struct rt_mutex_waiter *lock_top_waiter;
 
@@ -309,7 +309,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		lock_top_waiter = rt_mutex_top_waiter(lock);
 		if (top_waiter != lock_top_waiter)
 			rt_mutex_wake_waiter(lock_top_waiter);
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		goto out_put_task;
 	}
 	put_task_struct(task);
@@ -317,7 +317,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 	/* Grab the next task */
 	task = rt_mutex_owner(lock);
 	get_task_struct(task);
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	raw_spin_lock(&task->pi_lock);
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
@@ -335,10 +335,10 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task,
 		__rt_mutex_adjust_prio(task);
 	}
 
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	raw_spin_unlock(&task->pi_lock);
 
 	top_waiter = rt_mutex_top_waiter(lock);
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	if (!detect_deadlock && waiter != top_waiter)
 		goto out_put_task;
@@ -425,10 +425,9 @@ __try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 	/* We got the lock. */
 
 	if (waiter || rt_mutex_has_waiters(lock)) {
-		unsigned long flags;
 		struct rt_mutex_waiter *top;
 
-		raw_spin_lock_irqsave(&task->pi_lock, flags);
+		raw_spin_lock(&task->pi_lock);
 
 		/* remove the queued waiter. */
 		if (waiter) {
@@ -445,7 +444,7 @@ __try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 			top->pi_list_entry.prio = top->list_entry.prio;
 			plist_add(&top->pi_list_entry, &task->pi_waiters);
 		}
-		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+		raw_spin_unlock(&task->pi_lock);
 	}
 
 	debug_rt_mutex_lock(lock);
@@ -474,14 +473,13 @@ try_to_take_rt_mutex(struct rt_mutex *lock, struct task_struct *task,
 static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 				   struct rt_mutex_waiter *waiter,
 				   struct task_struct *task,
-				   int detect_deadlock)
+				   int detect_deadlock, unsigned long *flags)
 {
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex_waiter *top_waiter = waiter;
-	unsigned long flags;
 	int chain_walk = 0, res;
 
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	raw_spin_lock(&task->pi_lock);
 
 	/*
 	 * In the case of futex requeue PI, this will be a proxy
@@ -493,7 +491,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	 * the task if PI_WAKEUP_INPROGRESS is set.
 	 */
 	if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
-		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+		raw_spin_unlock(&task->pi_lock);
 		return -EAGAIN;
 	}
 
@@ -512,20 +510,20 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 
 	task->pi_blocked_on = waiter;
 
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	raw_spin_unlock(&task->pi_lock);
 
 	if (!owner)
 		return 0;
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
-		raw_spin_lock_irqsave(&owner->pi_lock, flags);
+		raw_spin_lock(&owner->pi_lock);
 		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
 		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
 
 		__rt_mutex_adjust_prio(owner);
 		if (rt_mutex_real_waiter(owner->pi_blocked_on))
 			chain_walk = 1;
-		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
+		raw_spin_unlock(&owner->pi_lock);
 	}
 	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
 		chain_walk = 1;
@@ -540,12 +538,12 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 	 */
 	get_task_struct(owner);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
 
 	res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
 					 task);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, *flags);
 
 	return res;
 }
@@ -560,9 +558,8 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock,
 static void wakeup_next_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *waiter;
-	unsigned long flags;
 
-	raw_spin_lock_irqsave(&current->pi_lock, flags);
+	raw_spin_lock(&current->pi_lock);
 
 	waiter = rt_mutex_top_waiter(lock);
 
@@ -576,7 +573,7 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
 
 	rt_mutex_set_owner(lock, NULL);
 
-	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
+	raw_spin_unlock(&current->pi_lock);
 
 	rt_mutex_wake_waiter(waiter);
 }
@@ -588,24 +585,24 @@ static void wakeup_next_waiter(struct rt_mutex *lock)
  * have just failed to try_to_take_rt_mutex().
  */
 static void remove_waiter(struct rt_mutex *lock,
-			  struct rt_mutex_waiter *waiter)
+			  struct rt_mutex_waiter *waiter,
+			  unsigned long *flags)
 {
 	int first = (waiter == rt_mutex_top_waiter(lock));
 	struct task_struct *owner = rt_mutex_owner(lock);
-	unsigned long flags;
 	int chain_walk = 0;
 
-	raw_spin_lock_irqsave(&current->pi_lock, flags);
+	raw_spin_lock(&current->pi_lock);
 	plist_del(&waiter->list_entry, &lock->wait_list);
 	current->pi_blocked_on = NULL;
-	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
+	raw_spin_unlock(&current->pi_lock);
 
 	if (!owner)
 		return;
 
 	if (first) {
 
-		raw_spin_lock_irqsave(&owner->pi_lock, flags);
+		raw_spin_lock(&owner->pi_lock);
 
 		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
 
@@ -620,7 +617,7 @@ static void remove_waiter(struct rt_mutex *lock,
 		if (rt_mutex_real_waiter(owner->pi_blocked_on))
 			chain_walk = 1;
 
-		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
+		raw_spin_unlock(&owner->pi_lock);
 	}
 
 	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
@@ -631,11 +628,11 @@ static void remove_waiter(struct rt_mutex *lock,
 	/* gets dropped in rt_mutex_adjust_prio_chain()! */
 	get_task_struct(owner);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
 
 	rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, *flags);
 }
 
 /*
@@ -723,9 +720,6 @@ static int adaptive_wait(struct rt_mutex *lock,
 }
 #endif
 
-# define pi_lock(lock)			raw_spin_lock_irq(lock)
-# define pi_unlock(lock)		raw_spin_unlock_irq(lock)
-
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -737,15 +731,16 @@ static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
 {
 	struct task_struct *lock_owner, *self = current;
 	struct rt_mutex_waiter waiter, *top_waiter;
+	unsigned long flags;
 	int ret;
 
 	rt_mutex_init_waiter(&waiter, true);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
 	if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return;
 	}
 
@@ -757,12 +752,12 @@ static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
 	 * as well. We are serialized via pi_lock against wakeups. See
 	 * try_to_wake_up().
 	 */
-	pi_lock(&self->pi_lock);
+	raw_spin_lock(&self->pi_lock);
 	self->saved_state = self->state;
 	__set_current_state(TASK_UNINTERRUPTIBLE);
-	pi_unlock(&self->pi_lock);
+	raw_spin_unlock(&self->pi_lock);
 
-	ret = task_blocks_on_rt_mutex(lock, &waiter, self, 0);
+	ret = task_blocks_on_rt_mutex(lock, &waiter, self, 0, &flags);
 	BUG_ON(ret);
 
 	for (;;) {
@@ -773,18 +768,18 @@ static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
 		top_waiter = rt_mutex_top_waiter(lock);
 		lock_owner = rt_mutex_owner(lock);
 
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
 		if (top_waiter != &waiter || adaptive_wait(lock, lock_owner))
 			schedule_rt_mutex(lock);
 
-		raw_spin_lock(&lock->wait_lock);
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
-		pi_lock(&self->pi_lock);
+		raw_spin_lock(&self->pi_lock);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
-		pi_unlock(&self->pi_lock);
+		raw_spin_unlock(&self->pi_lock);
 	}
 
 	/*
@@ -794,10 +789,10 @@ static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
 	 * happened while we were blocked. Clear saved_state so
 	 * try_to_wakeup() does not get confused.
 	 */
-	pi_lock(&self->pi_lock);
+	raw_spin_lock(&self->pi_lock);
 	__set_current_state(self->saved_state);
 	self->saved_state = TASK_RUNNING;
-	pi_unlock(&self->pi_lock);
+	raw_spin_unlock(&self->pi_lock);
 
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit
@@ -808,7 +803,7 @@ static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
 	BUG_ON(rt_mutex_has_waiters(lock) && &waiter == rt_mutex_top_waiter(lock));
 	BUG_ON(!plist_node_empty(&waiter.list_entry));
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	debug_rt_mutex_free_waiter(&waiter);
 }
@@ -818,7 +813,9 @@ static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock)
  */
 static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
 {
-	raw_spin_lock(&lock->wait_lock);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
 	debug_rt_mutex_unlock(lock);
 
@@ -826,13 +823,13 @@ static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
 
 	if (!rt_mutex_has_waiters(lock)) {
 		lock->owner = NULL;
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return;
 	}
 
 	wakeup_next_waiter(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* Undo pi boosting.when necessary */
 	rt_mutex_adjust_prio(current);
@@ -1003,7 +1000,8 @@ static int __sched
 __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		    struct hrtimer_sleeper *timeout,
 		    struct rt_mutex_waiter *waiter,
-		    struct ww_acquire_ctx *ww_ctx)
+		    struct ww_acquire_ctx *ww_ctx,
+		    unsigned long *flags)
 {
 	int ret = 0;
 
@@ -1032,13 +1030,13 @@ __rt_mutex_slowlock(struct rt_mutex *lock, int state,
 				break;
 		}
 
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
 
 		debug_rt_mutex_print_deadlock(waiter);
 
 		schedule_rt_mutex(lock);
 
-		raw_spin_lock(&lock->wait_lock);
+		raw_spin_lock_irqsave(&lock->wait_lock, *flags);
 		set_current_state(state);
 	}
 
@@ -1130,18 +1128,19 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 		  int detect_deadlock, struct ww_acquire_ctx *ww_ctx)
 {
 	struct rt_mutex_waiter waiter;
+	unsigned long flags;
 	int ret = 0;
 
 	rt_mutex_init_waiter(&waiter, false);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
 	/* Try to acquire the lock again: */
 	if (try_to_take_rt_mutex(lock, current, NULL)) {
 		if (ww_ctx)
 			ww_mutex_account_lock(lock, ww_ctx);
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return 0;
 	}
 
@@ -1154,15 +1153,17 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 			timeout->task = NULL;
 	}
 
-	ret = task_blocks_on_rt_mutex(lock, &waiter, current, detect_deadlock);
+	ret = task_blocks_on_rt_mutex(lock, &waiter, current, detect_deadlock,
+			&flags);
 
 	if (likely(!ret))
-		ret = __rt_mutex_slowlock(lock, state, timeout, &waiter, ww_ctx);
+		ret = __rt_mutex_slowlock(lock, state, timeout, &waiter, ww_ctx,
+				&flags);
 
 	set_current_state(TASK_RUNNING);
 
 	if (unlikely(ret))
-		remove_waiter(lock, &waiter);
+		remove_waiter(lock, &waiter, &flags);
 	else if (ww_ctx)
 		ww_mutex_account_lock(lock, ww_ctx);
 
@@ -1172,7 +1173,7 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 	 */
 	fixup_rt_mutex_waiters(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* Remove pending timer: */
 	if (unlikely(timeout))
@@ -1189,9 +1190,10 @@ rt_mutex_slowlock(struct rt_mutex *lock, int state,
 static inline int
 rt_mutex_slowtrylock(struct rt_mutex *lock)
 {
+	unsigned long flags;
 	int ret = 0;
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
 	if (likely(rt_mutex_owner(lock) != current)) {
@@ -1204,7 +1206,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lock)
 		fixup_rt_mutex_waiters(lock);
 	}
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	return ret;
 }
@@ -1215,7 +1217,9 @@ rt_mutex_slowtrylock(struct rt_mutex *lock)
 static void __sched
 rt_mutex_slowunlock(struct rt_mutex *lock)
 {
-	raw_spin_lock(&lock->wait_lock);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
 	debug_rt_mutex_unlock(lock);
 
@@ -1223,13 +1227,13 @@ rt_mutex_slowunlock(struct rt_mutex *lock)
 
 	if (!rt_mutex_has_waiters(lock)) {
 		lock->owner = NULL;
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return;
 	}
 
 	wakeup_next_waiter(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* Undo pi boosting if necessary: */
 	rt_mutex_adjust_prio(current);
@@ -1487,12 +1491,13 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 			      struct rt_mutex_waiter *waiter,
 			      struct task_struct *task, int detect_deadlock)
 {
+	unsigned long flags;
 	int ret;
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
 	if (try_to_take_rt_mutex(lock, task, NULL)) {
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return 1;
 	}
 
@@ -1515,18 +1520,18 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 	 * PI_REQUEUE_INPROGRESS, so that if the task is waking up
 	 * it will know that we are in the process of requeuing it.
 	 */
-	raw_spin_lock_irq(&task->pi_lock);
+	raw_spin_lock(&task->pi_lock);
 	if (task->pi_blocked_on) {
-		raw_spin_unlock_irq(&task->pi_lock);
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock(&task->pi_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return -EAGAIN;
 	}
 	task->pi_blocked_on = PI_REQUEUE_INPROGRESS;
-	raw_spin_unlock_irq(&task->pi_lock);
+	raw_spin_unlock(&task->pi_lock);
 #endif
 
-	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
-
+	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock,
+			&flags);
 	if (ret && !rt_mutex_owner(lock)) {
 		/*
 		 * Reset the return value. We might have
@@ -1538,9 +1543,9 @@ int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
 	}
 
 	if (unlikely(ret))
-		remove_waiter(lock, waiter);
+		remove_waiter(lock, waiter, &flags);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	debug_rt_mutex_print_deadlock(waiter);
 
@@ -1589,17 +1594,18 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 			       int detect_deadlock)
 {
 	int ret;
+	unsigned long flags;
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
 	set_current_state(TASK_INTERRUPTIBLE);
 
-	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter, NULL);
-
+	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter, NULL,
+			&flags);
 	set_current_state(TASK_RUNNING);
 
 	if (unlikely(ret))
-		remove_waiter(lock, waiter);
+		remove_waiter(lock, waiter, &flags);
 
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit unconditionally. We might
@@ -1607,7 +1613,7 @@ int rt_mutex_finish_proxy_lock(struct rt_mutex *lock,
 	 */
 	fixup_rt_mutex_waiters(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	return ret;
 }
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-15 16:30                 ` [PATCH] rtmutex: take the waiter lock with irqs off Sebastian Andrzej Siewior
@ 2013-11-15 20:14                   ` Sebastian Andrzej Siewior
  2013-11-18 14:10                     ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-11-15 20:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Frederic Weisbecker, Peter Zijlstra, LKML, RT,
	Paul E. McKenney

Mike Galbraith captured the following:
| >#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
| >#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
| >#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
| >#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
| >#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
| >#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
| >#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
| >--- <IRQ stack> ---
| >#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
| >    [exception RIP: task_blocks_on_rt_mutex+51]
| >#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
| >#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
| >#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
| >#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
| >#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
| >#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c

lock_timer_base() does a try_lock() which deadlocks on the waiter lock,
not the lock itself.
This patch makes sure all users of the waiter_lock take the lock with
interrupts off so a try_lock from irq context is possible.

Cc: stable-rt@vger.kernel.org
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
v1…v2: rt_spin_lock_slowlock() and rt_mutex_slowlock() may be called very
early during system boot with irqs off if the fast path can't be taken.
This is the case if lockdep is on, because it switches the cmpxchg off.
Instead of always using the _irqsave() variant, I now save the caller's
flags and restore them if the first trylock attempt succeeds. If that
attempt fails (the lock is taken), the code will BUG() if interrupts
were disabled.
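
In code, the entry of rt_spin_lock_slowlock() now looks like this
(condensed from the hunk below; rt_mutex_slowlock() gets the same
treatment):

	unsigned long flags;

	raw_local_save_flags(flags);
	raw_spin_lock_irq(&lock->wait_lock);
	init_lists(lock);

	if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
		/* fast exit: restore whatever irq state the caller had */
		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
		return;
	}
	BUG_ON(rt_mutex_owner(lock) == self);
	/* we are about to block, so irqs must have been enabled on entry */
	BUG_ON(arch_irqs_disabled_flags(flags));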
    
 kernel/futex.c   |   16 +++---
 kernel/rtmutex.c |  130 +++++++++++++++++++++++++++----------------------------
 2 files changed, 74 insertions(+), 72 deletions(-)

--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -891,7 +891,7 @@ static int wake_futex_pi(u32 __user *uad
 	if (pi_state->owner != current)
 		return -EINVAL;
 
-	raw_spin_lock(&pi_state->pi_mutex.wait_lock);
+	raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
 	new_owner = rt_mutex_next_owner(&pi_state->pi_mutex);
 
 	/*
@@ -917,21 +917,21 @@ static int wake_futex_pi(u32 __user *uad
 		else if (curval != uval)
 			ret = -EINVAL;
 		if (ret) {
-			raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
+			raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
 			return ret;
 		}
 	}
 
-	raw_spin_lock_irq(&pi_state->owner->pi_lock);
+	raw_spin_lock(&pi_state->owner->pi_lock);
 	WARN_ON(list_empty(&pi_state->list));
 	list_del_init(&pi_state->list);
-	raw_spin_unlock_irq(&pi_state->owner->pi_lock);
+	raw_spin_unlock(&pi_state->owner->pi_lock);
 
-	raw_spin_lock_irq(&new_owner->pi_lock);
+	raw_spin_lock(&new_owner->pi_lock);
 	WARN_ON(!list_empty(&pi_state->list));
 	list_add(&pi_state->list, &new_owner->pi_state_list);
 	pi_state->owner = new_owner;
-	raw_spin_unlock_irq(&new_owner->pi_lock);
+	raw_spin_unlock(&new_owner->pi_lock);
 
 	raw_spin_unlock(&pi_state->pi_mutex.wait_lock);
 	rt_mutex_unlock(&pi_state->pi_mutex);
@@ -1762,11 +1762,11 @@ static int fixup_owner(u32 __user *uaddr
 		 * we returned due to timeout or signal without taking the
 		 * rt_mutex. Too late.
 		 */
-		raw_spin_lock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
 		owner = rt_mutex_owner(&q->pi_state->pi_mutex);
 		if (!owner)
 			owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-		raw_spin_unlock(&q->pi_state->pi_mutex.wait_lock);
+		raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
 		ret = fixup_pi_state_owner(uaddr, q, owner);
 		goto out;
 	}
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -298,7 +298,7 @@ static int rt_mutex_adjust_prio_chain(st
 	plist_add(&waiter->list_entry, &lock->wait_list);
 
 	/* Release the task */
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	raw_spin_unlock(&task->pi_lock);
 	if (!rt_mutex_owner(lock)) {
 		struct rt_mutex_waiter *lock_top_waiter;
 
@@ -309,7 +309,7 @@ static int rt_mutex_adjust_prio_chain(st
 		lock_top_waiter = rt_mutex_top_waiter(lock);
 		if (top_waiter != lock_top_waiter)
 			rt_mutex_wake_waiter(lock_top_waiter);
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		goto out_put_task;
 	}
 	put_task_struct(task);
@@ -317,7 +317,7 @@ static int rt_mutex_adjust_prio_chain(st
 	/* Grab the next task */
 	task = rt_mutex_owner(lock);
 	get_task_struct(task);
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	raw_spin_lock(&task->pi_lock);
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
 		/* Boost the owner */
@@ -335,10 +335,10 @@ static int rt_mutex_adjust_prio_chain(st
 		__rt_mutex_adjust_prio(task);
 	}
 
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	raw_spin_unlock(&task->pi_lock);
 
 	top_waiter = rt_mutex_top_waiter(lock);
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	if (!detect_deadlock && waiter != top_waiter)
 		goto out_put_task;
@@ -425,10 +425,9 @@ static int
 	/* We got the lock. */
 
 	if (waiter || rt_mutex_has_waiters(lock)) {
-		unsigned long flags;
 		struct rt_mutex_waiter *top;
 
-		raw_spin_lock_irqsave(&task->pi_lock, flags);
+		raw_spin_lock(&task->pi_lock);
 
 		/* remove the queued waiter. */
 		if (waiter) {
@@ -445,7 +444,7 @@ static int
 			top->pi_list_entry.prio = top->list_entry.prio;
 			plist_add(&top->pi_list_entry, &task->pi_waiters);
 		}
-		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+		raw_spin_unlock(&task->pi_lock);
 	}
 
 	debug_rt_mutex_lock(lock);
@@ -478,10 +477,9 @@ static int task_blocks_on_rt_mutex(struc
 {
 	struct task_struct *owner = rt_mutex_owner(lock);
 	struct rt_mutex_waiter *top_waiter = waiter;
-	unsigned long flags;
 	int chain_walk = 0, res;
 
-	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	raw_spin_lock(&task->pi_lock);
 
 	/*
 	 * In the case of futex requeue PI, this will be a proxy
@@ -493,7 +491,7 @@ static int task_blocks_on_rt_mutex(struc
 	 * the task if PI_WAKEUP_INPROGRESS is set.
 	 */
 	if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) {
-		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+		raw_spin_unlock(&task->pi_lock);
 		return -EAGAIN;
 	}
 
@@ -512,20 +510,20 @@ static int task_blocks_on_rt_mutex(struc
 
 	task->pi_blocked_on = waiter;
 
-	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+	raw_spin_unlock(&task->pi_lock);
 
 	if (!owner)
 		return 0;
 
 	if (waiter == rt_mutex_top_waiter(lock)) {
-		raw_spin_lock_irqsave(&owner->pi_lock, flags);
+		raw_spin_lock(&owner->pi_lock);
 		plist_del(&top_waiter->pi_list_entry, &owner->pi_waiters);
 		plist_add(&waiter->pi_list_entry, &owner->pi_waiters);
 
 		__rt_mutex_adjust_prio(owner);
 		if (rt_mutex_real_waiter(owner->pi_blocked_on))
 			chain_walk = 1;
-		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
+		raw_spin_unlock(&owner->pi_lock);
 	}
 	else if (debug_rt_mutex_detect_deadlock(waiter, detect_deadlock))
 		chain_walk = 1;
@@ -540,12 +538,12 @@ static int task_blocks_on_rt_mutex(struc
 	 */
 	get_task_struct(owner);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irq(&lock->wait_lock);
 
 	res = rt_mutex_adjust_prio_chain(owner, detect_deadlock, lock, waiter,
 					 task);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irq(&lock->wait_lock);
 
 	return res;
 }
@@ -560,9 +558,8 @@ static int task_blocks_on_rt_mutex(struc
 static void wakeup_next_waiter(struct rt_mutex *lock)
 {
 	struct rt_mutex_waiter *waiter;
-	unsigned long flags;
 
-	raw_spin_lock_irqsave(&current->pi_lock, flags);
+	raw_spin_lock(&current->pi_lock);
 
 	waiter = rt_mutex_top_waiter(lock);
 
@@ -576,7 +573,7 @@ static void wakeup_next_waiter(struct rt
 
 	rt_mutex_set_owner(lock, NULL);
 
-	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
+	raw_spin_unlock(&current->pi_lock);
 
 	rt_mutex_wake_waiter(waiter);
 }
@@ -592,20 +589,19 @@ static void remove_waiter(struct rt_mute
 {
 	int first = (waiter == rt_mutex_top_waiter(lock));
 	struct task_struct *owner = rt_mutex_owner(lock);
-	unsigned long flags;
 	int chain_walk = 0;
 
-	raw_spin_lock_irqsave(&current->pi_lock, flags);
+	raw_spin_lock(&current->pi_lock);
 	plist_del(&waiter->list_entry, &lock->wait_list);
 	current->pi_blocked_on = NULL;
-	raw_spin_unlock_irqrestore(&current->pi_lock, flags);
+	raw_spin_unlock(&current->pi_lock);
 
 	if (!owner)
 		return;
 
 	if (first) {
 
-		raw_spin_lock_irqsave(&owner->pi_lock, flags);
+		raw_spin_lock(&owner->pi_lock);
 
 		plist_del(&waiter->pi_list_entry, &owner->pi_waiters);
 
@@ -620,7 +616,7 @@ static void remove_waiter(struct rt_mute
 		if (rt_mutex_real_waiter(owner->pi_blocked_on))
 			chain_walk = 1;
 
-		raw_spin_unlock_irqrestore(&owner->pi_lock, flags);
+		raw_spin_unlock(&owner->pi_lock);
 	}
 
 	WARN_ON(!plist_node_empty(&waiter->pi_list_entry));
@@ -631,11 +627,11 @@ static void remove_waiter(struct rt_mute
 	/* gets dropped in rt_mutex_adjust_prio_chain()! */
 	get_task_struct(owner);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irq(&lock->wait_lock);
 
 	rt_mutex_adjust_prio_chain(owner, 0, lock, NULL, current);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irq(&lock->wait_lock);
 }
 
 /*
@@ -723,9 +719,6 @@ static int adaptive_wait(struct rt_mutex
 }
 #endif
 
-# define pi_lock(lock)			raw_spin_lock_irq(lock)
-# define pi_unlock(lock)		raw_spin_unlock_irq(lock)
-
 /*
  * Slow path lock function spin_lock style: this variant is very
  * careful not to miss any non-lock wakeups.
@@ -737,19 +730,22 @@ static void  noinline __sched rt_spin_lo
 {
 	struct task_struct *lock_owner, *self = current;
 	struct rt_mutex_waiter waiter, *top_waiter;
+	unsigned long flags;
 	int ret;
 
 	rt_mutex_init_waiter(&waiter, true);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_local_save_flags(flags);
+	raw_spin_lock_irq(&lock->wait_lock);
 	init_lists(lock);
 
 	if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return;
 	}
 
 	BUG_ON(rt_mutex_owner(lock) == self);
+	BUG_ON(arch_irqs_disabled_flags(flags));
 
 	/*
 	 * We save whatever state the task is in and we'll restore it
@@ -757,10 +753,10 @@ static void  noinline __sched rt_spin_lo
 	 * as well. We are serialized via pi_lock against wakeups. See
 	 * try_to_wake_up().
 	 */
-	pi_lock(&self->pi_lock);
+	raw_spin_lock(&self->pi_lock);
 	self->saved_state = self->state;
 	__set_current_state(TASK_UNINTERRUPTIBLE);
-	pi_unlock(&self->pi_lock);
+	raw_spin_unlock(&self->pi_lock);
 
 	ret = task_blocks_on_rt_mutex(lock, &waiter, self, 0);
 	BUG_ON(ret);
@@ -773,18 +769,18 @@ static void  noinline __sched rt_spin_lo
 		top_waiter = rt_mutex_top_waiter(lock);
 		lock_owner = rt_mutex_owner(lock);
 
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irq(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(&waiter);
 
 		if (top_waiter != &waiter || adaptive_wait(lock, lock_owner))
 			schedule_rt_mutex(lock);
 
-		raw_spin_lock(&lock->wait_lock);
+		raw_spin_lock_irq(&lock->wait_lock);
 
-		pi_lock(&self->pi_lock);
+		raw_spin_lock(&self->pi_lock);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
-		pi_unlock(&self->pi_lock);
+		raw_spin_unlock(&self->pi_lock);
 	}
 
 	/*
@@ -794,10 +790,10 @@ static void  noinline __sched rt_spin_lo
 	 * happened while we were blocked. Clear saved_state so
 	 * try_to_wakeup() does not get confused.
 	 */
-	pi_lock(&self->pi_lock);
+	raw_spin_lock(&self->pi_lock);
 	__set_current_state(self->saved_state);
 	self->saved_state = TASK_RUNNING;
-	pi_unlock(&self->pi_lock);
+	raw_spin_unlock(&self->pi_lock);
 
 	/*
 	 * try_to_take_rt_mutex() sets the waiter bit
@@ -808,7 +804,7 @@ static void  noinline __sched rt_spin_lo
 	BUG_ON(rt_mutex_has_waiters(lock) && &waiter == rt_mutex_top_waiter(lock));
 	BUG_ON(!plist_node_empty(&waiter.list_entry));
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irq(&lock->wait_lock);
 
 	debug_rt_mutex_free_waiter(&waiter);
 }
@@ -818,7 +814,9 @@ static void  noinline __sched rt_spin_lo
  */
 static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
 {
-	raw_spin_lock(&lock->wait_lock);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
 	debug_rt_mutex_unlock(lock);
 
@@ -826,13 +824,13 @@ static void  noinline __sched rt_spin_lo
 
 	if (!rt_mutex_has_waiters(lock)) {
 		lock->owner = NULL;
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return;
 	}
 
 	wakeup_next_waiter(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* Undo pi boosting.when necessary */
 	rt_mutex_adjust_prio(current);
@@ -1037,13 +1035,13 @@ static int __sched
 				break;
 		}
 
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irq(&lock->wait_lock);
 
 		debug_rt_mutex_print_deadlock(waiter);
 
 		schedule_rt_mutex(lock);
 
-		raw_spin_lock(&lock->wait_lock);
+		raw_spin_lock_irq(&lock->wait_lock);
 		set_current_state(state);
 	}
 
@@ -1135,20 +1133,23 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 		  int detect_deadlock, struct ww_acquire_ctx *ww_ctx)
 {
 	struct rt_mutex_waiter waiter;
+	unsigned long flags;
 	int ret = 0;
 
 	rt_mutex_init_waiter(&waiter, false);
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_local_save_flags(flags);
+	raw_spin_lock_irq(&lock->wait_lock);
 	init_lists(lock);
 
 	/* Try to acquire the lock again: */
 	if (try_to_take_rt_mutex(lock, current, NULL)) {
 		if (ww_ctx)
 			ww_mutex_account_lock(lock, ww_ctx);
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return 0;
 	}
+	BUG_ON(arch_irqs_disabled_flags(flags));
 
 	set_current_state(state);
 
@@ -1177,7 +1178,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 	 */
 	fixup_rt_mutex_waiters(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irq(&lock->wait_lock);
 
 	/* Remove pending timer: */
 	if (unlikely(timeout))
@@ -1194,9 +1195,10 @@ rt_mutex_slowlock(struct rt_mutex *lock,
 static inline int
 rt_mutex_slowtrylock(struct rt_mutex *lock)
 {
+	unsigned long flags;
 	int ret = 0;
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	init_lists(lock);
 
 	if (likely(rt_mutex_owner(lock) != current)) {
@@ -1209,7 +1211,7 @@ rt_mutex_slowtrylock(struct rt_mutex *lo
 		fixup_rt_mutex_waiters(lock);
 	}
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	return ret;
 }
@@ -1220,7 +1222,9 @@ rt_mutex_slowtrylock(struct rt_mutex *lo
 static void __sched
 rt_mutex_slowunlock(struct rt_mutex *lock)
 {
-	raw_spin_lock(&lock->wait_lock);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&lock->wait_lock, flags);
 
 	debug_rt_mutex_unlock(lock);
 
@@ -1228,13 +1232,13 @@ rt_mutex_slowunlock(struct rt_mutex *loc
 
 	if (!rt_mutex_has_waiters(lock)) {
 		lock->owner = NULL;
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 		return;
 	}
 
 	wakeup_next_waiter(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
 	/* Undo pi boosting if necessary: */
 	rt_mutex_adjust_prio(current);
@@ -1494,10 +1498,10 @@ int rt_mutex_start_proxy_lock(struct rt_
 {
 	int ret;
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irq(&lock->wait_lock);
 
 	if (try_to_take_rt_mutex(lock, task, NULL)) {
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock_irq(&lock->wait_lock);
 		return 1;
 	}
 
@@ -1520,18 +1524,17 @@ int rt_mutex_start_proxy_lock(struct rt_
 	 * PI_REQUEUE_INPROGRESS, so that if the task is waking up
 	 * it will know that we are in the process of requeuing it.
 	 */
-	raw_spin_lock_irq(&task->pi_lock);
+	raw_spin_lock(&task->pi_lock);
 	if (task->pi_blocked_on) {
-		raw_spin_unlock_irq(&task->pi_lock);
-		raw_spin_unlock(&lock->wait_lock);
+		raw_spin_unlock(&task->pi_lock);
+		raw_spin_unlock_irq(&lock->wait_lock);
 		return -EAGAIN;
 	}
 	task->pi_blocked_on = PI_REQUEUE_INPROGRESS;
-	raw_spin_unlock_irq(&task->pi_lock);
+	raw_spin_unlock(&task->pi_lock);
 #endif
 
 	ret = task_blocks_on_rt_mutex(lock, waiter, task, detect_deadlock);
-
 	if (ret && !rt_mutex_owner(lock)) {
 		/*
 		 * Reset the return value. We might have
@@ -1545,7 +1548,7 @@ int rt_mutex_start_proxy_lock(struct rt_
 	if (unlikely(ret))
 		remove_waiter(lock, waiter);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irq(&lock->wait_lock);
 
 	debug_rt_mutex_print_deadlock(waiter);
 
@@ -1595,12 +1598,11 @@ int rt_mutex_finish_proxy_lock(struct rt
 {
 	int ret;
 
-	raw_spin_lock(&lock->wait_lock);
+	raw_spin_lock_irq(&lock->wait_lock);
 
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	ret = __rt_mutex_slowlock(lock, TASK_INTERRUPTIBLE, to, waiter, NULL);
-
 	set_current_state(TASK_RUNNING);
 
 	if (unlikely(ret))
@@ -1612,7 +1614,7 @@ int rt_mutex_finish_proxy_lock(struct rt
 	 */
 	fixup_rt_mutex_waiters(lock);
 
-	raw_spin_unlock(&lock->wait_lock);
+	raw_spin_unlock_irq(&lock->wait_lock);
 
 	return ret;
 }

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-15 20:14                   ` [PATCH v2] " Sebastian Andrzej Siewior
@ 2013-11-18 14:10                     ` Peter Zijlstra
  2013-11-18 17:56                       ` Peter Zijlstra
                                         ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-18 14:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, Mike Galbraith, Frederic Weisbecker, LKML, RT,
	Paul E. McKenney

On Fri, Nov 15, 2013 at 09:14:36PM +0100, Sebastian Andrzej Siewior wrote:
> Mike Galbraith captured the following:
> | >#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
> | >#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
> | >#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
> | >#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
> | >#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
> | >#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
> | >#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
> | >--- <IRQ stack> ---
> | >#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
> | >    [exception RIP: task_blocks_on_rt_mutex+51]
> | >#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
> | >#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
> | >#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
> | >#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
> | >#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
> | >#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c
> 
> lock_timer_base() does a try_lock() which deadlocks on the waiter lock,
> not the lock itself.
> This patch makes sure all users of the waiter_lock take the lock with
> interrupts off so a try_lock from irq context is possible.

It's get_next_timer_interrupt() that does a trylock(), and only for
PREEMPT_RT_FULL.

Also; on IRC you said:

  "<bigeasy> I'm currently not sure if we should do
the _irq() lock or a trylock for the wait_lock in
rt_mutex_slowtrylock()"

Which I misread and dismissed -- but yes that might actually work too
and would be a much smaller patch. You'd only need trylock and unlock.
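
A sketch of that smaller variant (essentially what gets posted further
down in this thread):

	static inline int
	rt_mutex_slowtrylock(struct rt_mutex *lock)
	{
		int ret = 0;

		/* bail out instead of spinning if the waiter lock is held */
		if (!raw_spin_trylock(&lock->wait_lock))
			return ret;

		/* ... try_to_take_rt_mutex() as before ... */

		raw_spin_unlock(&lock->wait_lock);
		return ret;
	}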

That said, allowing such usage from actual IRQ context is iffy; suppose
the trylock succeeds, who then is the lock owner?

I suppose it would be whatever task we interrupted and boosting will
'work' because we're non-preemptable, but still *YUCK*.


That said; the reason I looked at this is that lockdep didn't catch it.
This turns out to be because in irq_exit():

	void irq_exit(void)
	{
	#ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
		local_irq_disable();
	#else
		WARN_ON_ONCE(!irqs_disabled());
	#endif

		account_irq_exit_time(current);
		trace_hardirq_exit();
		sub_preempt_count(HARDIRQ_OFFSET);
		if (!in_interrupt() && local_softirq_pending())
			invoke_softirq();

		tick_irq_exit();
		rcu_irq_exit();
	}

We call trace_hardirq_exit() before tick_irq_exit(), so lockdep doesn't
see the offending raw_spin_lock(&->wait_lock) as happening from IRQ
context.
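
Annotated, the ordering in the irq_exit() above is:

	account_irq_exit_time(current);
	trace_hardirq_exit();		/* lockdep: hardirq context "ends" */
	sub_preempt_count(HARDIRQ_OFFSET);
	...
	tick_irq_exit();		/* still runs in hardirq context and
					 * may take ->wait_lock, but lockdep
					 * no longer sees it as hardirq */
	rcu_irq_exit();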

So I tried the little hack below to try and catch it, but no luck so
far. I suppose with regular NOHZ the tick_irq_exit() condition:

	static inline void tick_irq_exit(void)
	{
	#ifdef CONFIG_NO_HZ_COMMON
		int cpu = smp_processor_id();

		/* Make sure that timer wheel updates are propagated */
		if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
			if (!in_interrupt())
				tick_nohz_irq_exit();
		}
	#endif
	}

is rather uncommon; maybe I should let the box run for a bit and see if it
triggers.

Ugleh problem allround.

Also, I'm not sure if this patch was supposed to be an 'upstream' patch
-- $SUBJECT seems to suggest so, but note that it will not apply to
anything recent.

---


--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -746,13 +746,23 @@ void irq_exit(void)
 #endif
 
 	account_irq_exit_time(current);
-	trace_hardirq_exit();
 	sub_preempt_count(HARDIRQ_OFFSET);
-	if (!in_interrupt() && local_softirq_pending())
+	if (!in_interrupt() && local_softirq_pending()) {
+		/*
+		 * Temp. disable hardirq context so as not to confuse lockdep;
+		 * otherwise it might think we're running softirq handler from
+		 * hardirq context.
+		 *
+		 * Should probably sort this someplace else..
+		 */
+		trace_hardirq_exit();
 		invoke_softirq();
+		trace_hardirq_enter();
+	}
 
 	tick_irq_exit();
 	rcu_irq_exit();
+	trace_hardirq_exit();
 }
 
 void raise_softirq(unsigned int nr)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-18 14:10                     ` Peter Zijlstra
@ 2013-11-18 17:56                       ` Peter Zijlstra
  2013-11-18 23:59                       ` Frederic Weisbecker
  2013-11-22 13:59                       ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-18 17:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, Mike Galbraith, Frederic Weisbecker, LKML, RT,
	Paul E. McKenney

On Mon, Nov 18, 2013 at 03:10:21PM +0100, Peter Zijlstra wrote:
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -746,13 +746,23 @@ void irq_exit(void)
>  #endif
>  
>  	account_irq_exit_time(current);
> -	trace_hardirq_exit();
>  	sub_preempt_count(HARDIRQ_OFFSET);
> -	if (!in_interrupt() && local_softirq_pending())
> +	if (!in_interrupt() && local_softirq_pending()) {
> +		/*
> +		 * Temp. disable hardirq context so as not to confuse lockdep;
> +		 * otherwise it might think we're running softirq handler from
> +		 * hardirq context.
> +		 *
> +		 * Should probably sort this someplace else..
> +		 */
> +		trace_hardirq_exit();
>  		invoke_softirq();
> +		trace_hardirq_enter();
> +	}
>  
>  	tick_irq_exit();
>  	rcu_irq_exit();
> +	trace_hardirq_exit();
>  }
>  
>  void raise_softirq(unsigned int nr)

*SPLAT*

---
[ 7794.620512] =================================
[ 7794.620513] [ INFO: inconsistent lock state ]
[ 7794.620515] 3.12.0-rt2-dirty #585 Not tainted
[ 7794.620517] ---------------------------------
[ 7794.620518] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[ 7794.620520] swapper/0/0 [HC1[0]:SC0[0]:HE0:SE1] takes:
[ 7794.620527]  (&(&(&base->lock)->lock)->wait_lock){?.+...}, at: [<ffffffff811036af>] rt_mutex_slowtrylock+0xf/0x80
[ 7794.620528] {HARDIRQ-ON-W} state was registered at:
[ 7794.620532]   [<ffffffff810fb58c>] __lock_acquire+0x64c/0x1ec0
[ 7794.620534]   [<ffffffff810fd430>] lock_acquire+0x90/0x150
[ 7794.620537]   [<ffffffff8166079b>] _raw_spin_lock+0x3b/0x50
[ 7794.620540]   [<ffffffff8165f273>] rt_spin_lock_slowlock+0x33/0x260
[ 7794.620542]   [<ffffffff81660009>] rt_spin_lock+0x69/0x70
[ 7794.620546]   [<ffffffff8109f48a>] run_timer_softirq+0x4a/0x2f0
[ 7794.620548]   [<ffffffff81096441>] do_current_softirqs+0x231/0x460
[ 7794.620550]   [<ffffffff810966a8>] run_ksoftirqd+0x38/0x60
[ 7794.620553]   [<ffffffff810c0fec>] smpboot_thread_fn+0x22c/0x350
[ 7794.620555]   [<ffffffff810b7e0d>] kthread+0xcd/0xe0
[ 7794.620558]   [<ffffffff816687ec>] ret_from_fork+0x7c/0xb0
[ 7794.620559] irq event stamp: 15216954
[ 7794.620562] hardirqs last  enabled at (15216953): [<ffffffff81508f67>] cpuidle_enter_state+0x67/0xf0
[ 7794.620564] hardirqs last disabled at (15216954): [<ffffffff8166106a>] common_interrupt+0x6a/0x6f
[ 7794.620565] softirqs last  enabled at (0): [<          (null)>]           (null)
[ 7794.620566] softirqs last disabled at (0): [<          (null)>]           (null)
[ 7794.620566] 
[ 7794.620566] other info that might help us debug this:
[ 7794.620566]  Possible unsafe locking scenario:
[ 7794.620566] 
[ 7794.620567]        CPU0
[ 7794.620567]        ----
[ 7794.620568]   lock(&(&(&base->lock)->lock)->wait_lock);
[ 7794.620568]   <Interrupt>
[ 7794.620569]     lock(&(&(&base->lock)->lock)->wait_lock);
[ 7794.620569] 
[ 7794.620569]  *** DEADLOCK ***
[ 7794.620569] 
[ 7794.620570] no locks held by swapper/0/0.
[ 7794.620570] 
[ 7794.620570] stack backtrace:
[ 7794.620572] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.0-rt2-dirty #585
[ 7794.620573] Hardware name: Supermicro X8DTN/X8DTN, BIOS 4.6.3 01/08/2010
[ 7794.620577]  ffffffff820fb6f0 ffff880237c03bf8 ffffffff8165aca2 ffffffff81c164c0
[ 7794.620579]  ffff880237c03c48 ffffffff816569d9 0000000000000000 ffffffff00000000
[ 7794.620581]  ffff880200000001 0000000000000002 ffffffff81c164c0 ffffffff810f9150
[ 7794.620582] Call Trace:
[ 7794.620585]  <IRQ>  [<ffffffff8165aca2>] dump_stack+0x4e/0x8f
[ 7794.620588]  [<ffffffff816569d9>] print_usage_bug+0x1f2/0x203
[ 7794.620591]  [<ffffffff810f9150>] ? check_usage_backwards+0x130/0x130
[ 7794.620596]  [<ffffffff810f9ced>] mark_lock+0x2ad/0x320
[ 7794.620598]  [<ffffffff810fb7ba>] __lock_acquire+0x87a/0x1ec0
[ 7794.620600]  [<ffffffff810fb34f>] ? __lock_acquire+0x40f/0x1ec0
[ 7794.620601]  [<ffffffff810fb34f>] ? __lock_acquire+0x40f/0x1ec0
[ 7794.620604]  [<ffffffff810fd430>] lock_acquire+0x90/0x150
[ 7794.620605]  [<ffffffff811036af>] ? rt_mutex_slowtrylock+0xf/0x80
[ 7794.620607]  [<ffffffff8166079b>] _raw_spin_lock+0x3b/0x50
[ 7794.620609]  [<ffffffff811036af>] ? rt_mutex_slowtrylock+0xf/0x80
[ 7794.620610]  [<ffffffff811036af>] rt_mutex_slowtrylock+0xf/0x80
[ 7794.620612]  [<ffffffff8165f0aa>] rt_mutex_trylock+0x2a/0x30
[ 7794.620614]  [<ffffffff8165fe36>] rt_spin_trylock+0x16/0x50
[ 7794.620616]  [<ffffffff810a0091>] get_next_timer_interrupt+0x51/0x290
[ 7794.620618]  [<ffffffff810501c4>] ? native_sched_clock+0x24/0x80
[ 7794.620620]  [<ffffffff810f6115>] __tick_nohz_idle_enter+0x305/0x4a0
[ 7794.620622]  [<ffffffff810501c4>] ? native_sched_clock+0x24/0x80
[ 7794.620624]  [<ffffffff810f6854>] tick_nohz_irq_exit+0x34/0x40
[ 7794.620626]  [<ffffffff8109709d>] irq_exit+0x10d/0x140
[ 7794.620628]  [<ffffffff8166a793>] do_IRQ+0x63/0xd0
[ 7794.620629]  [<ffffffff8166106f>] common_interrupt+0x6f/0x6f
[ 7794.620632]  <EOI>  [<ffffffff81508f6b>] ? cpuidle_enter_state+0x6b/0xf0
[ 7794.620634]  [<ffffffff815090f6>] cpuidle_idle_call+0x106/0x2b0
[ 7794.620636]  [<ffffffff810519de>] arch_cpu_idle+0xe/0x30
[ 7794.620639]  [<ffffffff810e44b8>] cpu_startup_entry+0x298/0x310
[ 7794.620642]  [<ffffffff8164b683>] rest_init+0xc3/0xd0
[ 7794.620644]  [<ffffffff8164b5c5>] ? rest_init+0x5/0xd0
[ 7794.620648]  [<ffffffff81cece74>] start_kernel+0x3dd/0x3ea
[ 7794.620650]  [<ffffffff81cec89f>] ? repair_env_string+0x5e/0x5e
[ 7794.620652]  [<ffffffff81cec5a5>] x86_64_start_reservations+0x2a/0x2c
[ 7794.620653]  [<ffffffff81cec6a2>] x86_64_start_kernel+0xfb/0xfe

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-18 14:10                     ` Peter Zijlstra
  2013-11-18 17:56                       ` Peter Zijlstra
@ 2013-11-18 23:59                       ` Frederic Weisbecker
  2013-11-19  8:30                         ` Peter Zijlstra
  2013-11-22 13:59                       ` Sebastian Andrzej Siewior
  2 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2013-11-18 23:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith, LKML,
	RT, Paul E. McKenney

On Mon, Nov 18, 2013 at 03:10:21PM +0100, Peter Zijlstra wrote:
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -746,13 +746,23 @@ void irq_exit(void)
>  #endif
>  
>  	account_irq_exit_time(current);
> -	trace_hardirq_exit();
>  	sub_preempt_count(HARDIRQ_OFFSET);
> -	if (!in_interrupt() && local_softirq_pending())
> +	if (!in_interrupt() && local_softirq_pending()) {
> +		/*
> +		 * Temp. disable hardirq context so as not to confuse lockdep;
> +		 * otherwise it might think we're running softirq handler from
> +		 * hardirq context.
> +		 *
> +		 * Should probably sort this someplace else..
> +		 */
> +		trace_hardirq_exit();
>  		invoke_softirq();
> +		trace_hardirq_enter();
> +	}
>  
>  	tick_irq_exit();
>  	rcu_irq_exit();
> +	trace_hardirq_exit();

Looks like a change we really want!

Thanks!

>  }
>  
>  void raise_softirq(unsigned int nr)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-18 23:59                       ` Frederic Weisbecker
@ 2013-11-19  8:30                         ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-19  8:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Sebastian Andrzej Siewior, Thomas Gleixner, Mike Galbraith, LKML,
	RT, Paul E. McKenney

On Tue, Nov 19, 2013 at 12:59:59AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 18, 2013 at 03:10:21PM +0100, Peter Zijlstra wrote:
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -746,13 +746,23 @@ void irq_exit(void)
> >  #endif
> >  
> >  	account_irq_exit_time(current);
> > -	trace_hardirq_exit();
> >  	sub_preempt_count(HARDIRQ_OFFSET);
> > -	if (!in_interrupt() && local_softirq_pending())
> > +	if (!in_interrupt() && local_softirq_pending()) {
> > +		/*
> > +		 * Temp. disable hardirq context so as not to confuse lockdep;
> > +		 * otherwise it might think we're running softirq handler from
> > +		 * hardirq context.
> > +		 *
> > +		 * Should probably sort this someplace else..
> > +		 */
> > +		trace_hardirq_exit();
> >  		invoke_softirq();
> > +		trace_hardirq_enter();
> > +	}
> >  
> >  	tick_irq_exit();
> >  	rcu_irq_exit();
> > +	trace_hardirq_exit();
> 
> Looks like a change we really want!

Yeah, let me stare at the softirq mess again to see if there's anything
better/prettier we can do.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-18 14:10                     ` Peter Zijlstra
  2013-11-18 17:56                       ` Peter Zijlstra
  2013-11-18 23:59                       ` Frederic Weisbecker
@ 2013-11-22 13:59                       ` Sebastian Andrzej Siewior
  2013-11-22 16:08                         ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-11-22 13:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Mike Galbraith, Frederic Weisbecker, LKML, RT,
	Paul E. McKenney

* Peter Zijlstra | 2013-11-18 15:10:21 [+0100]:

>Its get_next_timer_interrupt() that does a trylock() and only for
>PREEMPT_RT_FULL.
>
>Also; on IRC you said:
>
>  "<bigeasy> I'm currently not sure if we should do
>the _irq() lock or a trylock for the wait_lock in
>rt_mutex_slowtrylock()"
>
>Which I misread and dismissed -- but yes that might actually work too
>and would be a much smaller patch. You'd only need trylock and unlock.

Yes, and the patch is at the end of the mail. I also added your "lockdep:
Correctly annotate hardirq context in irq_exit()" to the -RT queue and
saw the splat. After the trylock replacement I needed something similar
for the unlock path _or_ lockdep would complain. So unless there is a
better way…

>That said, allowing such usage from actual IRQ context is iffy; suppose
>the trylock succeeds, who then is the lock owner?
>
>I suppose it would be whatever task we interrupted and boosting will
>'work' because we're non-preemptable, but still *YUCK*.

You are right, this ain't pretty. I hope this timer mess can be replaced
with something else so the whole trylock thingy can be removed.

Sebastian

--- a/include/linux/spinlock_rt.h
+++ b/include/linux/spinlock_rt.h
@@ -22,6 +22,7 @@ extern void __lockfunc rt_spin_lock(spin
 extern unsigned long __lockfunc rt_spin_lock_trace_flags(spinlock_t *lock);
 extern void __lockfunc rt_spin_lock_nested(spinlock_t *lock, int subclass);
 extern void __lockfunc rt_spin_unlock(spinlock_t *lock);
+extern void __lockfunc rt_spin_try_unlock(spinlock_t *lock);
 extern void __lockfunc rt_spin_unlock_wait(spinlock_t *lock);
 extern int __lockfunc rt_spin_trylock_irqsave(spinlock_t *lock, unsigned long *flags);
 extern int __lockfunc rt_spin_trylock_bh(spinlock_t *lock);
--- a/kernel/rtmutex.c
+++ b/kernel/rtmutex.c
@@ -816,10 +816,8 @@ static void  noinline __sched rt_spin_lo
 /*
  * Slow path to release a rt_mutex spin_lock style
  */
-static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
+static void __sched __rt_spin_lock_slowunlock(struct rt_mutex *lock)
 {
-	raw_spin_lock(&lock->wait_lock);
-
 	debug_rt_mutex_unlock(lock);
 
 	rt_mutex_deadlock_account_unlock(current);
@@ -838,6 +836,21 @@ static void  noinline __sched rt_spin_lo
 	rt_mutex_adjust_prio(current);
 }
 
+static void  noinline __sched rt_spin_lock_slowunlock(struct rt_mutex *lock)
+{
+	raw_spin_lock(&lock->wait_lock);
+	__rt_spin_lock_slowunlock(lock);
+}
+
+static void  noinline __sched rt_spin_lock_try_slowunlock(struct rt_mutex *lock)
+{
+	int ret;
+
+	ret = raw_spin_trylock(&lock->wait_lock);
+	BUG_ON(!ret);
+	__rt_spin_lock_slowunlock(lock);
+}
+
 void __lockfunc rt_spin_lock(spinlock_t *lock)
 {
 	rt_spin_lock_fastlock(&lock->lock, rt_spin_lock_slowlock);
@@ -868,6 +881,13 @@ void __lockfunc rt_spin_unlock(spinlock_
 }
 EXPORT_SYMBOL(rt_spin_unlock);
 
+void __lockfunc rt_spin_try_unlock(spinlock_t *lock)
+{
+	/* NOTE: we always pass in '1' for nested, for simplicity */
+	spin_release(&lock->dep_map, 1, _RET_IP_);
+	rt_spin_lock_fastunlock(&lock->lock, rt_spin_lock_try_slowunlock);
+}
+
 void __lockfunc __rt_spin_unlock(struct rt_mutex *lock)
 {
 	rt_spin_lock_fastunlock(lock, rt_spin_lock_slowunlock);
@@ -1191,7 +1211,8 @@ rt_mutex_slowtrylock(struct rt_mutex *lo
 {
 	int ret = 0;
 
-	raw_spin_lock(&lock->wait_lock);
+	if (!raw_spin_trylock(&lock->wait_lock))
+		return ret;
 	init_lists(lock);
 
 	if (likely(rt_mutex_owner(lock) != current)) {
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1400,7 +1400,7 @@ unsigned long get_next_timer_interrupt(u
 		expires = base->next_timer;
 	}
 #ifdef CONFIG_PREEMPT_RT_FULL
-	rt_spin_unlock(&base->lock);
+	rt_spin_try_unlock(&base->lock);
 #else
 	spin_unlock(&base->lock);
 #endif


* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-22 13:59                       ` Sebastian Andrzej Siewior
@ 2013-11-22 16:08                         ` Peter Zijlstra
  2013-11-22 16:21                           ` Sebastian Andrzej Siewior
  2013-11-25 18:33                           ` Paul Gortmaker
  0 siblings, 2 replies; 39+ messages in thread
From: Peter Zijlstra @ 2013-11-22 16:08 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, Mike Galbraith, Frederic Weisbecker, LKML, RT,
	Paul E. McKenney

On Fri, Nov 22, 2013 at 02:59:31PM +0100, Sebastian Andrzej Siewior wrote:
> +extern void __lockfunc rt_spin_try_unlock(spinlock_t *lock);

I know what you mean, but.. try_unlock() just sounds wrong, how can we
attempt but fail to unlock a lock we hold ;-)


/me crawls back under his rock ;-)


* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-22 16:08                         ` Peter Zijlstra
@ 2013-11-22 16:21                           ` Sebastian Andrzej Siewior
  2013-11-22 17:44                             ` Sebastian Andrzej Siewior
  2013-11-25 18:33                           ` Paul Gortmaker
  1 sibling, 1 reply; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-11-22 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Mike Galbraith, Frederic Weisbecker, LKML, RT,
	Paul E. McKenney

On 11/22/2013 05:08 PM, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 02:59:31PM +0100, Sebastian Andrzej Siewior wrote:
>> +extern void __lockfunc rt_spin_try_unlock(spinlock_t *lock);
> 
> I know what you mean, but.. try_unlock() just sounds wrong, how can we
> attempt but fail to unlock a lock we hold ;-)

what about
- rt_spin_unlock_try_or_fail();
- __rt_spin_unlock_dont_use_me()

Couldn't we tell lockdep to bend the rules in this case?

> 
> 
> /me crawls back under his rock ;-)
:)

Sebastian


* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-22 16:21                           ` Sebastian Andrzej Siewior
@ 2013-11-22 17:44                             ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-11-22 17:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Mike Galbraith, Frederic Weisbecker, LKML, RT,
	Paul E. McKenney

On 11/22/2013 05:21 PM, Sebastian Andrzej Siewior wrote:
> On 11/22/2013 05:08 PM, Peter Zijlstra wrote:
>> On Fri, Nov 22, 2013 at 02:59:31PM +0100, Sebastian Andrzej Siewior wrote:
>>> +extern void __lockfunc rt_spin_try_unlock(spinlock_t *lock);
>>
>> I know what you mean, but.. try_unlock() just sounds wrong, how can we
>> attempt but fail to unlock a lock we hold ;-)

kernel BUG at kernel/rtmutex.c:850!
[<ffffffff815ee9d5>] rt_spin_try_unlock+0x25/0x30
[<ffffffff81091ccd>] get_next_timer_interrupt+0xdd/0x270

It depends how hard you try :P Bah.
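
Thinking about it some more, the trylock loses against a waiter on
another CPU, roughly like this (a sketch; the point being that the
wait_lock is taken by waiters too, not only by the owner):

	/*
	 *   CPU0 (owner, irqs off)        CPU1 (contender)
	 *   ----------------------        ----------------
	 *                                 rt_spin_lock_slowlock(&base->lock)
	 *                                   raw_spin_lock(&lock->wait_lock)
	 *                                   enqueue itself as waiter ...
	 *   get_next_timer_interrupt()
	 *     rt_spin_try_unlock(&base->lock)
	 *       rt_spin_lock_try_slowunlock()
	 *         raw_spin_trylock(&lock->wait_lock) -> fails
	 *         BUG_ON(!ret)                       -> the splat above
	 */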

Sebastian


* Re: [PATCH v2] rtmutex: take the waiter lock with irqs off
  2013-11-22 16:08                         ` Peter Zijlstra
  2013-11-22 16:21                           ` Sebastian Andrzej Siewior
@ 2013-11-25 18:33                           ` Paul Gortmaker
  1 sibling, 0 replies; 39+ messages in thread
From: Paul Gortmaker @ 2013-11-25 18:33 UTC (permalink / raw)
  To: Peter Zijlstra, Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, Mike Galbraith, Frederic Weisbecker, LKML, RT,
	Paul E. McKenney

On 13-11-22 11:08 AM, Peter Zijlstra wrote:
> On Fri, Nov 22, 2013 at 02:59:31PM +0100, Sebastian Andrzej Siewior wrote:
>> +extern void __lockfunc rt_spin_try_unlock(spinlock_t *lock);
> 
> I know what you mean, but.. try_unlock() just sounds wrong, how can we
> attempt but fail to unlock a lock we hold ;-)

rt-v3.10-devel$git grep -l tryunlock
tasklet-rt-prevent-tasklets-from-going-into-infinite-spin-in-rt.patch
rt-v3.10-devel$

> 
> 
> /me crawls back under his rock ;-)

Can you take tasklets with you?   ;-)


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-11-07 11:21       ` Thomas Gleixner
  2013-11-07 12:59         ` Frederic Weisbecker
  2013-11-07 13:07         ` CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo Mike Galbraith
@ 2013-12-20 15:41         ` Sebastian Andrzej Siewior
  2013-12-21  9:11           ` Mike Galbraith
  2 siblings, 1 reply; 39+ messages in thread
From: Sebastian Andrzej Siewior @ 2013-12-20 15:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Frederic Weisbecker, Peter Zijlstra, LKML, RT,
	Paul E. McKenney

On 07.11.13, Thomas Gleixner wrote:
> below should address that issue. If that works on mainline we can
> adapt it for RT (needs a trylock(&base->lock) there).

okay, so this is what I'm going to take for -RT:

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 79a7a35..bdbf77db 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -461,9 +461,8 @@ extern int schedule_hrtimeout_range_clock(ktime_t *expires,
 		unsigned long delta, const enum hrtimer_mode mode, int clock);
 extern int schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
 
-/* Soft interrupt function to run the hrtimer queues: */
+/* Called from the periodic timer tick */
 extern void hrtimer_run_queues(void);
-extern void hrtimer_run_pending(void);
 
 /* Bootup initialization: */
 extern void __init hrtimers_init(void);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index c383841..7aa442e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1694,30 +1694,6 @@ static void run_hrtimer_softirq(struct softirq_action *h)
 }
 
 /*
- * Called from timer softirq every jiffy, expire hrtimers:
- *
- * For HRT its the fall back code to run the softirq in the timer
- * softirq context in case the hrtimer initialization failed or has
- * not been done yet.
- */
-void hrtimer_run_pending(void)
-{
-	if (hrtimer_hres_active())
-		return;
-
-	/*
-	 * This _is_ ugly: We have to check in the softirq context,
-	 * whether we can switch to highres and / or nohz mode. The
-	 * clocksource switch happens in the timer interrupt with
-	 * xtime_lock held. Notification from there only sets the
-	 * check bit in the tick_oneshot code, otherwise we might
-	 * deadlock vs. xtime_lock.
-	 */
-	if (tick_check_oneshot_change(!hrtimer_is_hres_enabled()))
-		hrtimer_switch_to_hres();
-}
-
-/*
  * Called from hardirq context every jiffy
  */
 void hrtimer_run_queues(void)
@@ -1730,6 +1706,13 @@ void hrtimer_run_queues(void)
 	if (hrtimer_hres_active())
 		return;
 
+	/*
+	 * Check whether we can switch to highres mode.
+	 */
+	if (tick_check_oneshot_change(!hrtimer_is_hres_enabled())
+	    && hrtimer_switch_to_hres())
+		return;
+
 	for (index = 0; index < HRTIMER_MAX_CLOCK_BASES; index++) {
 		base = &cpu_base->clock_base[index];
 		if (!timerqueue_getnext(&base->active))
diff --git a/kernel/timer.c b/kernel/timer.c
index b06c647..46467be 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1443,8 +1443,6 @@ static void run_timer_softirq(struct softirq_action *h)
 	irq_work_run();
 #endif
 
-	hrtimer_run_pending();
-
 	if (time_after_eq(jiffies, base->timer_jiffies))
 		__run_timers(base);
 }
@@ -1454,8 +1452,27 @@ static void run_timer_softirq(struct softirq_action *h)
  */
 void run_local_timers(void)
 {
+	struct tvec_base *base = __this_cpu_read(tvec_bases);
+
 	hrtimer_run_queues();
-	raise_softirq(TIMER_SOFTIRQ);
+	/*
+	 * We can access this lockless as we are in the timer
+	 * interrupt. If there are no timers queued, nothing to do in
+	 * the timer softirq.
+	 */
+	if (!spin_do_trylock(&base->lock)) {
+		raise_softirq(TIMER_SOFTIRQ);
+		return;
+	}
+	if (!base->active_timers)
+		goto out;
+
+	/* Check whether the next pending timer has expired */
+	if (time_before_eq(base->next_timer, jiffies))
+		raise_softirq(TIMER_SOFTIRQ);
+out:
+	rt_spin_unlock_after_trylock_in_irq(&base->lock);
+
 }
 
 #ifdef __ARCH_WANT_SYS_ALARM
-- 
1.8.5.1

Sebastian


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-12-20 15:41         ` Sebastian Andrzej Siewior
@ 2013-12-21  9:11           ` Mike Galbraith
  2013-12-21 17:21             ` Muli Baron
  0 siblings, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-12-21  9:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Thomas Gleixner, Frederic Weisbecker, Peter Zijlstra, LKML, RT,
	Paul E. McKenney

On Fri, 2013-12-20 at 16:41 +0100, Sebastian Andrzej Siewior wrote: 
> On 07.11.13, Thomas Gleixner wrote:
> > below should address that issue. If that works on mainline we can
> > adapt it for RT (needs a trylock(&base->lock) there).
> 
> okay, so this is waht I'm going to take for -RT:

Works, modulo noisy workqueues.

rtbox:~ # sleep 35 && killall pert& cgexec -g cpuset:rtcpus taskset -c 3 pert 5
[1] 5660
2400.05 MHZ CPU
perturbation threshold 0.018 usecs.
pert/s:       33 >15.75us:        2 min:  0.04 max: 24.14 avg:  2.51 sum/s:    84us overhead: 0.01%
pert/s:       35 >15.54us:        3 min:  0.04 max: 24.89 avg:  2.39 sum/s:    84us overhead: 0.01%
pert/s:       30 >15.27us:        2 min:  0.04 max: 23.03 avg:  2.64 sum/s:    80us overhead: 0.01%
pert/s:       34 >15.12us:        3 min:  0.04 max: 25.03 avg:  2.51 sum/s:    86us overhead: 0.01%
pert/s:       31 >14.93us:        2 min:  0.04 max: 23.86 avg:  2.60 sum/s:    83us overhead: 0.01%
Terminated


* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-12-21  9:11           ` Mike Galbraith
@ 2013-12-21 17:21             ` Muli Baron
  2013-12-22  4:17               ` Mike Galbraith
  0 siblings, 1 reply; 39+ messages in thread
From: Muli Baron @ 2013-12-21 17:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-rt-users

On 21/12/2013 11:11, Mike Galbraith wrote:
>
> Works, modulo noisy workqueues.
>
> rtbox:~ # sleep 35 && killall pert& cgexec -g cpuset:rtcpus taskset -c 3 pert 5
> [1] 5660
> 2400.05 MHZ CPU
> perturbation threshold 0.018 usecs.
> pert/s:       33 >15.75us:        2 min:  0.04 max: 24.14 avg:  2.51 sum/s:    84us overhead: 0.01%
> pert/s:       35 >15.54us:        3 min:  0.04 max: 24.89 avg:  2.39 sum/s:    84us overhead: 0.01%
> pert/s:       30 >15.27us:        2 min:  0.04 max: 23.03 avg:  2.64 sum/s:    80us overhead: 0.01%
> pert/s:       34 >15.12us:        3 min:  0.04 max: 25.03 avg:  2.51 sum/s:    86us overhead: 0.01%
> pert/s:       31 >14.93us:        2 min:  0.04 max: 23.86 avg:  2.60 sum/s:    83us overhead: 0.01%
> Terminated
>

I can confirm this works for me as well, but I have noticed some strange
behavior under certain conditions.

If I run a process with SCHED_OTHER priority and pin it to a specific CPU
like Mike did, then all is well and everything functions as expected. If,
however, I keep ksoftirqd from running by running that process as
SCHED_FIFO for too long (say a few seconds) and later free the CPU, the
tick is never turned off again even though the CPU is completely idle,
and running that same SCHED_OTHER process from before now gets 1K timer
interrupts/s. I suspect this has something to do with the timer softirq
never running on that CPU again as a result of this patch.
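
If the timer softirq really is the missing link, the dependency would be
roughly this (an abbreviated sketch of the 3.12-rt run_timer_softirq()
quoted earlier in the thread, not the exact source):

	static void run_timer_softirq(struct softirq_action *h)
	{
		struct tvec_base *base = __this_cpu_read(tvec_bases);

	#ifdef CONFIG_PREEMPT_RT_FULL
		/*
		 * On -RT, pending irq_work -- including the nohz kick
		 * that re-evaluates whether the tick can be stopped --
		 * only runs from here.
		 */
		irq_work_run();
	#endif

		if (time_after_eq(jiffies, base->timer_jiffies))
			__run_timers(base);
	}

so once raise_softirq(TIMER_SOFTIRQ) is skipped because no timer is
pending, nothing on that CPU ever runs the irq_work that would let the
tick state be re-evaluated.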

Following is trace output during this condition for two consecutive ticks, when the CPU is 
completely idle (shielded by isolcpus, HZ=1000):

<idle>-0     [003] d..h3..   355.620760: hrtimer_cancel: hrtimer=ffff88011fd8b720
<idle>-0     [003] d..h2..   355.620760: hrtimer_expire_entry: hrtimer=ffff88011fd8b720 function=tick_sched_timer now=355487000349
<idle>-0     [003] d..h2..   355.620761: hrtimer_expire_exit: hrtimer=ffff88011fd8b720
<idle>-0     [003] d..h3..   355.620761: hrtimer_start: hrtimer=ffff88011fd8b720 function=tick_sched_timer expires=355488000000 softexpires=355488000000
<idle>-0     [003] ....2..   355.620762: cpu_idle: state=4294967295 cpu_id=3
<idle>-0     [003] d...2..   355.620762: cpu_idle: state=1 cpu_id=3

<idle>-0     [003] d..h3..   355.621760: hrtimer_cancel: hrtimer=ffff88011fd8b720
<idle>-0     [003] d..h2..   355.621761: hrtimer_expire_entry: hrtimer=ffff88011fd8b720 function=tick_sched_timer now=355488000330
<idle>-0     [003] d..h2..   355.621761: hrtimer_expire_exit: hrtimer=ffff88011fd8b720
<idle>-0     [003] d..h3..   355.621761: hrtimer_start: hrtimer=ffff88011fd8b720 function=tick_sched_timer expires=355489000000 softexpires=355489000000
<idle>-0     [003] ....2..   355.621762: cpu_idle: state=4294967295 cpu_id=3
<idle>-0     [003] d...2..   355.621762: cpu_idle: state=1 cpu_id=3

--Muli




* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-12-21 17:21             ` Muli Baron
@ 2013-12-22  4:17               ` Mike Galbraith
  2013-12-22  5:10                 ` Mike Galbraith
  2013-12-22  5:16                 ` Mike Galbraith
  0 siblings, 2 replies; 39+ messages in thread
From: Mike Galbraith @ 2013-12-22  4:17 UTC (permalink / raw)
  To: muli.baron; +Cc: linux-kernel, linux-rt-users

On Sat, 2013-12-21 at 19:21 +0200, Muli Baron wrote: 
> 
> I can confirm this works for me as well, but I have noticed some strange behavior under certain 
> conditions.

Hm, so can I.  Same everything, different day, nohz_full is NOT working.

-Mike



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-12-22  4:17               ` Mike Galbraith
@ 2013-12-22  5:10                 ` Mike Galbraith
  2013-12-22  5:37                   ` Mike Galbraith
  2013-12-22  5:16                 ` Mike Galbraith
  1 sibling, 1 reply; 39+ messages in thread
From: Mike Galbraith @ 2013-12-22  5:10 UTC (permalink / raw)
  To: muli.baron; +Cc: linux-kernel, linux-rt-users

On Sun, 2013-12-22 at 05:17 +0100, Mike Galbraith wrote: 
> On Sat, 2013-12-21 at 19:21 +0200, Muli Baron wrote: 
> > 
> > I can confirm this works for me as well, but I have noticed some strange behavior under certain 
> > conditions.
> 
> Hm, so can I.  Same everything, different day, nohz_full is NOT working.

DL980 with the same rt7 patchset/config but nohz_full=55-63 can take a
while to make up its mind, but does kick in (this boot anyway).

vogelweide:/abuild/mike/:[0]# cgexec -g cpuset:rtcpus taskset -c 60 pert 5
2260.90 MHZ CPU
perturbation threshold 0.024 usecs.
pert/s:     1000 >10.20us:        7 min:  7.21 max: 11.28 avg:  8.00 sum/s:  8000us overhead: 0.80%
pert/s:      600 >10.92us:        5 min:  1.64 max: 26.90 avg:  7.81 sum/s:  4690us overhead: 0.47%
pert/s:     1000 >10.79us:        4 min:  7.08 max: 11.49 avg:  7.86 sum/s:  7863us overhead: 0.79%
pert/s:     1000 >10.70us:        5 min:  7.15 max: 11.90 avg:  7.88 sum/s:  7878us overhead: 0.79%
pert/s:      937 >10.57us:        5 min:  4.30 max: 11.06 avg:  7.81 sum/s:  7324us overhead: 0.73%
pert/s:        1 >9.99us:        0 min:  4.10 max:  5.03 avg:  4.45 sum/s:     4us overhead: 0.00%
pert/s:        1 >9.77us:        1 min:  4.40 max: 10.13 avg:  5.49 sum/s:     7us overhead: 0.00%
pert/s:        2 >14.79us:        3 min:  2.65 max:104.68 avg: 15.21 sum/s:    30us overhead: 0.00%
pert/s:        1 >13.84us:        0 min:  4.31 max:  5.86 avg:  4.82 sum/s:     5us overhead: 0.00%
pert/s:        1 >13.18us:        0 min:  4.03 max:  9.17 avg:  5.19 sum/s:     6us overhead: 0.00%
pert/s:        1 >12.69us:        0 min:  2.83 max: 10.65 avg:  5.85 sum/s:     8us overhead: 0.00%

vogelweide:/abuild/mike/:[0]# cgexec -g cpuset:rtcpus taskset -c 60 pert 5
2260.90 MHZ CPU
perturbation threshold 0.024 usecs.
pert/s:        1 >9.11us:        2 min:  2.87 max: 11.95 avg:  6.12 sum/s:    10us overhead: 0.00%
pert/s:        1 >9.09us:        2 min:  3.09 max: 11.75 avg:  6.04 sum/s:     8us overhead: 0.00%
pert/s:        1 >22.94us:        2 min:  2.73 max:262.06 avg: 33.14 sum/s:    60us overhead: 0.01%
pert/s:        1 >21.53us:        0 min:  2.85 max: 12.02 avg:  5.69 sum/s:     7us overhead: 0.00%
pert/s:        1 >20.26us:        0 min:  3.38 max: 11.94 avg:  5.67 sum/s:     7us overhead: 0.00%
pert/s:        1 >19.12us:        0 min:  2.89 max: 12.11 avg:  5.72 sum/s:     7us overhead: 0.00%
pert/s:        1 >18.05us:        0 min:  3.12 max: 11.28 avg:  5.58 sum/s:     7us overhead: 0.00%
pert/s:        1 >17.24us:        0 min:  2.85 max: 13.20 avg:  6.72 sum/s:     9us overhead: 0.00%
pert/s:        2 >26.85us:        1 min:  2.70 max:201.53 avg: 24.99 sum/s:    50us overhead: 0.00%
pert/s:        1 >25.00us:        0 min:  2.81 max: 11.19 avg:  5.56 sum/s:     7us overhead: 0.00%
pert/s:        1 >23.33us:        0 min:  3.16 max: 11.11 avg:  5.59 sum/s:     7us overhead: 0.00%
pert/s:        1 >21.90us:        0 min:  2.87 max: 11.50 avg:  6.47 sum/s:    10us overhead: 0.00%
pert/s:        1 >20.57us:        0 min:  2.87 max: 11.21 avg:  5.99 sum/s:     8us overhead: 0.00%
pert/s:        1 >19.37us:        0 min:  3.22 max: 11.51 avg:  5.64 sum/s:     7us overhead: 0.00%
pert/s:        1 >18.26us:        0 min:  3.26 max: 10.72 avg:  5.97 sum/s:     8us overhead: 0.00%
pert/s:        1 >27.90us:        1 min:  2.67 max:202.74 avg: 26.52 sum/s:    48us overhead: 0.00%

Hm, some hefty perturbations every 30 seconds (sniff.. eau de vmstat).



* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-12-22  4:17               ` Mike Galbraith
  2013-12-22  5:10                 ` Mike Galbraith
@ 2013-12-22  5:16                 ` Mike Galbraith
  1 sibling, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2013-12-22  5:16 UTC (permalink / raw)
  To: muli.baron; +Cc: linux-kernel, linux-rt-users

On Sun, 2013-12-22 at 05:17 +0100, Mike Galbraith wrote: 
> On Sat, 2013-12-21 at 19:21 +0200, Muli Baron wrote: 
> > 
> > I can confirm this works for me as well, but I have noticed some strange behavior under certain 
> > conditions.
> 
> Hm, so can I.  Same everything, different day, nohz_full is NOT working.

But while I was off futzing with DL980, it started working all by
itself, after not working _at all_ for three reboot/play attempts.  

There's a phase-of-moon bug lurking in there somewhere.

-Mike




* Re: CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo
  2013-12-22  5:10                 ` Mike Galbraith
@ 2013-12-22  5:37                   ` Mike Galbraith
  0 siblings, 0 replies; 39+ messages in thread
From: Mike Galbraith @ 2013-12-22  5:37 UTC (permalink / raw)
  To: muli.baron; +Cc: linux-kernel, linux-rt-users

On Sun, 2013-12-22 at 06:10 +0100, Mike Galbraith wrote:

> DL980 with same rt7 patchset/config, but with nohz_full=55-63 can take a
> while to make up its mind, but does kick in (this boot anyway).

While rebuilding with distro config (lard=max, but nohz_full=off)...

[ 2230.960596] ------------[ cut here ]------------
[ 2231.008947] WARNING: CPU: 24 PID: 372 at kernel/time/tick-sched.c:161 can_stop_full_tick+0x1b3/0x1c0()
[ 2231.136916] Modules linked in: nfsv3 nfs_acl nfs fscache lockd sunrpc autofs4 edd af_packet bridge stp llc cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq fuse loop ipmi_si iTCO_wdt iTCO_vendor_support gpio_ich sg sr_mod cdrom joydev ipmi_msghandler netxen_nic bnx2 i7core_edac lpc_ich mfd_core edac_core acpi_power_meter pcspkr button ext3 jbd radeon ttm drm_kms_helper drm i2c_algo_bit thermal processor thermal_sys scsi_dh_rdac scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh ata_generic ata_piix hpsa cciss
[ 2231.633825] CPU: 24 PID: 372 Comm: ksoftirqd/24 Not tainted 3.12.6-rt7 #189
[ 2231.633827] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[ 2231.633831]  00000000ffffffff ffff88027459bbf8 ffffffff81573a2d ffffffff810b8583
[ 2231.633833]  00000000000000a1 ffff88027459bc38 ffffffff8104fe1c ffff88027459bc28
[ 2231.633834]  ffff88027710be40 ffff880277100000 0000000000000002 0000000000000000
[ 2231.633835] Call Trace:
[ 2231.633854]  [<ffffffff81573a2d>] dump_stack+0x4d/0xa8
[ 2231.633863]  [<ffffffff810b8583>] ? can_stop_full_tick+0x1b3/0x1c0
[ 2231.633872]  [<ffffffff8104fe1c>] warn_slowpath_common+0x8c/0xc0
[ 2231.633928]  [<ffffffff8104fe6a>] warn_slowpath_null+0x1a/0x20
[ 2231.633929]  [<ffffffff810b8583>] can_stop_full_tick+0x1b3/0x1c0
[ 2231.633931]  [<ffffffff810b8625>] __tick_nohz_full_check+0x95/0xb0
[ 2231.633933]  [<ffffffff810b864e>] nohz_full_kick_work_func+0xe/0x10
[ 2231.633941]  [<ffffffff8110a0a9>] __irq_work_run+0x69/0x80
[ 2231.633943]  [<ffffffff8110a0c9>] irq_work_run+0x9/0x10
[ 2231.633946]  [<ffffffff8105e1f4>] run_timer_softirq+0x24/0x2d0
[ 2231.633955]  [<ffffffff8108a2a5>] ? sched_clock_cpu+0xc5/0x110
[ 2231.633958]  [<ffffffff8108afc5>] ? account_system_time+0x175/0x1a0
[ 2231.633964]  [<ffffffff810560c6>] handle_softirq+0xa6/0x180
[ 2231.633966]  [<ffffffff810562f8>] do_current_softirqs+0x158/0x230
[ 2231.633968]  [<ffffffff81056403>] run_ksoftirqd+0x33/0x50
[ 2231.633971]  [<ffffffff8107ee82>] smpboot_thread_fn+0x232/0x360
[ 2231.633973]  [<ffffffff8107ec50>] ? smpboot_create_threads+0x80/0x80
[ 2231.633977]  [<ffffffff8107571e>] kthread+0xbe/0xd0
[ 2231.633978]  [<ffffffff81075660>] ? kthreadd+0x1d0/0x1d0
[ 2231.633988]  [<ffffffff8157fb2c>] ret_from_fork+0x7c/0xb0
[ 2231.633990]  [<ffffffff81075660>] ? kthreadd+0x1d0/0x1d0
[ 2231.633992] ---[ end trace 0000000000000002 ]---




Thread overview: 39+ messages
2013-10-31 14:07 CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo Mike Galbraith
2013-11-06 17:49 ` Thomas Gleixner
2013-11-07  3:26   ` Mike Galbraith
2013-11-07  4:31     ` Mike Galbraith
2013-11-07 11:21       ` Thomas Gleixner
2013-11-07 12:59         ` Frederic Weisbecker
2013-11-07 13:13           ` Thomas Gleixner
2013-11-12  8:06             ` Mike Galbraith
2013-11-12  9:28               ` Thomas Gleixner
2013-11-15 16:30                 ` [PATCH] rtmutex: take the waiter lock with irqs off Sebastian Andrzej Siewior
2013-11-15 20:14                   ` [PATCH v2] " Sebastian Andrzej Siewior
2013-11-18 14:10                     ` Peter Zijlstra
2013-11-18 17:56                       ` Peter Zijlstra
2013-11-18 23:59                       ` Frederic Weisbecker
2013-11-19  8:30                         ` Peter Zijlstra
2013-11-22 13:59                       ` Sebastian Andrzej Siewior
2013-11-22 16:08                         ` Peter Zijlstra
2013-11-22 16:21                           ` Sebastian Andrzej Siewior
2013-11-22 17:44                             ` Sebastian Andrzej Siewior
2013-11-25 18:33                           ` Paul Gortmaker
2013-11-07 13:07         ` CONFIG_NO_HZ_FULL + CONFIG_PREEMPT_RT_FULL = nogo Mike Galbraith
2013-12-20 15:41         ` Sebastian Andrzej Siewior
2013-12-21  9:11           ` Mike Galbraith
2013-12-21 17:21             ` Muli Baron
2013-12-22  4:17               ` Mike Galbraith
2013-12-22  5:10                 ` Mike Galbraith
2013-12-22  5:37                   ` Mike Galbraith
2013-12-22  5:16                 ` Mike Galbraith
2013-11-08  3:23       ` Paul E. McKenney
2013-11-08  7:31         ` Mike Galbraith
2013-11-08 12:37           ` Paul E. McKenney
2013-11-08 13:26             ` Mike Galbraith
2013-11-08 14:03               ` Paul E. McKenney
2013-11-08 14:21                 ` Mike Galbraith
2013-11-08 14:29                 ` Frederic Weisbecker
2013-11-08 14:45                   ` Paul E. McKenney
2013-11-08 16:03                     ` Frederic Weisbecker
2013-11-08 14:53                   ` Mike Galbraith
2013-11-08 16:04                     ` Frederic Weisbecker
