Re: rcu_sched stalls in idle task introduced in pre-4.8?

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Rich Felker <dalias@libc.org>
Cc: linux-kernel@vger.kernel.org, john.stultz@linaro.org, tglx@linutronix.de
Subject: Re: rcu_sched stalls in idle task introduced in pre-4.8?
Date: Mon, 12 Sep 2016 08:36:04 -0700	[thread overview]
Message-ID: <20160912153604.GN3851@linux.vnet.ibm.com> (raw)
In-Reply-To: <20160912151345.GW15995@brightrain.aerifal.cx>

On Mon, Sep 12, 2016 at 11:13:45AM -0400, Rich Felker wrote:
> On Sat, Sep 10, 2016 at 03:19:38AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 08, 2016 at 06:16:53PM -0400, Rich Felker wrote:
> > > On Wed, Aug 03, 2016 at 09:16:31AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Aug 02, 2016 at 01:45:04PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Aug 02, 2016 at 04:32:17PM -0400, Rich Felker wrote:
> > > > > > On Tue, Aug 02, 2016 at 12:48:02PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > > Does the problem reproduces easily?
> > > > > > 
> > > > > > Yes, it happens right after boot and repeats every 30-90 seconds or
> > > > > > so.
> > > > > 
> > > > > Well, that at least makes it easier to test any patches!
> > > > > 
> > > > > > > A bisection might be very helpful.
> > > > > > 
> > > > > > Bisection would require some manual work to setup because the whole
> > > > > > reason I was rebasing on Linus's tree was to adapt the drivers to
> > > > > > upstream infrastructure changes (the new cpuhp stuff replacing
> > > > > > notifier for cpu starting). The unfortunate way it was done, each
> > > > > > driver adds an enum to linux/cpuhotplug.h so all the patches have
> > > > > > gratuitous conflicts. In addition, for older revisions in Linus's
> > > > > > tree, there's at least one show-stopping (hang during boot) bug that
> > > > > > needs a cherry-pick to fix. There may be other small issues too. I
> > > > > > don't think they're at all insurmountible but it requires an annoying
> > > > > > amount of scripting.
> > > > > 
> > > > > I had to ask!  Might eventually be necessary, but let's see what we
> > > > > can learn from what you currently have.
> > > > 
> > > > And at first glance, my overnight run looks uglier than I would expect.
> > > > I am now running tests at v4.7, and will run other tests to see if
> > > > there really is a statistically significant degradation.  If there is,
> > > > then I might be able to bisect, though with nine-hour runs this could
> > > > take quite some time.
> > > 
> > > Any more thoughts on this? I'm testing v4.8-rc5 (plus jcore drivers
> > > not yet upstream) and it's still happening.
> > 
> > Not seeing it, but please do send me a recent splat from your dmesg and
> > your .config.
> > 
> > Because I am not seeing it, I also suggest inspecting your jcore drivers
> > with the information in Documentation/RCU/stallwarn.txt in mind.
> 
> I found a plausible cause -- it looks like the kernel timer system is
> programming the clock event device with intervals so low that another
> timer interrupt is pending before the first handler finishes, and it
> immediately interrupts the softirq part of the timer. Here's a snippet
> from ftrace event tracing:

That sort of thing could definitely result in RCU CPU stall warnings,
and much else besides!  Not so good for battery lifetime, for example,
if that is an issue in your case.

>           <idle>-0     [001] d.h.  2646.702790: irq_handler_entry: irq=72 name=jcore_pit
>           <idle>-0     [001] d.h.  2646.703398: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [001] d.h.  2646.703607: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [001] d.h.  2646.703884: softirq_raise: vec=7 [action=SCHED]
>           <idle>-0     [001] d.h.  2646.704191: irq_handler_exit: irq=72 ret=handled
>           <idle>-0     [001] d.H.  2646.704509: irq_handler_entry: irq=72 name=jcore_pit
>           <idle>-0     [001] d.H.  2646.704990: softirq_raise: vec=1 [action=TIMER]
>           <idle>-0     [001] d.H.  2646.705182: softirq_raise: vec=9 [action=RCU]
>           <idle>-0     [001] d.H.  2646.705465: softirq_raise: vec=7 [action=SCHED]
>           <idle>-0     [001] d.H.  2646.705761: irq_handler_exit: irq=72 ret=handled
>           <idle>-0     [001] d.H.  2646.706071: irq_handler_entry: irq=72 name=jcore_pit
>           <idle>-0     [001] d.H.  2646.706328: irq_handler_exit: irq=72 ret=handled
>           <idle>-0     [001] ..s.  2646.706530: softirq_entry: vec=1 [action=TIMER]
>           <idle>-0     [001] ..s.  2646.706708: softirq_exit: vec=1 [action=TIMER]
>           <idle>-0     [001] ..s.  2646.706816: softirq_entry: vec=7 [action=SCHED]
>           <idle>-0     [001] ..s.  2646.707270: softirq_exit: vec=7 [action=SCHED]
>           <idle>-0     [001] ..s.  2646.707397: softirq_entry: vec=9 [action=RCU]
>           <idle>-0     [001] ..s.  2646.707641: softirq_exit: vec=9 [action=RCU]
>      ksoftirqd/1-14    [001] ..s.  2646.708569: softirq_entry: vec=1 [action=TIMER]
>      ksoftirqd/1-14    [001] ..s.  2646.708727: softirq_exit: vec=1 [action=TIMER]
>      ksoftirqd/1-14    [001] ..s.  2646.708829: softirq_entry: vec=7 [action=SCHED]
>      ksoftirqd/1-14    [001] ..s.  2646.709073: softirq_exit: vec=7 [action=SCHED]
>      ksoftirqd/1-14    [001] ..s.  2646.709180: softirq_entry: vec=9 [action=RCU]
>      ksoftirqd/1-14    [001] ..s.  2646.709882: softirq_exit: vec=9 [action=RCU]
> 
> Presumably it usually just hits one or two levels of nesting like
> above, but in the worst case it gets in some sort of feedback loop and
> the backlog grows until it lasts >21 sec. I was able to see that there
> are plenty of shorter stalls (too short for the stall detector to kick
> in) by removing the SD card and observing a multi-second delay before:
> 
> 	[  401.375518] mmc0: SPI card removed
> 
> The spi mmc host code that does this probe (no irq) is scheduled to
> run every 1s, so if the message takes >1s to appear (>0.5s avg),
> timers must be stalled.

Makes sense!

> Other interrupt activity (like serial or ethernet) breaks the stall
> immediately so it's not noticable in most interactive usage.

Interesting...  This other interrupt activity stops the timer from
posting so often, or just runs at a higher interrupt priority level?

> I was able to make it go away by increasing the min_delta for the
> clock even device to 2.5ms (chosen because I measured timer interrupts
> taking up to ~1.5ms -- keep in mind this is at 50 MHz) but hard-coding
> a number like that does not seem like a proper solution. The kernel
> timer subsystem should have some way to choose oneshot expiration
> appropriately based on feedback, or it should otherwise ensure that
> the feedback loop can't happen (maybe by waiting to program the next
> expiration from the softirq instead of the hard one, but I'm not sure
> if that's possible...?).

I am not sure what the timer guys will want to do, but it might be that
they program at the beginning to increase accuracy.  I could imagine
varying the default min_delta value based on the processor frequency,
but I have not idea whether or not that is a reasonable approach.

							Thanx, Paul