LKML Archive on lore.kernel.org
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Sebastian Sewior <bigeasy@linutronix.de>,
	Anna-Maria Gleixner <anna-maria@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>
Subject: Re: [PATCH] hrtimer: Reset hrtimer cpu base proper on CPU hotplug
Date: Sat, 27 Jan 2018 16:53:17 -0800
Message-ID: <20180128005317.GA31914@linux.vnet.ibm.com>
In-Reply-To: <20180126220917.GI3741@linux.vnet.ibm.com>

On Fri, Jan 26, 2018 at 02:09:17PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 26, 2018 at 02:54:32PM +0100, Thomas Gleixner wrote:
> > The hrtimer interrupt code contains a hang detection and mitigation
> > mechanism, which prevents a long-delayed hrtimer interrupt from causing
> > continuous retriggering of interrupts, which would keep the system from making
> > progress. If a hang is detected then the timer hardware is programmed with
> > a certain delay into the future and a flag is set in the hrtimer cpu base
> > which prevents newly enqueued timers from reprogramming the timer hardware
> > prior to the chosen delay. The subsequent hrtimer interrupt after the delay
> > clears the flag and resumes normal operation.
> > 
> > If such a hang happens in the last hrtimer interrupt before a CPU is
> > unplugged then the hang_detected flag is set and stays that way when the
> > CPU is plugged in again. At that point the timer hardware is not armed and
> > it cannot be armed because the hang_detected flag is still active, so
> > nothing clears that flag. As a consequence the CPU does not receive hrtimer
> > interrupts and no timers expire on that CPU which results in RCU stalls and
> > other malfunctions.
> > 
> > Clear the flag along with some other less critical members of the hrtimer
> > cpu base to ensure starting from a clean state when a CPU is plugged in.
> > 
> > Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
> > root cause of that hard to reproduce heisenbug. Once understood it's
> > trivial and certainly justifies a brown paperbag.
> 
> Thank you very much, and I do know that feeling!  After reading the
> commit log, I feel significantly less incompetent for having failed to
> find this one.  ;-)  But it did pass rcutorture testing for a great many
> years, didn't it?  :-/
> 
> I have started an eight-hour seven-way test on the dreaded rcutorture
> TREE01 scenario.  In the meantime, off to the train!
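
The failure mode described in the quoted commit log can be illustrated with
a toy state machine.  This is only a sketch under stated assumptions: the
class ToyCpuBase and its methods are hypothetical names made up for this
illustration, not kernel code, and the model keeps nothing but the
hang_detected flag and an "armed" bit for the timer hardware.

```python
class ToyCpuBase:
    """Hypothetical, heavily simplified model of a per-CPU hrtimer base."""

    def __init__(self):
        self.hang_detected = False  # set when the interrupt code detects a hang
        self.hw_armed = False       # True if the timer hardware has an event pending

    def enqueue_timer(self):
        # While hang_detected is set, newly enqueued timers must not
        # reprogram the timer hardware.
        if self.hang_detected:
            return
        self.hw_armed = True

    def interrupt(self, hang=False):
        # An interrupt can only fire if the hardware was armed.
        if not self.hw_armed:
            return False
        self.hw_armed = False
        if hang:
            # Mitigation: program a delay into the future and set the flag.
            self.hang_detected = True
            self.hw_armed = True
        else:
            # The interrupt after the delay clears the flag.
            self.hang_detected = False
        return True

    def unplug(self):
        # Assumption of this model: the timer hardware is disarmed on unplug.
        self.hw_armed = False

    def plug(self, reset):
        # reset=True models clearing the flag when the CPU comes back.
        if reset:
            self.hang_detected = False


def timers_work_after_replug(reset):
    base = ToyCpuBase()
    base.enqueue_timer()
    base.interrupt(hang=True)  # last interrupt before unplug detects a hang
    base.unplug()
    base.plug(reset=reset)
    base.enqueue_timer()       # ignored if the stale flag is still set
    return base.interrupt()    # does the CPU ever get a timer interrupt?


print(timers_work_after_replug(reset=False))  # False: flag is stale, nothing arms
print(timers_work_after_replug(reset=True))   # True: clean state, timers expire
```

In this model the stale flag blocks the only path that could arm the
hardware, and an unarmed timer never produces the interrupt that would
clear the flag, which matches the deadlock described in the commit log.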

And bozo here forgot to disable tracing, so the runs take much longer
than the stated time.  And because I applied against v4.15-rc9, which
lacks the recent code to suppress stalls when dumping the trace log,
all runs get RCU CPU stall warnings.  :-/

But I can filter out those tracing-induced stall warnings easily enough,
and thus far there have been 84 successful 30-minute runs out of 112
total, for no failures in 42 hours of TREE01 execution.  Given the base
failure rate of 0.33 per hour, the probability of this happening by
chance is something like ten to the minus sixth power, so:

Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
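
The probability estimate above can be reproduced with a short calculation.
This is a sketch that assumes failures arrive as a Poisson process at the
stated base rate; the message does not say which model was used, and the
function name prob_no_failures is made up for this illustration.

```python
import math


def prob_no_failures(rate_per_hour, hours):
    """P(zero failures) under an assumed Poisson model with a constant rate."""
    return math.exp(-rate_per_hour * hours)


runs = 84            # successful runs so far
run_minutes = 30     # duration of each run
base_rate = 0.33     # failures per hour before the fix

hours = runs * run_minutes / 60          # total failure-free test time
p = prob_no_failures(base_rate, hours)   # chance of seeing this without a fix

print(f"{hours:.0f} hours, P(no failures by chance) = {p:.1e}")
```

With 84 runs of 30 minutes each this gives 42 hours and a probability on
the order of ten to the minus sixth power, consistent with the figures
quoted above.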

I will be running longer runs without tracing, but looking extremely
good thus far!  Thank you all very much!!!

							Thanx, Paul


Thread overview: 9+ messages
2018-01-26 13:54 Thomas Gleixner
2018-01-26 22:09 ` Paul E. McKenney
2018-01-28  0:53   ` Paul E. McKenney [this message]
2018-01-29  8:20   ` Sebastian Sewior
2018-01-29  9:57     ` Paul E. McKenney
2018-01-29 23:43       ` Paul E. McKenney
2018-01-30 21:03         ` Thomas Gleixner
2018-01-31  0:52           ` Paul E. McKenney
2018-01-27 14:31 ` [tip:timers/urgent] " tip-bot for Thomas Gleixner
