linux-kernel.vger.kernel.org archive mirror
* NMIs induced by 'perf top' hogging all CPU time
@ 2013-05-09 23:29 Dave Hansen
  2013-05-10 10:29 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Hansen @ 2013-05-09 23:29 UTC
  To: Andi Kleen, Peter Zijlstra, LKML, eranian

If I boot a recent kernel (bb9055b) and run 'perf top' on my machine, it
hangs.  It is 100% reproducible; it happens every single time.  If I'm
lucky, I'll get some of the hardlockup detection messages on the console.

A little bit of digging showed that the reason it hangs is that the
average x86 PMU NMI takes around 0.5ms (and *up* to about 1ms) to
complete, and we're taking thousands of them a second.  IOW, we're
spending virtually all the CPU time just processing perf events in NMI
context.
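
To make the arithmetic concrete, here is a rough back-of-the-envelope
sketch.  The 0.5ms figure is the measured average above; the rates of
1000 and 2000 NMIs per second are assumed stand-ins for "thousands of
them a second", and the function name is made up for illustration:

```python
def nmi_cpu_fraction(nmi_duration_s, nmis_per_second):
    """Fraction of one CPU's time spent inside NMI handlers, capped at 1.0."""
    return min(1.0, nmi_duration_s * nmis_per_second)


# 0.5 ms per NMI is the reported average; the rates are assumptions.
for rate in (1000, 2000):
    fraction = nmi_cpu_fraction(0.5e-3, rate)
    print(f"{rate} NMIs/s at 0.5 ms each: {fraction:.0%} of one CPU")
```

Under those assumptions, about half of a CPU is gone at 1000 NMIs per
second, and the CPU is saturated at 2000.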

When we get a bunch of these in a row, and they happen to occur while
running an hrtimer, the hrtimer code notices and complains:

	[ 1623.552970] hrtimer: interrupt took 20,981,718 ns

My personal "best" is:

	[  273.247983] hrtimer: interrupt took 1,015,221,558 ns

My suspicion is that my particular machine is just really slow at poking
the PMU counters for some reason.  It could be its NUMA properties or
something even more subtle.  But, the result is still that it falls flat
on its face with the defaults.  Passing "-c" to perf to manually raise
the sample period works around this, but it's _awfully_ easy to bring
down a system this way.

Any ideas what we can do here?  I'm planning on trying to track down the
source of _why_ the NMIs are so slow, but it does seem like we should be
able to notice and back off on the NMI rate when we spend so much time
in there.
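
One possible shape for such a back-off, as a minimal sketch only: it is
plain Python rather than kernel code, and the class name, the 30% CPU
budget, the 100 Hz floor, and the 4000 Hz starting rate are all
hypothetical values picked for illustration.  The idea is to halve the
allowed sample rate until rate times average NMI duration fits within
the budget:

```python
class SampleRateBackoff:
    """Hypothetical limiter: halve the sample rate while NMIs exceed a CPU budget."""

    def __init__(self, max_rate_hz, cpu_budget=0.3, min_rate_hz=100):
        self.rate_hz = max_rate_hz      # current allowed samples per second
        self.cpu_budget = cpu_budget    # allowed fraction of CPU time in NMIs
        self.min_rate_hz = min_rate_hz  # never throttle below this rate

    def observe(self, avg_nmi_seconds):
        """Feed in the measured average NMI duration; return the adjusted rate."""
        while (self.rate_hz > self.min_rate_hz
               and self.rate_hz * avg_nmi_seconds > self.cpu_budget):
            self.rate_hz = max(self.min_rate_hz, self.rate_hz // 2)
        return self.rate_hz


# 4000 Hz is an assumed starting rate; 0.5 ms is the average reported above.
limiter = SampleRateBackoff(max_rate_hz=4000)
print(limiter.observe(0.5e-3))
```

With a fast handler (a few microseconds) the rate would stay where it
started; only a slow machine like this one would get throttled.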

Andi suggested collecting some data about how the kernel was tuning the
sample period over time:

	https://www.sr71.net/~dave/intel/perf-hangs-201305/f1.html

There is one obviously goofy-looking point in there where the period
got set to 96!  We'd get another NMI 96 cycles after we exit the current
one!
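
For a sense of scale, a quick sketch comparing that period with the cost
of a single NMI.  The 2 GHz clock is an assumption (the actual clock
speed of this machine is not stated), and the helper name is made up:

```python
def handler_cycles(duration_s, clock_hz):
    """Approximate number of CPU cycles one NMI handler invocation takes."""
    return round(duration_s * clock_hz)


ASSUMED_CLOCK_HZ = 2_000_000_000  # hypothetical 2 GHz core clock
PERIOD_CYCLES = 96                # the goofy period from the data

cost = handler_cycles(0.5e-3, ASSUMED_CLOCK_HZ)  # 0.5 ms reported average
print(f"one NMI costs about {cost} cycles, "
      f"roughly {cost // PERIOD_CYCLES}x the {PERIOD_CYCLES}-cycle period")
```

At any plausible clock speed the handler takes several orders of
magnitude longer than the period, so the next NMI is effectively
pending the moment the current one returns.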

* Re: NMIs induced by 'perf top' hogging all CPU time
  2013-05-09 23:29 NMIs induced by 'perf top' hogging all CPU time Dave Hansen
@ 2013-05-10 10:29 ` Peter Zijlstra
  2013-05-10 20:10   ` Dave Hansen
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2013-05-10 10:29 UTC
  To: Dave Hansen; +Cc: Andi Kleen, LKML, eranian

On Thu, May 09, 2013 at 04:29:16PM -0700, Dave Hansen wrote:
> If I boot a recent kernel (bb9055b) and run 'perf top' on my machine, it
> hangs.  It is 100% reproducible; it happens every single time.  If I'm
> lucky, I'll get some of the hardlockup detection messages on the console.

Are you implicitly saying it worked as expected on previous kernels?

* Re: NMIs induced by 'perf top' hogging all CPU time
  2013-05-10 10:29 ` Peter Zijlstra
@ 2013-05-10 20:10   ` Dave Hansen
  0 siblings, 0 replies; 3+ messages in thread
From: Dave Hansen @ 2013-05-10 20:10 UTC
  To: Peter Zijlstra; +Cc: Andi Kleen, LKML, eranian

On 05/10/2013 03:29 AM, Peter Zijlstra wrote:
> On Thu, May 09, 2013 at 04:29:16PM -0700, Dave Hansen wrote:
>> If I boot a recent kernel (bb9055b) and run 'perf top' on my machine, it
>> hangs.  It is 100% reproducible; it happens every single time.  If I'm
>> lucky, I'll get some of the hardlockup detection messages on the console.
> 
> Are you implicitly saying it worked as expected on previous kernels?

Yes, it works on older kernels.  3.6.11 works, for instance.

This hardware is quite effective at finding bugs.  It makes bisecting
very challenging as not a lot of random mainline versions boot.
