From: Mike Travis <mike.travis@hpe.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"H. Peter Anvin" <hpa@zytor.com>, Don Zickus <dzickus@redhat.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Dimitri Sivanich <dimitri.sivanich@hpe.com>,
	Frank Ramsay <frank.ramsay@hpe.com>,
	Russ Anderson <russ.anderson@hpe.com>,
	Tony Ernst <tony.ernst@hpe.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
Date: Wed, 8 Mar 2017 07:17:13 -0800
Message-ID: <562edb0b-56f8-0ca4-c9ed-841a79e0cf2a@hpe.com>
In-Reply-To: <20170308102829.GA11864@gmail.com>



On 3/8/2017 2:28 AM, Ingo Molnar wrote:
> 
> * Mike Travis <mike.travis@hpe.com> wrote:
> 
>>
>>
>> On 3/6/2017 11:42 PM, Ingo Molnar wrote:
>>>
>>> * Mike Travis <mike.travis@hpe.com> wrote:
>>>
>>>> Add a new NMI call chain that is called last after all other NMI handlers
>>>> have been checked and did not "handle" the NMI.  This mimics the current
>>>> NMI_UNKNOWN call chain except it eliminates the WARNING message about
>>>> multiple NMI handlers registering on this call chain.
>>>>
>>>> This call chain dramatically lowers the NMI call frequency when high
>>>> frequency NMI tools are in use, notably the perf tools.  It is required
>>>> for NMI handlers that cannot sustain a high NMI call rate without
>>>> ramifications to the system operability.
>>>
>>> So how about we just turn off that warning instead? I don't remember the last time 
>>> it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
>>> of problems...
>>
>> I can do that, with an even simpler patch...
>>
>>>
>>> It's not like we warn about excess regular IRQs either - we either handle them or 
>>> at most increase a counter somewhere. We could do the same for NMIs: introduce a 
>>> counter somewhere that counts the number of seemingly unhandled NMIs.
>>
>> Really "unknown" NMI errors are reported by either the "dazed and
>> confused" message or if the panic on unknown nmi is set, then the
>> system will panic.  So unknown NMI occurrences are already being
>> dealt with.
> 
> So I'd even remove the 'dazed and confused' message - has it ever helped us?

I can remove it though it seems to have become an institution, or more
correctly, a common term of reference. :)  It does precede the decision
to either attempt to continue system operation, or panic the system
immediately.

> If NMIs are generated but not handled properly then developers and users will 
> notice it just like when IRQs are lost: either through bad system behavior or via 
> weird stats in procfs. The kernel log should not get spammed.

Having some notice is probably a good thing even if for archaic reasons.
We recently discovered that an internal system error triggered an NMI
event.  Without any notice, the system would not have been suspected of
acting strangely, but data could potentially have been silently lost.
(NMI seems by far the least standard standard in the x86 architecture.)

Also, I don't think IRQs and NMIs are in the same league.  Missing an
IRQ means an expected I/O operation did not occur.  Prudent drivers can
set a timeout to notice missing interrupts.

Missing an NMI usually means that something unexpected occurred but was
not dealt with.  Losing perf interrupts is recoverable since there
will be another along shortly.  But missing an NMI due to a system
failure event is not.  (NMI is heavily overloaded and not very
standardized.)

> 
> So if you could expose the lost NMI stats via procfs or debugfs then we could 
> remove both the warning and the dazed-and-confused spam on the system log.

I can add this.

> 
> This should make perf all around more usable on UV systems, right?

I'm not sure this is accurate.  Perf is currently very usable on UV.
But as we increase our online fault analysis procedures, this warning
message stood out as a glaring example of a false positive.  Note that
it warns of nothing except that more than one NMI handler has
registered on this "call after all other handlers have been called
and did not claim the NMI" chain.
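
For illustration only, here is a minimal userspace sketch of the ordering
discussed in this thread: a "last" chain that runs only when no ordinary
handler claimed the NMI, plus a counter of unclaimed NMIs in place of a
log message.  It is not kernel code and not part of the patch; every
name in it (CHAIN_NORMAL, CHAIN_LAST, register_handler, dispatch_nmi,
unclaimed_nmis, and the two sample handlers) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; these names are NOT the kernel's NMI API. */
enum chain { CHAIN_NORMAL, CHAIN_LAST, CHAIN_MAX };

typedef int (*nmi_fn)(void);    /* returns the number of events it claimed */

#define MAX_HANDLERS 4

struct chain_desc {
	nmi_fn fn[MAX_HANDLERS];
	size_t n;
};

static struct chain_desc chains[CHAIN_MAX];
static unsigned long unclaimed_nmis;    /* a counter instead of a log message */

/* Any number of handlers may register on any chain; no warning is issued. */
static void register_handler(enum chain c, nmi_fn fn)
{
	assert(chains[c].n < MAX_HANDLERS);
	chains[c].fn[chains[c].n++] = fn;
}

/* Call every handler on one chain and sum what they claimed. */
static int run_chain(enum chain c)
{
	int handled = 0;

	for (size_t i = 0; i < chains[c].n; i++)
		handled += chains[c].fn[i]();
	return handled;
}

/* Walk the chains in order; a later chain runs only if earlier ones claimed nothing. */
static int dispatch_nmi(void)
{
	for (int c = 0; c < CHAIN_MAX; c++) {
		int handled = run_chain((enum chain)c);

		if (handled)
			return handled;
	}
	unclaimed_nmis++;       /* nobody claimed it: just count it */
	return 0;
}

/* Sample high-frequency source (stands in for something like perf). */
static int perf_pending;

static int perf_like_handler(void)
{
	if (!perf_pending)
		return 0;
	perf_pending = 0;
	return 1;
}

/* Sample handler that cannot sustain a high call rate. */
static int slow_handler_calls;

static int slow_platform_handler(void)
{
	slow_handler_calls++;
	return 0;               /* did not find anything to claim */
}
```

With perf_like_handler on CHAIN_NORMAL and slow_platform_handler on
CHAIN_LAST, the slow handler is skipped whenever the high-frequency
source claims the NMI, which is the rate reduction the patch description
is after.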

So let me know if I should go ahead with the above (remove some or
all indications that an unclaimed NMI event occurred, and add a
reporting facility for NMI stats).

Thanks!
Mike
