All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Add and Use NMI LAST call chain to eliminate WARNING message
@ 2017-03-06 18:17 Mike Travis
  2017-03-06 18:17 ` [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain Mike Travis
  2017-03-06 18:17 ` [PATCH 2/2] x86/platform/uv: Use " Mike Travis
  0 siblings, 2 replies; 11+ messages in thread
From: Mike Travis @ 2017-03-06 18:17 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Don Zickus, Peter Zijlstra
  Cc: Dimitri Sivanich, Frank Ramsay, Russ Anderson, Tony Ernst, x86,
	linux-kernel


Patch 1 adds a new NMI_LAST call chain that mimics the current NMI_UNKNOWN
call chain except it eliminates the WARNING message about multiple NMI
handlers registering on this call chain.

Patch 2 moves the UV NMI handler from using the NMI_UNKNOWN call chain
to this new NMI_LAST call chain.

-- 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-06 18:17 [PATCH 0/2] Add and Use NMI LAST call chain to eliminate WARNING message Mike Travis
@ 2017-03-06 18:17 ` Mike Travis
  2017-03-07  7:42   ` Ingo Molnar
  2017-03-06 18:17 ` [PATCH 2/2] x86/platform/uv: Use " Mike Travis
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Travis @ 2017-03-06 18:17 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Don Zickus, Peter Zijlstra
  Cc: Dimitri Sivanich, Frank Ramsay, Russ Anderson, Tony Ernst, x86,
	linux-kernel

[-- Attachment #1: x86_add_nmi_remote_call --]
[-- Type: text/plain, Size: 2295 bytes --]

Add a new NMI call chain that is called last after all other NMI handlers
have been checked and did not "handle" the NMI.  This mimics the current
NMI_UNKNOWN call chain except it eliminates the WARNING message about
multiple NMI handlers registering on this call chain.

This call chain dramatically lowers the NMI call frequency when high
frequency NMI tools are in use, notably the perf tools.  It is required
for NMI handlers that cannot sustain a high NMI call rate without
ramifications to the system operability.


Signed-off-by: Mike Travis <mike.travis@hpe.com>
Reviewed-by: Russ Anderson <russ.anderson@hpe.com>
---
 arch/x86/include/asm/nmi.h |    1 +
 arch/x86/kernel/nmi.c      |   21 ++++++++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

--- linux-4.4.orig/arch/x86/include/asm/nmi.h
+++ linux-4.4/arch/x86/include/asm/nmi.h
@@ -28,6 +28,7 @@ enum {
 	NMI_UNKNOWN,
 	NMI_SERR,
 	NMI_IO_CHECK,
+	NMI_LAST,
 	NMI_MAX
 };
 
--- linux-4.4.orig/arch/x86/kernel/nmi.c
+++ linux-4.4/arch/x86/kernel/nmi.c
@@ -57,6 +57,10 @@ static struct nmi_desc nmi_desc[NMI_MAX]
 		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
 		.head = LIST_HEAD_INIT(nmi_desc[3].head),
 	},
+	{
+		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[4].head),
+	},
 
 };
 
@@ -65,6 +69,7 @@ struct nmi_stats {
 	unsigned int unknown;
 	unsigned int external;
 	unsigned int swallow;
+	unsigned int last;
 };
 
 static DEFINE_PER_CPU(struct nmi_stats, nmi_stats);
@@ -312,6 +317,20 @@ unknown_nmi_error(unsigned char reason,
 }
 NOKPROBE_SYMBOL(unknown_nmi_error);
 
+static void check_nmi_last(unsigned char reason, struct pt_regs *regs)
+{
+	int handled;
+
+	/* Check low frequency, multiple CPU NMI handlers */
+	handled = nmi_handle(NMI_LAST, regs);
+	__this_cpu_add(nmi_stats.last, handled);
+	if (handled)
+		return;
+
+	unknown_nmi_error(reason, regs);
+}
+NOKPROBE_SYMBOL(check_nmi_last);
+
 static DEFINE_PER_CPU(bool, swallow_nmi);
 static DEFINE_PER_CPU(unsigned long, last_nmi_rip);
 
@@ -423,7 +442,7 @@ static void default_do_nmi(struct pt_reg
 	if (b2b && __this_cpu_read(swallow_nmi))
 		__this_cpu_add(nmi_stats.swallow, 1);
 	else
-		unknown_nmi_error(reason, regs);
+		check_nmi_last(reason, regs);
 }
 NOKPROBE_SYMBOL(default_do_nmi);
 

-- 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2] x86/platform/uv: Use low priority low frequency NMI call chain
  2017-03-06 18:17 [PATCH 0/2] Add and Use NMI LAST call chain to eliminate WARNING message Mike Travis
  2017-03-06 18:17 ` [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain Mike Travis
@ 2017-03-06 18:17 ` Mike Travis
  1 sibling, 0 replies; 11+ messages in thread
From: Mike Travis @ 2017-03-06 18:17 UTC (permalink / raw)
  To: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Don Zickus, Peter Zijlstra
  Cc: Dimitri Sivanich, Frank Ramsay, Russ Anderson, Tony Ernst, x86,
	linux-kernel

[-- Attachment #1: uv_use_nmi_remote --]
[-- Type: text/plain, Size: 1621 bytes --]

Use the new NMI call chain NMI_LAST instead of the NMI_UNKNOWN call
chain to eliminate the WARNING message about having multiple NMI
handlers registering on the NMI UNKNOWN call chain.

This "call only when not already claimed by another NMI handler" is
required as the UV architecture cannot sustain a high rate of reads
to the MMR's on the UV HUB's to verify if the NMI should be claimed.
This not only slows down the system responsiveness, it can result in
a system lockup.

Signed-off-by: Mike Travis <mike.travis@hpe.com>
Reviewed-by: Russ Anderson <russ.anderson@hpe.com>
---
 arch/x86/platform/uv/uv_nmi.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- linux-4.4.orig/arch/x86/platform/uv/uv_nmi.c
+++ linux-4.4/arch/x86/platform/uv/uv_nmi.c
@@ -53,7 +53,7 @@
  * disrupts the UV Hub's primary mission of directing NumaLink traffic and
  * can cause system problems to occur.
  *
- * To do this we register our primary NMI notifier on the NMI_UNKNOWN
+ * To do this we register our primary NMI notifier on the NMI_LAST
  * chain.  This reduces the number of false NMI calls when the perf
  * tools are running which generate an enormous number of NMIs per
  * second (~4M/s for 1024 CPU threads).  Our secondary NMI handler is
@@ -992,7 +992,7 @@ static int uv_handle_nmi_ping(unsigned i
 
 static void uv_register_nmi_notifier(void)
 {
-	if (register_nmi_handler(NMI_UNKNOWN, uv_handle_nmi, 0, "uv"))
+	if (register_nmi_handler(NMI_LAST, uv_handle_nmi, 0, "uv"))
 		pr_warn("UV: NMI handler failed to register\n");
 
 	if (register_nmi_handler(NMI_LOCAL, uv_handle_nmi_ping, 0, "uvping"))

-- 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-06 18:17 ` [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain Mike Travis
@ 2017-03-07  7:42   ` Ingo Molnar
  2017-03-07 15:22     ` Don Zickus
  2017-03-07 15:38     ` Mike Travis
  0 siblings, 2 replies; 11+ messages in thread
From: Ingo Molnar @ 2017-03-07  7:42 UTC (permalink / raw)
  To: Mike Travis
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Don Zickus,
	Peter Zijlstra, Dimitri Sivanich, Frank Ramsay, Russ Anderson,
	Tony Ernst, x86, linux-kernel


* Mike Travis <mike.travis@hpe.com> wrote:

> Add a new NMI call chain that is called last after all other NMI handlers
> have been checked and did not "handle" the NMI.  This mimics the current
> NMI_UNKNOWN call chain except it eliminates the WARNING message about
> multiple NMI handlers registering on this call chain.
> 
> This call chain dramatically lowers the NMI call frequency when high
> frequency NMI tools are in use, notably the perf tools.  It is required
> for NMI handlers that cannot sustain a high NMI call rate without
> ramifications to the system operability.

So how about we just turn off that warning instead? I don't remember the last time 
it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
of problems...

It's not like we warn about excess regular IRQs either - we either handle them or 
at most increase a counter somewhere. We could do the same for NMIs: introduce a 
counter somewhere that counts the number of seemingly unhandled NMIs.

But in any case, we should not spam the kernel log, neither with high, nor with 
low frequency.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-07  7:42   ` Ingo Molnar
@ 2017-03-07 15:22     ` Don Zickus
  2017-03-07 16:00       ` Mike Travis
  2017-03-07 15:38     ` Mike Travis
  1 sibling, 1 reply; 11+ messages in thread
From: Don Zickus @ 2017-03-07 15:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Travis, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Dimitri Sivanich, Frank Ramsay, Russ Anderson,
	Tony Ernst, x86, linux-kernel

On Tue, Mar 07, 2017 at 08:42:10AM +0100, Ingo Molnar wrote:
> 
> * Mike Travis <mike.travis@hpe.com> wrote:
> 
> > Add a new NMI call chain that is called last after all other NMI handlers
> > have been checked and did not "handle" the NMI.  This mimics the current
> > NMI_UNKNOWN call chain except it eliminates the WARNING message about
> > multiple NMI handlers registering on this call chain.
> > 
> > This call chain dramatically lowers the NMI call frequency when high
> > frequency NMI tools are in use, notably the perf tools.  It is required
> > for NMI handlers that cannot sustain a high NMI call rate without
> > ramifications to the system operability.
> 
> So how about we just turn off that warning instead? I don't remember the last time 
> it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
> of problems...

Yeah, that is one way to solve it. :-)

Originally I put that in there because the unknown nmi handlers sometime do
not return, making it impossible for the second handler to run.

But you are right, it probably hasn't really helped find any problems.  I
wasn't aware of problems it was causing (not that I was looking through
emails to find them either).

Cheers,
Don

> 
> It's not like we warn about excess regular IRQs either - we either handle them or 
> at most increase a counter somewhere. We could do the same for NMIs: introduce a 
> counter somewhere that counts the number of seemingly unhandled NMIs.
> 
> But in any case, we should not spam the kernel log, neither with high, nor with 
> low frequency.
> 
> Thanks,
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-07  7:42   ` Ingo Molnar
  2017-03-07 15:22     ` Don Zickus
@ 2017-03-07 15:38     ` Mike Travis
  2017-03-08 10:28       ` Ingo Molnar
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Travis @ 2017-03-07 15:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Don Zickus,
	Peter Zijlstra, Dimitri Sivanich, Frank Ramsay, Russ Anderson,
	Tony Ernst, x86, linux-kernel



On 3/6/2017 11:42 PM, Ingo Molnar wrote:
> 
> * Mike Travis <mike.travis@hpe.com> wrote:
> 
>> Add a new NMI call chain that is called last after all other NMI handlers
>> have been checked and did not "handle" the NMI.  This mimics the current
>> NMI_UNKNOWN call chain except it eliminates the WARNING message about
>> multiple NMI handlers registering on this call chain.
>>
>> This call chain dramatically lowers the NMI call frequency when high
>> frequency NMI tools are in use, notably the perf tools.  It is required
>> for NMI handlers that cannot sustain a high NMI call rate without
>> ramifications to the system operability.
> 
> So how about we just turn off that warning instead? I don't remember the last time 
> it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
> of problems...

I can do that, with an even simpler patch...

> 
> It's not like we warn about excess regular IRQs either - we either handle them or 
> at most increase a counter somewhere. We could do the same for NMIs: introduce a 
> counter somewhere that counts the number of seemingly unhandled NMIs.

Really "unknown" NMI errors are reported by either the "dazed and
confused" message or if the panic on unknown nmi is set, then the
system will panic.  So unknown NMI occurrences are already being
dealt with.

Plus the following stats are being collected, though I'm not sure of
any reporting facility:

struct nmi_stats {
        unsigned int normal;
        unsigned int unknown;
        unsigned int external;
        unsigned int swallow;
};

static DEFINE_PER_CPU(struct nmi_stats, nmi_stats);


> 
> But in any case, we should not spam the kernel log, neither with high, nor with 
> low frequency.
> 
> Thanks,
> 
> 	Ingo
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-07 15:22     ` Don Zickus
@ 2017-03-07 16:00       ` Mike Travis
  2017-03-07 16:07         ` Don Zickus
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Travis @ 2017-03-07 16:00 UTC (permalink / raw)
  To: Don Zickus, Ingo Molnar
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra,
	Dimitri Sivanich, Frank Ramsay, Russ Anderson, Tony Ernst, x86,
	linux-kernel



On 3/7/2017 7:22 AM, Don Zickus wrote:
> On Tue, Mar 07, 2017 at 08:42:10AM +0100, Ingo Molnar wrote:
>>
>> * Mike Travis <mike.travis@hpe.com> wrote:
>>
>>> Add a new NMI call chain that is called last after all other NMI handlers
>>> have been checked and did not "handle" the NMI.  This mimics the current
>>> NMI_UNKNOWN call chain except it eliminates the WARNING message about
>>> multiple NMI handlers registering on this call chain.
>>>
>>> This call chain dramatically lowers the NMI call frequency when high
>>> frequency NMI tools are in use, notably the perf tools.  It is required
>>> for NMI handlers that cannot sustain a high NMI call rate without
>>> ramifications to the system operability.
>>
>> So how about we just turn off that warning instead? I don't remember the last time 
>> it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
>> of problems...
> 
> Yeah, that is one way to solve it. :-)

Actually just removing the WARNING indication and making it an
INFO message would be enough to quiet objections.  Is that enough,
or should the message be completely removed for UNKNOWN NMI
handlers, and left in place for IO_CHECK and SERR NMI handlers?

> 
> Originally I put that in there because the unknown nmi handlers sometime do
> not return, making it impossible for the second handler to run.

The only two external unknown NMI handlers that I know of is the UV
one and the KGDB one.  The KGDB one appears to be only claimed if it
is exiting an NMI_LOCAL event.  And the UV one is only claimed if
it as caused by a UV System NMI event.  So truly unknown NMI events
eventually get to the internal unknown nmi handler.

> But you are right, it probably hasn't really helped find any problems.  I
> wasn't aware of problems it was causing (not that I was looking through
> emails to find them either).
> 
> Cheers,
> Don
> 
>>
>> It's not like we warn about excess regular IRQs either - we either handle them or 
>> at most increase a counter somewhere. We could do the same for NMIs: introduce a 
>> counter somewhere that counts the number of seemingly unhandled NMIs.
>>
>> But in any case, we should not spam the kernel log, neither with high, nor with 
>> low frequency.
>>
>> Thanks,
>>
>> 	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-07 16:00       ` Mike Travis
@ 2017-03-07 16:07         ` Don Zickus
  2017-03-07 16:13           ` Mike Travis
  0 siblings, 1 reply; 11+ messages in thread
From: Don Zickus @ 2017-03-07 16:07 UTC (permalink / raw)
  To: Mike Travis
  Cc: Ingo Molnar, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Dimitri Sivanich, Frank Ramsay, Russ Anderson,
	Tony Ernst, x86, linux-kernel

On Tue, Mar 07, 2017 at 08:00:33AM -0800, Mike Travis wrote:
> 
> 
> On 3/7/2017 7:22 AM, Don Zickus wrote:
> > On Tue, Mar 07, 2017 at 08:42:10AM +0100, Ingo Molnar wrote:
> >>
> >> * Mike Travis <mike.travis@hpe.com> wrote:
> >>
> >>> Add a new NMI call chain that is called last after all other NMI handlers
> >>> have been checked and did not "handle" the NMI.  This mimics the current
> >>> NMI_UNKNOWN call chain except it eliminates the WARNING message about
> >>> multiple NMI handlers registering on this call chain.
> >>>
> >>> This call chain dramatically lowers the NMI call frequency when high
> >>> frequency NMI tools are in use, notably the perf tools.  It is required
> >>> for NMI handlers that cannot sustain a high NMI call rate without
> >>> ramifications to the system operability.
> >>
> >> So how about we just turn off that warning instead? I don't remember the last time 
> >> it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
> >> of problems...
> > 
> > Yeah, that is one way to solve it. :-)
> 
> Actually just removing the WARNING indication and making it an
> INFO message would be enough to quiet objections.  Is that enough,

That might be useful to at least log in, but as Ingo said it just might be
overkill in the real world.

> or should the message be completely removed for UNKNOWN NMI
> handlers, and left in place for IO_CHECK and SERR NMI handlers?
> 
> > 
> > Originally I put that in there because the unknown nmi handlers sometime do
> > not return, making it impossible for the second handler to run.
> 
> The only two external unknown NMI handlers that I know of is the UV
> one and the KGDB one.  The KGDB one appears to be only claimed if it
> is exiting an NMI_LOCAL event.  And the UV one is only claimed if
> it as caused by a UV System NMI event.  So truly unknown NMI events
> eventually get to the internal unknown nmi handler.

Good to know.  I keep thinking of the hpwdt that wants to eat all NMIs, I
believe that still registers on the unknown nmi (drivers/watchdog/hpwdt.c).

Cheers,
Don

> 
> > But you are right, it probably hasn't really helped find any problems.  I
> > wasn't aware of problems it was causing (not that I was looking through
> > emails to find them either).
> > 
> > Cheers,
> > Don
> > 
> >>
> >> It's not like we warn about excess regular IRQs either - we either handle them or 
> >> at most increase a counter somewhere. We could do the same for NMIs: introduce a 
> >> counter somewhere that counts the number of seemingly unhandled NMIs.
> >>
> >> But in any case, we should not spam the kernel log, neither with high, nor with 
> >> low frequency.
> >>
> >> Thanks,
> >>
> >> 	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-07 16:07         ` Don Zickus
@ 2017-03-07 16:13           ` Mike Travis
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Travis @ 2017-03-07 16:13 UTC (permalink / raw)
  To: Don Zickus
  Cc: Ingo Molnar, Ingo Molnar, Thomas Gleixner, H. Peter Anvin,
	Peter Zijlstra, Dimitri Sivanich, Frank Ramsay, Russ Anderson,
	Tony Ernst, x86, linux-kernel



On 3/7/2017 8:07 AM, Don Zickus wrote:
> Good to know.  I keep thinking of the hpwdt that wants to eat all NMIs, I
> believe that still registers on the unknown nmi (drivers/watchdog/hpwdt.c).

Thanks for pointing that out.  I'm guessing this happens only on
HPE systems, which are now us... :)  I'll contact Thomas for more info.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-07 15:38     ` Mike Travis
@ 2017-03-08 10:28       ` Ingo Molnar
  2017-03-08 15:17         ` Mike Travis
  0 siblings, 1 reply; 11+ messages in thread
From: Ingo Molnar @ 2017-03-08 10:28 UTC (permalink / raw)
  To: Mike Travis
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Don Zickus,
	Peter Zijlstra, Dimitri Sivanich, Frank Ramsay, Russ Anderson,
	Tony Ernst, x86, linux-kernel


* Mike Travis <mike.travis@hpe.com> wrote:

> 
> 
> On 3/6/2017 11:42 PM, Ingo Molnar wrote:
> > 
> > * Mike Travis <mike.travis@hpe.com> wrote:
> > 
> >> Add a new NMI call chain that is called last after all other NMI handlers
> >> have been checked and did not "handle" the NMI.  This mimics the current
> >> NMI_UNKNOWN call chain except it eliminates the WARNING message about
> >> multiple NMI handlers registering on this call chain.
> >>
> >> This call chain dramatically lowers the NMI call frequency when high
> >> frequency NMI tools are in use, notably the perf tools.  It is required
> >> for NMI handlers that cannot sustain a high NMI call rate without
> >> ramifications to the system operability.
> > 
> > So how about we just turn off that warning instead? I don't remember the last time 
> > it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
> > of problems...
> 
> I can do that, with an even simpler patch...
> 
> > 
> > It's not like we warn about excess regular IRQs either - we either handle them or 
> > at most increase a counter somewhere. We could do the same for NMIs: introduce a 
> > counter somewhere that counts the number of seemingly unhandled NMIs.
> 
> Really "unknown" NMI errors are reported by either the "dazed and
> confused" message or if the panic on unknown nmi is set, then the
> system will panic.  So unknown NMI occurrences are already being
> dealt with.

So I'd even remove the 'dazed and confused' message - has it ever helped us?

If NMIs are generated but not handled properly then developers and users will 
notice it just like when IRQs are lost: either through bad system behavior or via 
weird stats in procfs. The kernel log should not get spammed.

So if you could expose the lost NMI stats via procfs or debugfs then we could 
remove both the warning and the dazed-and-confused spam on the system log.

This should make perf all around more usable on UV systems, right?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain
  2017-03-08 10:28       ` Ingo Molnar
@ 2017-03-08 15:17         ` Mike Travis
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Travis @ 2017-03-08 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, Don Zickus,
	Peter Zijlstra, Dimitri Sivanich, Frank Ramsay, Russ Anderson,
	Tony Ernst, x86, linux-kernel



On 3/8/2017 2:28 AM, Ingo Molnar wrote:
> 
> * Mike Travis <mike.travis@hpe.com> wrote:
> 
>>
>>
>> On 3/6/2017 11:42 PM, Ingo Molnar wrote:
>>>
>>> * Mike Travis <mike.travis@hpe.com> wrote:
>>>
>>>> Add a new NMI call chain that is called last after all other NMI handlers
>>>> have been checked and did not "handle" the NMI.  This mimics the current
>>>> NMI_UNKNOWN call chain except it eliminates the WARNING message about
>>>> multiple NMI handlers registering on this call chain.
>>>>
>>>> This call chain dramatically lowers the NMI call frequency when high
>>>> frequency NMI tools are in use, notably the perf tools.  It is required
>>>> for NMI handlers that cannot sustain a high NMI call rate without
>>>> ramifications to the system operability.
>>>
>>> So how about we just turn off that warning instead? I don't remember the last time 
>>> it actually _helped_ us find any kernel or hardware bug - and it has caused tons 
>>> of problems...
>>
>> I can do that, with an even simpler patch...
>>
>>>
>>> It's not like we warn about excess regular IRQs either - we either handle them or 
>>> at most increase a counter somewhere. We could do the same for NMIs: introduce a 
>>> counter somewhere that counts the number of seemingly unhandled NMIs.
>>
>> Really "unknown" NMI errors are reported by either the "dazed and
>> confused" message or if the panic on unknown nmi is set, then the
>> system will panic.  So unknown NMI occurrences are already being
>> dealt with.
> 
> So I'd even remove the 'dazed and confused' message - has it ever helped us?

I can remove it though it seems to have become an institution, or more
correctly, a common term of reference. :)  It does precede the decision
to either attempt to continue system operation, or panic the system
immediately.

> If NMIs are generated but not handled properly then developers and users will 
> notice it just like when IRQs are lost: either through bad system behavior or via 
> weird stats in procfs. The kernel log should not get spammed.

Having some notice is probably a good thing even if for archaic reasons.
We recently discovered that an internal system error triggered an NMI
event.  Without any notice, the system would not have been suspected of
acting strangely, but data could potentially have been silently lost.
(NMI seems by far the least standard standard in the x86 architecture.)

Also, I don't think IRQs and NMIs are in the same league.  Missing an
IRQ means an expected I/O operation did not occur.  Prudent drivers can
set a timeout to notice missing interrupts.

Missing an NMI usually means that something unexpected occurred but was
not dealt with.  Losing perf interrupts is recoverable since there
will be another along shortly.  But missing an NMI due to a system
failure event is not.  (Why NMI is heavily overloaded, and not very
standardized.)

> 
> So if you could expose the lost NMI stats via procfs or debugfs then we could 
> remove both the warning and the dazed-and-confused spam on the system log.

I can add this.

> 
> This should make perf all around more usable on UV systems, right?

I'm not sure this is accurate.  Perf is currently very usable on UV.
But as we increase our online fault analysis procedures, this warning
message stood out as a glaring example of a false positive.  Note it
is not warning of anything except there is more than one NMI handler
registering on this "call after all other handlers have been called
and did not claim the NMI" chain.

So let me know if I should go ahead with the above (remove some or
all indications that an unclaimed NMI event occurred, and add a
reporting facility for NMI stats.)

Thanks!
Mike

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-03-08 15:19 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-06 18:17 [PATCH 0/2] Add and Use NMI LAST call chain to eliminate WARNING message Mike Travis
2017-03-06 18:17 ` [PATCH 1/2] x86/platform: Add a low priority low frequency NMI call chain Mike Travis
2017-03-07  7:42   ` Ingo Molnar
2017-03-07 15:22     ` Don Zickus
2017-03-07 16:00       ` Mike Travis
2017-03-07 16:07         ` Don Zickus
2017-03-07 16:13           ` Mike Travis
2017-03-07 15:38     ` Mike Travis
2017-03-08 10:28       ` Ingo Molnar
2017-03-08 15:17         ` Mike Travis
2017-03-06 18:17 ` [PATCH 2/2] x86/platform/uv: Use " Mike Travis

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.