* [V3][PATCH 0/6] x86, nmi: new NMI handling routines
@ 2011-08-25 16:45 Don Zickus
  2011-08-25 16:45 ` [V3][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Don Zickus @ 2011-08-25 16:45 UTC (permalink / raw)
  To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
  Cc: LKML, paulmck, Don Zickus

I had the pleasure of hosting Robert Richter this week at Red Hat.  One of
the issues he wanted to discuss was having the NMI handling routines
execute all the NMI handlers for each NMI, mainly for his AMD IBS work.
He also raised another good point: because of the way NMIs work, it is
possible to lose them if multiple NMIs happen at the same time.

As a result, we sat around and discussed how we could go about executing
all the NMI handlers for each NMI to ensure that we would not lose any events.

We decided the best way to do this would be to have the NMI handlers break
away from the notifier routines and create our own.  This would allow us to
execute all the handlers without hacking up the notifier stuff and easily
track the number of events processed at a higher level to deal with the new
problem of extra NMIs.

I spent some time hacking and came up with this patch.  I tested it on my
core2quad machine, trying to enable all the NMI handlers I could, mainly
perf and kgdb (and oprofile too when perf was disabled).  Everything seems
to work correctly.  If people are ok with this approach, I'll try to test
it on more machines.

More details about the patch are in the individual changelogs.

Don Zickus (6):
  x86, nmi: split out nmi from traps.c
  x86, nmi: create new NMI handler routines
  x86, nmi: wire up NMI handlers to new routines
  x86, nmi:  add in logic to handle multiple events and unknown NMIs
  x86, nmi: track NMI usage stats
  x86, nmi: print out NMI stats in /proc/interrupts

 arch/x86/include/asm/nmi.h              |   39 ++--
 arch/x86/include/asm/reboot.h           |    2 +-
 arch/x86/kernel/Makefile                |    2 +-
 arch/x86/kernel/apic/hw_nmi.c           |   27 +--
 arch/x86/kernel/apic/x2apic_uv_x.c      |   20 +--
 arch/x86/kernel/cpu/mcheck/mce-inject.c |   20 +-
 arch/x86/kernel/cpu/mcheck/mce.c        |    3 -
 arch/x86/kernel/cpu/perf_event.c        |   60 +----
 arch/x86/kernel/crash.c                 |    5 +-
 arch/x86/kernel/irq.c                   |    2 +
 arch/x86/kernel/kgdb.c                  |   60 ++++-
 arch/x86/kernel/nmi.c                   |  422 +++++++++++++++++++++++++++++++
 arch/x86/kernel/reboot.c                |   23 +--
 arch/x86/kernel/traps.c                 |  155 -----------
 arch/x86/oprofile/nmi_int.c             |   40 +--
 arch/x86/oprofile/nmi_timer_int.c       |   28 +--
 drivers/acpi/apei/ghes.c                |   22 +-
 drivers/char/ipmi/ipmi_watchdog.c       |   32 +--
 drivers/watchdog/hpwdt.c                |   23 +--
 19 files changed, 562 insertions(+), 423 deletions(-)
 create mode 100644 arch/x86/kernel/nmi.c

-- 
1.7.6


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [V3][PATCH 1/6] x86, nmi: split out nmi from traps.c
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
@ 2011-08-25 16:45 ` Don Zickus
  2011-08-25 16:45 ` [V3][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Don Zickus @ 2011-08-25 16:45 UTC (permalink / raw)
  To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
  Cc: LKML, paulmck, Don Zickus

The NMI code is changing a lot and gaining more functionality.  Split it
out from the traps.c file so it doesn't continue to pollute that file.

This makes it easier to find and expand all the future NMI-related work.

No real functional changes here.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/Makefile |    2 +-
 arch/x86/kernel/nmi.c    |  178 ++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/traps.c  |  155 ----------------------------------------
 3 files changed, 179 insertions(+), 156 deletions(-)
 create mode 100644 arch/x86/kernel/nmi.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 82f2912..8baca3c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -19,7 +19,7 @@ endif
 
 obj-y			:= process_$(BITS).o signal.o entry_$(BITS).o
 obj-y			+= traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
-obj-y			+= time.o ioport.o ldt.o dumpstack.o
+obj-y			+= time.o ioport.o ldt.o dumpstack.o nmi.o
 obj-y			+= setup.o x86_init.o i8259.o irqinit.o jump_label.o
 obj-$(CONFIG_IRQ_WORK)  += irq_work.o
 obj-y			+= probe_roms.o
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
new file mode 100644
index 0000000..68d758a
--- /dev/null
+++ b/arch/x86/kernel/nmi.c
@@ -0,0 +1,178 @@
+/*
+ *  Copyright (C) 1991, 1992  Linus Torvalds
+ *  Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
+ *
+ *  Pentium III FXSR, SSE support
+ *	Gareth Hughes <gareth@valinux.com>, May 2000
+ */
+
+/*
+ * Handle hardware traps and faults.
+ */
+#include <linux/spinlock.h>
+#include <linux/kprobes.h>
+#include <linux/kdebug.h>
+#include <linux/nmi.h>
+
+#if defined(CONFIG_EDAC)
+#include <linux/edac.h>
+#endif
+
+#include <linux/atomic.h>
+#include <asm/traps.h>
+#include <asm/mach_traps.h>
+
+static int ignore_nmis;
+
+int unknown_nmi_panic;
+/*
+ * Prevent NMI reason port (0x61) being accessed simultaneously, can
+ * only be used in NMI handler.
+ */
+static DEFINE_RAW_SPINLOCK(nmi_reason_lock);
+
+static int __init setup_unknown_nmi_panic(char *str)
+{
+	unknown_nmi_panic = 1;
+	return 1;
+}
+__setup("unknown_nmi_panic", setup_unknown_nmi_panic);
+
+static notrace __kprobes void
+pci_serr_error(unsigned char reason, struct pt_regs *regs)
+{
+	pr_emerg("NMI: PCI system error (SERR) for reason %02x on CPU %d.\n",
+		 reason, smp_processor_id());
+
+	/*
+	 * On some machines, PCI SERR line is used to report memory
+	 * errors. EDAC makes use of it.
+	 */
+#if defined(CONFIG_EDAC)
+	if (edac_handler_set()) {
+		edac_atomic_assert_error();
+		return;
+	}
+#endif
+
+	if (panic_on_unrecovered_nmi)
+		panic("NMI: Not continuing");
+
+	pr_emerg("Dazed and confused, but trying to continue\n");
+
+	/* Clear and disable the PCI SERR error line. */
+	reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_SERR;
+	outb(reason, NMI_REASON_PORT);
+}
+
+static notrace __kprobes void
+io_check_error(unsigned char reason, struct pt_regs *regs)
+{
+	unsigned long i;
+
+	pr_emerg(
+	"NMI: IOCK error (debug interrupt?) for reason %02x on CPU %d.\n",
+		 reason, smp_processor_id());
+	show_registers(regs);
+
+	if (panic_on_io_nmi)
+		panic("NMI IOCK error: Not continuing");
+
+	/* Re-enable the IOCK line, wait for a few seconds */
+	reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
+	outb(reason, NMI_REASON_PORT);
+
+	i = 20000;
+	while (--i) {
+		touch_nmi_watchdog();
+		udelay(100);
+	}
+
+	reason &= ~NMI_REASON_CLEAR_IOCHK;
+	outb(reason, NMI_REASON_PORT);
+}
+
+static notrace __kprobes void
+unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
+{
+	if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
+			NOTIFY_STOP)
+		return;
+#ifdef CONFIG_MCA
+	/*
+	 * Might actually be able to figure out what the guilty party
+	 * is:
+	 */
+	if (MCA_bus) {
+		mca_handle_nmi();
+		return;
+	}
+#endif
+	pr_emerg("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
+		 reason, smp_processor_id());
+
+	pr_emerg("Do you have a strange power saving mode enabled?\n");
+	if (unknown_nmi_panic || panic_on_unrecovered_nmi)
+		panic("NMI: Not continuing");
+
+	pr_emerg("Dazed and confused, but trying to continue\n");
+}
+
+static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
+{
+	unsigned char reason = 0;
+
+	/*
+	 * CPU-specific NMI must be processed before non-CPU-specific
+	 * NMI, otherwise we may lose it, because the CPU-specific
+	 * NMI can not be detected/processed on other CPUs.
+	 */
+	if (notify_die(DIE_NMI, "nmi", regs, 0, 2, SIGINT) == NOTIFY_STOP)
+		return;
+
+	/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
+	raw_spin_lock(&nmi_reason_lock);
+	reason = get_nmi_reason();
+
+	if (reason & NMI_REASON_MASK) {
+		if (reason & NMI_REASON_SERR)
+			pci_serr_error(reason, regs);
+		else if (reason & NMI_REASON_IOCHK)
+			io_check_error(reason, regs);
+#ifdef CONFIG_X86_32
+		/*
+		 * Reassert NMI in case it became active
+		 * meanwhile as it's edge-triggered:
+		 */
+		reassert_nmi();
+#endif
+		raw_spin_unlock(&nmi_reason_lock);
+		return;
+	}
+	raw_spin_unlock(&nmi_reason_lock);
+
+	unknown_nmi_error(reason, regs);
+}
+
+dotraplinkage notrace __kprobes void
+do_nmi(struct pt_regs *regs, long error_code)
+{
+	nmi_enter();
+
+	inc_irq_stat(__nmi_count);
+
+	if (!ignore_nmis)
+		default_do_nmi(regs);
+
+	nmi_exit();
+}
+
+void stop_nmi(void)
+{
+	ignore_nmis++;
+}
+
+void restart_nmi(void)
+{
+	ignore_nmis--;
+}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6913369..a8e3eb8 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -81,15 +81,6 @@ gate_desc idt_table[NR_VECTORS] __page_aligned_data = { { { { 0, 0 } } }, };
 DECLARE_BITMAP(used_vectors, NR_VECTORS);
 EXPORT_SYMBOL_GPL(used_vectors);
 
-static int ignore_nmis;
-
-int unknown_nmi_panic;
-/*
- * Prevent NMI reason port (0x61) being accessed simultaneously, can
- * only be used in NMI handler.
- */
-static DEFINE_RAW_SPINLOCK(nmi_reason_lock);
-
 static inline void conditional_sti(struct pt_regs *regs)
 {
 	if (regs->flags & X86_EFLAGS_IF)
@@ -307,152 +298,6 @@ gp_in_kernel:
 	die("general protection fault", regs, error_code);
 }
 
-static int __init setup_unknown_nmi_panic(char *str)
-{
-	unknown_nmi_panic = 1;
-	return 1;
-}
-__setup("unknown_nmi_panic", setup_unknown_nmi_panic);
-
-static notrace __kprobes void
-pci_serr_error(unsigned char reason, struct pt_regs *regs)
-{
-	pr_emerg("NMI: PCI system error (SERR) for reason %02x on CPU %d.\n",
-		 reason, smp_processor_id());
-
-	/*
-	 * On some machines, PCI SERR line is used to report memory
-	 * errors. EDAC makes use of it.
-	 */
-#if defined(CONFIG_EDAC)
-	if (edac_handler_set()) {
-		edac_atomic_assert_error();
-		return;
-	}
-#endif
-
-	if (panic_on_unrecovered_nmi)
-		panic("NMI: Not continuing");
-
-	pr_emerg("Dazed and confused, but trying to continue\n");
-
-	/* Clear and disable the PCI SERR error line. */
-	reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_SERR;
-	outb(reason, NMI_REASON_PORT);
-}
-
-static notrace __kprobes void
-io_check_error(unsigned char reason, struct pt_regs *regs)
-{
-	unsigned long i;
-
-	pr_emerg(
-	"NMI: IOCK error (debug interrupt?) for reason %02x on CPU %d.\n",
-		 reason, smp_processor_id());
-	show_registers(regs);
-
-	if (panic_on_io_nmi)
-		panic("NMI IOCK error: Not continuing");
-
-	/* Re-enable the IOCK line, wait for a few seconds */
-	reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
-	outb(reason, NMI_REASON_PORT);
-
-	i = 20000;
-	while (--i) {
-		touch_nmi_watchdog();
-		udelay(100);
-	}
-
-	reason &= ~NMI_REASON_CLEAR_IOCHK;
-	outb(reason, NMI_REASON_PORT);
-}
-
-static notrace __kprobes void
-unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
-{
-	if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
-			NOTIFY_STOP)
-		return;
-#ifdef CONFIG_MCA
-	/*
-	 * Might actually be able to figure out what the guilty party
-	 * is:
-	 */
-	if (MCA_bus) {
-		mca_handle_nmi();
-		return;
-	}
-#endif
-	pr_emerg("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
-		 reason, smp_processor_id());
-
-	pr_emerg("Do you have a strange power saving mode enabled?\n");
-	if (unknown_nmi_panic || panic_on_unrecovered_nmi)
-		panic("NMI: Not continuing");
-
-	pr_emerg("Dazed and confused, but trying to continue\n");
-}
-
-static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
-{
-	unsigned char reason = 0;
-
-	/*
-	 * CPU-specific NMI must be processed before non-CPU-specific
-	 * NMI, otherwise we may lose it, because the CPU-specific
-	 * NMI can not be detected/processed on other CPUs.
-	 */
-	if (notify_die(DIE_NMI, "nmi", regs, 0, 2, SIGINT) == NOTIFY_STOP)
-		return;
-
-	/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
-	raw_spin_lock(&nmi_reason_lock);
-	reason = get_nmi_reason();
-
-	if (reason & NMI_REASON_MASK) {
-		if (reason & NMI_REASON_SERR)
-			pci_serr_error(reason, regs);
-		else if (reason & NMI_REASON_IOCHK)
-			io_check_error(reason, regs);
-#ifdef CONFIG_X86_32
-		/*
-		 * Reassert NMI in case it became active
-		 * meanwhile as it's edge-triggered:
-		 */
-		reassert_nmi();
-#endif
-		raw_spin_unlock(&nmi_reason_lock);
-		return;
-	}
-	raw_spin_unlock(&nmi_reason_lock);
-
-	unknown_nmi_error(reason, regs);
-}
-
-dotraplinkage notrace __kprobes void
-do_nmi(struct pt_regs *regs, long error_code)
-{
-	nmi_enter();
-
-	inc_irq_stat(__nmi_count);
-
-	if (!ignore_nmis)
-		default_do_nmi(regs);
-
-	nmi_exit();
-}
-
-void stop_nmi(void)
-{
-	ignore_nmis++;
-}
-
-void restart_nmi(void)
-{
-	ignore_nmis--;
-}
-
 /* May run on IST stack. */
 dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
 {
-- 
1.7.6



* [V3][PATCH 2/6] x86, nmi: create new NMI handler routines
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
  2011-08-25 16:45 ` [V3][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
@ 2011-08-25 16:45 ` Don Zickus
  2011-08-26  9:44   ` Peter Zijlstra
  2011-09-06 10:08   ` Robert Richter
  2011-08-25 16:45 ` [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Don Zickus @ 2011-08-25 16:45 UTC (permalink / raw)
  To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
  Cc: LKML, paulmck, Don Zickus

The NMI handlers used to rely on the notifier infrastructure.  This worked
great until we wanted to support handling multiple events better.

One of the key ideas of the new NMI handling is to process _all_ the handlers
for each NMI.  The reason behind this switch is that NMIs are edge triggered.
If enough NMIs are triggered, some can be lost, because the cpu can
only latch at most one NMI (besides the one currently being processed).

In order to deal with this we have decided to process all the NMI handlers
for each NMI.  This allows the handlers to determine whether they received an
event or not (the ones that cannot determine this will be left to fend
for themselves on the unknown NMI list).

As a result of this change it is now possible to receive an extra NMI that
was destined for an already-processed event.  Because the event was handled
by the previous NMI, this NMI gets dropped and becomes an 'unknown' NMI.
This of course will cause printks that scare people.

However, we prefer having extra NMIs to losing NMIs, and as such have
developed a basic mechanism to catch most of them.  That will be
a later patch.

To accomplish this, I unhooked the NMI handlers from the notifier
routines and created a new mechanism loosely based on doIRQ.  The reason
is that the notifier routines have a couple of shortcomings.  First, we
couldn't guarantee all future NMI handlers used NOTIFY_OK instead of
NOTIFY_STOP.  Second, we couldn't keep track of the number of events being
handled in each routine (most handle only one; perf can handle more than one).
Third, I wanted to eventually display which NMI handlers are registered in
the system in /proc/interrupts to help see who is generating NMIs.

The patch below just implements the new infrastructure; it doesn't wire it up
yet (that is the next patch).  Its design is based on the doIRQ structs and the
atomic notifier routines.  So the RCU usage in the patch isn't entirely untested
(the notifier routines have soaked it), but it should be double-checked in
case I copied the code wrong.

V2:
  - use kstrdup to copy/allocate device name
  - fix-up _GPL oops

V3:
  - fix leak in register_nmi_handler error path
  - removed _raw annotations from rcu_dereference

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/include/asm/nmi.h |   19 ++++++
 arch/x86/kernel/nmi.c      |  149 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 4886a68..6d04b28 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -42,6 +42,25 @@ void arch_trigger_all_cpu_backtrace(void);
 #define NMI_LOCAL_NORMAL_PRIOR	(NMI_LOCAL_BIT | NMI_NORMAL_PRIOR)
 #define NMI_LOCAL_LOW_PRIOR	(NMI_LOCAL_BIT | NMI_LOW_PRIOR)
 
+#define NMI_FLAG_FIRST	1
+
+enum {
+	NMI_LOCAL=0,
+	NMI_UNKNOWN,
+	NMI_EXTERNAL,
+	NMI_MAX
+};
+
+#define NMI_DONE	0
+#define NMI_HANDLED	1
+
+typedef int (*nmi_handler_t)(unsigned int, struct pt_regs *);
+
+int register_nmi_handler(unsigned int, nmi_handler_t, unsigned long,
+			 const char *);
+
+void unregister_nmi_handler(unsigned int, const char *);
+
 void stop_nmi(void);
 void restart_nmi(void);
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 68d758a..e65295c 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -13,6 +13,9 @@
 #include <linux/kprobes.h>
 #include <linux/kdebug.h>
 #include <linux/nmi.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+#include <linux/slab.h>
 
 #if defined(CONFIG_EDAC)
 #include <linux/edac.h>
@@ -21,6 +24,28 @@
 #include <linux/atomic.h>
 #include <asm/traps.h>
 #include <asm/mach_traps.h>
+#include <asm/nmi.h>
+
+#define NMI_MAX_NAMELEN	16
+struct nmiaction {
+	struct nmiaction __rcu *next;
+	nmi_handler_t handler;
+	unsigned int flags;
+	char *name;
+};
+
+struct nmi_desc {
+	spinlock_t lock;
+	struct nmiaction __rcu *head;
+};
+
+static struct nmi_desc nmi_desc[NMI_MAX] = 
+{
+	{	.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[0].lock), },
+	{	.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[1].lock), },
+	{	.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[2].lock), },
+
+};
 
 static int ignore_nmis;
 
@@ -38,6 +63,130 @@ static int __init setup_unknown_nmi_panic(char *str)
 }
 __setup("unknown_nmi_panic", setup_unknown_nmi_panic);
 
+#define nmi_to_desc(type) (&nmi_desc[type])
+
+static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
+{
+	struct nmi_desc *desc = nmi_to_desc(type);
+	struct nmiaction *next_a, *a, **ap = &desc->head;
+	int handled=0;
+
+	rcu_read_lock();
+
+	a = rcu_dereference(*ap);
+
+	/*
+	 * NMIs are edge-triggered, which means if you have enough
+	 * of them concurrently, you can lose some because only one
+	 * can be latched at any given time.  Walk the whole list
+	 * to handle those situations.
+	 */
+	while (a) {
+		next_a = rcu_dereference(a->next);
+
+		handled += a->handler(type, regs);
+
+		a = next_a;
+	}
+	rcu_read_unlock();
+
+	/* return total number of NMI events handled */
+	return handled;
+}
+
+static int __setup_nmi(unsigned int type, struct nmiaction *action)
+{
+	struct nmi_desc *desc = nmi_to_desc(type);
+	struct nmiaction **a = &(desc->head);
+	unsigned long flags;
+
+	spin_lock_irqsave(&desc->lock, flags);
+
+	/*
+	 * some handlers need to be executed first otherwise a fake
+	 * event confuses some handlers (kdump uses this flag)
+	 */
+	if (!(action->flags & NMI_FLAG_FIRST))
+		while ((*a) != NULL)
+			a = &((*a)->next);
+	
+	action->next = *a;
+	rcu_assign_pointer(*a, action);
+
+	spin_unlock_irqrestore(&desc->lock, flags);
+	return 0;
+}
+
+static struct nmiaction *__free_nmi(unsigned int type, const char *name)
+{
+	struct nmi_desc *desc = nmi_to_desc(type);
+	struct nmiaction *n, **np = &(desc->head);
+	unsigned long flags;
+
+	spin_lock_irqsave(&desc->lock, flags);
+
+	while ((*np) != NULL) {
+		n = *np;
+
+		/*
+		 * the name passed in to describe the nmi handler
+		 * is used as the lookup key
+		 */
+		if (!strcmp(n->name, name)) {
+			WARN(in_nmi(),
+				"Trying to free NMI (%s) from NMI context!\n", n->name);
+			rcu_assign_pointer(*np, n->next);
+			break;
+		}
+		np = &(n->next);
+	}
+
+	spin_unlock_irqrestore(&desc->lock, flags);
+	synchronize_rcu();
+	return (*np);
+}
+
+int register_nmi_handler(unsigned int type, nmi_handler_t handler,
+			unsigned long nmiflags, const char *devname)
+{
+	struct nmiaction *action;
+	int retval;
+
+	if (!handler)
+		return -EINVAL;
+
+	action = kzalloc(sizeof(struct nmiaction), GFP_KERNEL);
+	if (!action)
+		return -ENOMEM;
+
+	action->handler = handler;
+	action->flags = nmiflags;
+	action->name = kstrndup(devname, NMI_MAX_NAMELEN, GFP_KERNEL);
+
+	retval = __setup_nmi(type, action);
+
+	if (retval) {
+		kfree(action->name);
+		kfree(action);
+	}
+	
+	return retval;
+}
+EXPORT_SYMBOL_GPL(register_nmi_handler);
+
+void unregister_nmi_handler(unsigned int type, const char *name)
+{
+	struct nmiaction *a;
+
+	a = __free_nmi(type, name);
+	if (a) {
+		kfree(a->name);
+		kfree(a);
+	}
+}
+
+EXPORT_SYMBOL_GPL(unregister_nmi_handler);
+
 static notrace __kprobes void
 pci_serr_error(unsigned char reason, struct pt_regs *regs)
 {
-- 
1.7.6



* [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
  2011-08-25 16:45 ` [V3][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
  2011-08-25 16:45 ` [V3][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
@ 2011-08-25 16:45 ` Don Zickus
  2011-09-06 16:15   ` Robert Richter
  2011-09-06 17:20   ` Corey Minyard
  2011-08-25 16:45 ` [V3][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Don Zickus @ 2011-08-25 16:45 UTC (permalink / raw)
  To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
  Cc: LKML, paulmck, Don Zickus, Jason Wessel, Andi Kleen,
	Corey Minyard, Jack Steiner

Just convert all the files that have an NMI handler to the new routines.
Most of it is straightforward conversion.  A couple of places needed some
tweaking, like kgdb, which separates the debug notifier from the NMI handler,
and mce, which removes a call to notify_die (as I couldn't figure out why it
was there).

The things that get converted are the registration/unregistration routines,
and the NMI handler itself, which has its args changed along with code removed
to check which list it is on (most are on one NMI list, except for kgdb,
which has both an NMI routine and an NMI unknown routine).

Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/include/asm/nmi.h              |   20 ----------
 arch/x86/include/asm/reboot.h           |    2 +-
 arch/x86/kernel/apic/hw_nmi.c           |   27 +++-----------
 arch/x86/kernel/apic/x2apic_uv_x.c      |   20 ++--------
 arch/x86/kernel/cpu/mcheck/mce-inject.c |   20 ++++-------
 arch/x86/kernel/cpu/mcheck/mce.c        |    3 --
 arch/x86/kernel/cpu/perf_event.c        |   60 +++----------------------------
 arch/x86/kernel/crash.c                 |    5 +--
 arch/x86/kernel/kgdb.c                  |   60 +++++++++++++++++++++++--------
 arch/x86/kernel/nmi.c                   |   11 ++++--
 arch/x86/kernel/reboot.c                |   23 ++++--------
 arch/x86/oprofile/nmi_int.c             |   40 ++++++--------------
 arch/x86/oprofile/nmi_timer_int.c       |   28 +++-----------
 drivers/acpi/apei/ghes.c                |   22 ++++-------
 drivers/char/ipmi/ipmi_watchdog.c       |   32 +++++-----------
 drivers/watchdog/hpwdt.c                |   23 +++---------
 16 files changed, 125 insertions(+), 271 deletions(-)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 6d04b28..fc74547 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -22,26 +22,6 @@ void arch_trigger_all_cpu_backtrace(void);
 #define arch_trigger_all_cpu_backtrace arch_trigger_all_cpu_backtrace
 #endif
 
-/*
- * Define some priorities for the nmi notifier call chain.
- *
- * Create a local nmi bit that has a higher priority than
- * external nmis, because the local ones are more frequent.
- *
- * Also setup some default high/normal/low settings for
- * subsystems to registers with.  Using 4 bits to separate
- * the priorities.  This can go a lot higher if needed be.
- */
-
-#define NMI_LOCAL_SHIFT		16	/* randomly picked */
-#define NMI_LOCAL_BIT		(1ULL << NMI_LOCAL_SHIFT)
-#define NMI_HIGH_PRIOR		(1ULL << 8)
-#define NMI_NORMAL_PRIOR	(1ULL << 4)
-#define NMI_LOW_PRIOR		(1ULL << 0)
-#define NMI_LOCAL_HIGH_PRIOR	(NMI_LOCAL_BIT | NMI_HIGH_PRIOR)
-#define NMI_LOCAL_NORMAL_PRIOR	(NMI_LOCAL_BIT | NMI_NORMAL_PRIOR)
-#define NMI_LOCAL_LOW_PRIOR	(NMI_LOCAL_BIT | NMI_LOW_PRIOR)
-
 #define NMI_FLAG_FIRST	1
 
 enum {
diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 3250e3d..92f29706 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -23,7 +23,7 @@ void machine_real_restart(unsigned int type);
 #define MRR_BIOS	0
 #define MRR_APM		1
 
-typedef void (*nmi_shootdown_cb)(int, struct die_args*);
+typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
 void nmi_shootdown_cpus(nmi_shootdown_cb callback);
 
 #endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index d5e57db0..31cb9ae 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -60,22 +60,10 @@ void arch_trigger_all_cpu_backtrace(void)
 }
 
 static int __kprobes
-arch_trigger_all_cpu_backtrace_handler(struct notifier_block *self,
-			 unsigned long cmd, void *__args)
+arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
 {
-	struct die_args *args = __args;
-	struct pt_regs *regs;
 	int cpu;
 
-	switch (cmd) {
-	case DIE_NMI:
-		break;
-
-	default:
-		return NOTIFY_DONE;
-	}
-
-	regs = args->regs;
 	cpu = smp_processor_id();
 
 	if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
@@ -86,21 +74,16 @@ arch_trigger_all_cpu_backtrace_handler(struct notifier_block *self,
 		show_regs(regs);
 		arch_spin_unlock(&lock);
 		cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
-		return NOTIFY_STOP;
+		return NMI_HANDLED;
 	}
 
-	return NOTIFY_DONE;
+	return NMI_DONE;
 }
 
-static __read_mostly struct notifier_block backtrace_notifier = {
-	.notifier_call          = arch_trigger_all_cpu_backtrace_handler,
-	.next                   = NULL,
-	.priority               = NMI_LOCAL_LOW_PRIOR,
-};
-
 static int __init register_trigger_all_cpu_backtrace(void)
 {
-	register_die_notifier(&backtrace_notifier);
+	register_nmi_handler(NMI_LOCAL, arch_trigger_all_cpu_backtrace_handler,
+				0, "arch_bt");
 	return 0;
 }
 early_initcall(register_trigger_all_cpu_backtrace);
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c
index adc66c3..88b0dbb 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -673,18 +673,11 @@ void __cpuinit uv_cpu_init(void)
 /*
  * When NMI is received, print a stack trace.
  */
-int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
+int uv_handle_nmi(unsigned int reason, struct pt_regs *regs)
 {
 	unsigned long real_uv_nmi;
 	int bid;
 
-	if (reason != DIE_NMIUNKNOWN)
-		return NOTIFY_OK;
-
-	if (in_crash_kexec)
-		/* do nothing if entering the crash kernel */
-		return NOTIFY_OK;
-
 	/*
 	 * Each blade has an MMR that indicates when an NMI has been sent
 	 * to cpus on the blade. If an NMI is detected, atomically
@@ -705,7 +698,7 @@ int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
 	}
 
 	if (likely(__get_cpu_var(cpu_last_nmi_count) == uv_blade_info[bid].nmi_count))
-		return NOTIFY_DONE;
+		return NMI_DONE;
 
 	__get_cpu_var(cpu_last_nmi_count) = uv_blade_info[bid].nmi_count;
 
@@ -718,17 +711,12 @@ int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
 	dump_stack();
 	spin_unlock(&uv_nmi_lock);
 
-	return NOTIFY_STOP;
+	return NMI_HANDLED;
 }
 
-static struct notifier_block uv_dump_stack_nmi_nb = {
-	.notifier_call	= uv_handle_nmi,
-	.priority = NMI_LOCAL_LOW_PRIOR - 1,
-};
-
 void uv_register_nmi_notifier(void)
 {
-	if (register_die_notifier(&uv_dump_stack_nmi_nb))
+	if (register_nmi_handler(NMI_UNKNOWN, uv_handle_nmi, 0, "uv"))
 		printk(KERN_WARNING "UV NMI handler failed to register\n");
 }
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index 0ed633c..6199232 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -78,27 +78,20 @@ static void raise_exception(struct mce *m, struct pt_regs *pregs)
 
 static cpumask_var_t mce_inject_cpumask;
 
-static int mce_raise_notify(struct notifier_block *self,
-			    unsigned long val, void *data)
+static int mce_raise_notify(unsigned int cmd, struct pt_regs *regs)
 {
-	struct die_args *args = (struct die_args *)data;
 	int cpu = smp_processor_id();
 	struct mce *m = &__get_cpu_var(injectm);
-	if (val != DIE_NMI || !cpumask_test_cpu(cpu, mce_inject_cpumask))
-		return NOTIFY_DONE;
+	if (!cpumask_test_cpu(cpu, mce_inject_cpumask))
+		return NMI_DONE;
 	cpumask_clear_cpu(cpu, mce_inject_cpumask);
 	if (m->inject_flags & MCJ_EXCEPTION)
-		raise_exception(m, args->regs);
+		raise_exception(m, regs);
 	else if (m->status)
 		raise_poll(m);
-	return NOTIFY_STOP;
+	return NMI_HANDLED;
 }
 
-static struct notifier_block mce_raise_nb = {
-	.notifier_call = mce_raise_notify,
-	.priority = NMI_LOCAL_NORMAL_PRIOR,
-};
-
 /* Inject mce on current CPU */
 static int raise_local(void)
 {
@@ -216,7 +209,8 @@ static int inject_init(void)
 		return -ENOMEM;
 	printk(KERN_INFO "Machine check injector initialized\n");
 	mce_chrdev_ops.write = mce_write;
-	register_die_notifier(&mce_raise_nb);
+	register_nmi_handler(NMI_LOCAL, mce_raise_notify, 0,
+				"mce_notify");
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 08363b0..3fc65b6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -908,9 +908,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 	percpu_inc(mce_exception_count);
 
-	if (notify_die(DIE_NMI, "machine check", regs, error_code,
-			   18, SIGKILL) == NOTIFY_STOP)
-		goto out;
 	if (!banks)
 		goto out;
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4ee3abf..767371f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1375,68 +1375,18 @@ struct pmu_nmi_state {
 static DEFINE_PER_CPU(struct pmu_nmi_state, pmu_nmi);
 
 static int __kprobes
-perf_event_nmi_handler(struct notifier_block *self,
-			 unsigned long cmd, void *__args)
+perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
-	struct die_args *args = __args;
-	unsigned int this_nmi;
 	int handled;
 
 	if (!atomic_read(&active_events))
-		return NOTIFY_DONE;
+		return NMI_DONE;
 
-	switch (cmd) {
-	case DIE_NMI:
-		break;
-	case DIE_NMIUNKNOWN:
-		this_nmi = percpu_read(irq_stat.__nmi_count);
-		if (this_nmi != __this_cpu_read(pmu_nmi.marked))
-			/* let the kernel handle the unknown nmi */
-			return NOTIFY_DONE;
-		/*
-		 * This one is a PMU back-to-back nmi. Two events
-		 * trigger 'simultaneously' raising two back-to-back
-		 * NMIs. If the first NMI handles both, the latter
-		 * will be empty and daze the CPU. So, we drop it to
-		 * avoid false-positive 'unknown nmi' messages.
-		 */
-		return NOTIFY_STOP;
-	default:
-		return NOTIFY_DONE;
-	}
-
-	handled = x86_pmu.handle_irq(args->regs);
-	if (!handled)
-		return NOTIFY_DONE;
-
-	this_nmi = percpu_read(irq_stat.__nmi_count);
-	if ((handled > 1) ||
-		/* the next nmi could be a back-to-back nmi */
-	    ((__this_cpu_read(pmu_nmi.marked) == this_nmi) &&
-	     (__this_cpu_read(pmu_nmi.handled) > 1))) {
-		/*
-		 * We could have two subsequent back-to-back nmis: The
-		 * first handles more than one counter, the 2nd
-		 * handles only one counter and the 3rd handles no
-		 * counter.
-		 *
-		 * This is the 2nd nmi because the previous was
-		 * handling more than one counter. We will mark the
-		 * next (3rd) and then drop it if unhandled.
-		 */
-		__this_cpu_write(pmu_nmi.marked, this_nmi + 1);
-		__this_cpu_write(pmu_nmi.handled, handled);
-	}
+	handled = x86_pmu.handle_irq(regs);
 
-	return NOTIFY_STOP;
+	return handled;
 }
 
-static __read_mostly struct notifier_block perf_event_nmi_notifier = {
-	.notifier_call		= perf_event_nmi_handler,
-	.next			= NULL,
-	.priority		= NMI_LOCAL_LOW_PRIOR,
-};
-
 static struct event_constraint unconstrained;
 static struct event_constraint emptyconstraint;
 
@@ -1557,7 +1507,7 @@ static int __init init_hw_perf_events(void)
 		((1LL << x86_pmu.num_counters_fixed)-1) << X86_PMC_IDX_FIXED;
 
 	perf_events_lapic_init();
-	register_die_notifier(&perf_event_nmi_notifier);
+	register_nmi_handler(NMI_LOCAL, perf_event_nmi_handler, 0, "PMI");
 
 	unconstrained = (struct event_constraint)
 		__EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1,
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 764c7c2..13ad899 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -32,15 +32,12 @@ int in_crash_kexec;
 
 #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
 
-static void kdump_nmi_callback(int cpu, struct die_args *args)
+static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 {
-	struct pt_regs *regs;
 #ifdef CONFIG_X86_32
 	struct pt_regs fixed_regs;
 #endif
 
-	regs = args->regs;
-
 #ifdef CONFIG_X86_32
 	if (!user_mode_vm(regs)) {
 		crash_fixup_ss_esp(&fixed_regs, regs);
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 00354d4..faba577 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -511,28 +511,37 @@ single_step_cont(struct pt_regs *regs, struct die_args *args)
 
 static int was_in_debug_nmi[NR_CPUS];
 
-static int __kgdb_notify(struct die_args *args, unsigned long cmd)
+static int kgdb_nmi_handler(unsigned int cmd, struct pt_regs *regs)
 {
-	struct pt_regs *regs = args->regs;
-
 	switch (cmd) {
-	case DIE_NMI:
+	case NMI_LOCAL:
 		if (atomic_read(&kgdb_active) != -1) {
 			/* KGDB CPU roundup */
 			kgdb_nmicallback(raw_smp_processor_id(), regs);
 			was_in_debug_nmi[raw_smp_processor_id()] = 1;
 			touch_nmi_watchdog();
-			return NOTIFY_STOP;
+			return NMI_HANDLED;
 		}
-		return NOTIFY_DONE;
+		break;
 
-	case DIE_NMIUNKNOWN:
+	case NMI_UNKNOWN:
 		if (was_in_debug_nmi[raw_smp_processor_id()]) {
 			was_in_debug_nmi[raw_smp_processor_id()] = 0;
-			return NOTIFY_STOP;
+			return NMI_HANDLED;
 		}
-		return NOTIFY_DONE;
+		break;
+	default:
+		/* do nothing */
+		break;
+	}
+	return NMI_DONE;
+}
+
+static int __kgdb_notify(struct die_args *args, unsigned long cmd)
+{
+	struct pt_regs *regs = args->regs;
 
+	switch (cmd) {
 	case DIE_DEBUG:
 		if (atomic_read(&kgdb_cpu_doing_single_step) != -1) {
 			if (user_mode(regs))
@@ -590,11 +599,6 @@ kgdb_notify(struct notifier_block *self, unsigned long cmd, void *ptr)
 
 static struct notifier_block kgdb_notifier = {
 	.notifier_call	= kgdb_notify,
-
-	/*
-	 * Lowest-prio notifier priority, we want to be notified last:
-	 */
-	.priority	= NMI_LOCAL_LOW_PRIOR,
 };
 
 /**
@@ -605,7 +609,31 @@ static struct notifier_block kgdb_notifier = {
  */
 int kgdb_arch_init(void)
 {
-	return register_die_notifier(&kgdb_notifier);
+	int retval;
+
+	retval = register_die_notifier(&kgdb_notifier);
+	if (retval)
+		goto out;
+
+	retval = register_nmi_handler(NMI_LOCAL, kgdb_nmi_handler,
+					0, "kgdb");
+	if (retval)
+		goto out1;
+
+	retval = register_nmi_handler(NMI_UNKNOWN, kgdb_nmi_handler,
+					0, "kgdb");
+
+	if (retval)
+		goto out2;
+
+	return retval;
+
+out2:
+	unregister_nmi_handler(NMI_LOCAL, "kgdb");
+out1:
+	unregister_die_notifier(&kgdb_notifier);
+out:
+	return retval;
 }
 
 static void kgdb_hw_overflow_handler(struct perf_event *event,
@@ -673,6 +701,8 @@ void kgdb_arch_exit(void)
 			breakinfo[i].pev = NULL;
 		}
 	}
+	unregister_nmi_handler(NMI_UNKNOWN, "kgdb");
+	unregister_nmi_handler(NMI_LOCAL, "kgdb");
 	unregister_die_notifier(&kgdb_notifier);
 }
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index e65295c..45fcd82 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -1,6 +1,7 @@
 /*
  *  Copyright (C) 1991, 1992  Linus Torvalds
  *  Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
+ *  Copyright (C) 2011	Don Zickus Red Hat, Inc.
  *
  *  Pentium III FXSR, SSE support
  *	Gareth Hughes <gareth@valinux.com>, May 2000
@@ -244,8 +245,10 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
 static notrace __kprobes void
 unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
 {
-	if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
-			NOTIFY_STOP)
+	int handled;
+
+	handled = nmi_handle(NMI_UNKNOWN, regs);
+	if (handled) 
 		return;
 #ifdef CONFIG_MCA
 	/*
@@ -270,13 +273,15 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
 static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 {
 	unsigned char reason = 0;
+	int handled;
 
 	/*
 	 * CPU-specific NMI must be processed before non-CPU-specific
 	 * NMI, otherwise we may lose it, because the CPU-specific
 	 * NMI can not be detected/processed on other CPUs.
 	 */
-	if (notify_die(DIE_NMI, "nmi", regs, 0, 2, SIGINT) == NOTIFY_STOP)
+	handled = nmi_handle(NMI_LOCAL, regs);
+	if (handled)
 		return;
 
 	/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 9242436..adab340 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -464,7 +464,7 @@ static inline void kb_wait(void)
 	}
 }
 
-static void vmxoff_nmi(int cpu, struct die_args *args)
+static void vmxoff_nmi(int cpu, struct pt_regs *regs)
 {
 	cpu_emergency_vmxoff();
 }
@@ -736,14 +736,10 @@ static nmi_shootdown_cb shootdown_callback;
 
 static atomic_t waiting_for_crash_ipi;
 
-static int crash_nmi_callback(struct notifier_block *self,
-			unsigned long val, void *data)
+static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
 {
 	int cpu;
 
-	if (val != DIE_NMI)
-		return NOTIFY_OK;
-
 	cpu = raw_smp_processor_id();
 
 	/* Don't do anything if this handler is invoked on crashing cpu.
@@ -751,10 +747,10 @@ static int crash_nmi_callback(struct notifier_block *self,
 	 * an NMI if system was initially booted with nmi_watchdog parameter.
 	 */
 	if (cpu == crashing_cpu)
-		return NOTIFY_STOP;
+		return NMI_HANDLED;
 	local_irq_disable();
 
-	shootdown_callback(cpu, (struct die_args *)data);
+	shootdown_callback(cpu, regs);
 
 	atomic_dec(&waiting_for_crash_ipi);
 	/* Assume hlt works */
@@ -762,7 +758,7 @@ static int crash_nmi_callback(struct notifier_block *self,
 	for (;;)
 		cpu_relax();
 
-	return 1;
+	return NMI_HANDLED;
 }
 
 static void smp_send_nmi_allbutself(void)
@@ -770,12 +766,6 @@ static void smp_send_nmi_allbutself(void)
 	apic->send_IPI_allbutself(NMI_VECTOR);
 }
 
-static struct notifier_block crash_nmi_nb = {
-	.notifier_call = crash_nmi_callback,
-	/* we want to be the first one called */
-	.priority = NMI_LOCAL_HIGH_PRIOR+1,
-};
-
 /* Halt all other CPUs, calling the specified function on each of them
  *
  * This function can be used to halt all other CPUs on crash
@@ -794,7 +784,8 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
 
 	atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
 	/* Would it be better to replace the trap vector here? */
-	if (register_die_notifier(&crash_nmi_nb))
+	if (register_nmi_handler(NMI_LOCAL, crash_nmi_callback, 
+				 NMI_FLAG_FIRST, "crash"))
 		return;		/* return what? */
 	/* Ensure the new callback function is set before sending
 	 * out the NMI
diff --git a/arch/x86/oprofile/nmi_int.c b/arch/x86/oprofile/nmi_int.c
index 68894fd..adf8fb3 100644
--- a/arch/x86/oprofile/nmi_int.c
+++ b/arch/x86/oprofile/nmi_int.c
@@ -61,26 +61,15 @@ u64 op_x86_get_ctrl(struct op_x86_model_spec const *model,
 }
 
 
-static int profile_exceptions_notify(struct notifier_block *self,
-				     unsigned long val, void *data)
+static int profile_exceptions_notify(unsigned int val, struct pt_regs *regs)
 {
-	struct die_args *args = (struct die_args *)data;
-	int ret = NOTIFY_DONE;
-
-	switch (val) {
-	case DIE_NMI:
-		if (ctr_running)
-			model->check_ctrs(args->regs, &__get_cpu_var(cpu_msrs));
-		else if (!nmi_enabled)
-			break;
-		else
-			model->stop(&__get_cpu_var(cpu_msrs));
-		ret = NOTIFY_STOP;
-		break;
-	default:
-		break;
-	}
-	return ret;
+	if (ctr_running)
+		model->check_ctrs(regs, &__get_cpu_var(cpu_msrs));
+	else if (!nmi_enabled)
+		return NMI_DONE;
+	else
+		model->stop(&__get_cpu_var(cpu_msrs));
+	return NMI_HANDLED;
 }
 
 static void nmi_cpu_save_registers(struct op_msrs *msrs)
@@ -363,12 +352,6 @@ static void nmi_cpu_setup(void *dummy)
 	apic_write(APIC_LVTPC, APIC_DM_NMI);
 }
 
-static struct notifier_block profile_exceptions_nb = {
-	.notifier_call = profile_exceptions_notify,
-	.next = NULL,
-	.priority = NMI_LOCAL_LOW_PRIOR,
-};
-
 static void nmi_cpu_restore_registers(struct op_msrs *msrs)
 {
 	struct op_msr *counters = msrs->counters;
@@ -508,7 +491,8 @@ static int nmi_setup(void)
 	ctr_running = 0;
 	/* make variables visible to the nmi handler: */
 	smp_mb();
-	err = register_die_notifier(&profile_exceptions_nb);
+	err = register_nmi_handler(NMI_LOCAL, profile_exceptions_notify,
+					0, "oprofile");
 	if (err)
 		goto fail;
 
@@ -538,7 +522,7 @@ static void nmi_shutdown(void)
 	put_online_cpus();
 	/* make variables visible to the nmi handler: */
 	smp_mb();
-	unregister_die_notifier(&profile_exceptions_nb);
+	unregister_nmi_handler(NMI_LOCAL, "oprofile");
 	msrs = &get_cpu_var(cpu_msrs);
 	model->shutdown(msrs);
 	free_msrs();
diff --git a/arch/x86/oprofile/nmi_timer_int.c b/arch/x86/oprofile/nmi_timer_int.c
index 720bf5a..7f8052c 100644
--- a/arch/x86/oprofile/nmi_timer_int.c
+++ b/arch/x86/oprofile/nmi_timer_int.c
@@ -18,32 +18,16 @@
 #include <asm/apic.h>
 #include <asm/ptrace.h>
 
-static int profile_timer_exceptions_notify(struct notifier_block *self,
-					   unsigned long val, void *data)
+static int profile_timer_exceptions_notify(unsigned int val, struct pt_regs *regs)
 {
-	struct die_args *args = (struct die_args *)data;
-	int ret = NOTIFY_DONE;
-
-	switch (val) {
-	case DIE_NMI:
-		oprofile_add_sample(args->regs, 0);
-		ret = NOTIFY_STOP;
-		break;
-	default:
-		break;
-	}
-	return ret;
+	oprofile_add_sample(regs, 0);
+	return NMI_HANDLED;
 }
 
-static struct notifier_block profile_timer_exceptions_nb = {
-	.notifier_call = profile_timer_exceptions_notify,
-	.next = NULL,
-	.priority = NMI_LOW_PRIOR,
-};
-
 static int timer_start(void)
 {
-	if (register_die_notifier(&profile_timer_exceptions_nb))
+	if (register_nmi_handler(NMI_LOCAL, profile_timer_exceptions_notify,
+					0, "oprofile-timer"))
 		return 1;
 	return 0;
 }
@@ -51,7 +35,7 @@ static int timer_start(void)
 
 static void timer_stop(void)
 {
-	unregister_die_notifier(&profile_timer_exceptions_nb);
+	unregister_nmi_handler(NMI_LOCAL, "oprofile-timer");
 	synchronize_sched();  /* Allow already-started NMIs to complete. */
 }
 
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 0784f99..b8e08cb 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -50,6 +50,7 @@
 #include <acpi/hed.h>
 #include <asm/mce.h>
 #include <asm/tlbflush.h>
+#include <asm/nmi.h>
 
 #include "apei-internal.h"
 
@@ -749,15 +750,11 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 	}
 }
 
-static int ghes_notify_nmi(struct notifier_block *this,
-				  unsigned long cmd, void *data)
+static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 {
 	struct ghes *ghes, *ghes_global = NULL;
 	int sev, sev_global = -1;
-	int ret = NOTIFY_DONE;
-
-	if (cmd != DIE_NMI)
-		return ret;
+	int ret = NMI_DONE;
 
 	raw_spin_lock(&ghes_nmi_lock);
 	list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
@@ -770,10 +767,10 @@ static int ghes_notify_nmi(struct notifier_block *this,
 			sev_global = sev;
 			ghes_global = ghes;
 		}
-		ret = NOTIFY_STOP;
+		ret = NMI_HANDLED;
 	}
 
-	if (ret == NOTIFY_DONE)
+	if (ret == NMI_DONE)
 		goto out;
 
 	if (sev_global >= GHES_SEV_PANIC) {
@@ -825,10 +822,6 @@ static struct notifier_block ghes_notifier_sci = {
 	.notifier_call = ghes_notify_sci,
 };
 
-static struct notifier_block ghes_notifier_nmi = {
-	.notifier_call = ghes_notify_nmi,
-};
-
 static unsigned long ghes_esource_prealloc_size(
 	const struct acpi_hest_generic *generic)
 {
@@ -918,7 +911,8 @@ static int __devinit ghes_probe(struct platform_device *ghes_dev)
 		ghes_estatus_pool_expand(len);
 		mutex_lock(&ghes_list_mutex);
 		if (list_empty(&ghes_nmi))
-			register_die_notifier(&ghes_notifier_nmi);
+			register_nmi_handler(NMI_LOCAL, ghes_notify_nmi, 0,
+						"ghes");
 		list_add_rcu(&ghes->list, &ghes_nmi);
 		mutex_unlock(&ghes_list_mutex);
 		break;
@@ -964,7 +958,7 @@ static int __devexit ghes_remove(struct platform_device *ghes_dev)
 		mutex_lock(&ghes_list_mutex);
 		list_del_rcu(&ghes->list);
 		if (list_empty(&ghes_nmi))
-			unregister_die_notifier(&ghes_notifier_nmi);
+			unregister_nmi_handler(NMI_LOCAL, "ghes");
 		mutex_unlock(&ghes_list_mutex);
 		/*
 		 * To synchronize with NMI handler, ghes can only be
diff --git a/drivers/char/ipmi/ipmi_watchdog.c b/drivers/char/ipmi/ipmi_watchdog.c
index 3302586..3dcab56 100644
--- a/drivers/char/ipmi/ipmi_watchdog.c
+++ b/drivers/char/ipmi/ipmi_watchdog.c
@@ -1077,17 +1077,8 @@ static void ipmi_unregister_watchdog(int ipmi_intf)
 
 #ifdef HAVE_DIE_NMI
 static int
-ipmi_nmi(struct notifier_block *self, unsigned long val, void *data)
+ipmi_nmi(unsigned int val, struct pt_regs *regs)
 {
-	struct die_args *args = data;
-
-	if (val != DIE_NMIUNKNOWN)
-		return NOTIFY_OK;
-
-	/* Hack, if it's a memory or I/O error, ignore it. */
-	if (args->err & 0xc0)
-		return NOTIFY_OK;
-
 	/*
 	 * If we get here, it's an NMI that's not a memory or I/O
 	 * error.  We can't truly tell if it's from IPMI or not
@@ -1097,15 +1088,15 @@ ipmi_nmi(struct notifier_block *self, unsigned long val, void *data)
 
 	if (testing_nmi) {
 		testing_nmi = 2;
-		return NOTIFY_STOP;
+		return NMI_HANDLED;
 	}
 
 	/* If we are not expecting a timeout, ignore it. */
 	if (ipmi_watchdog_state == WDOG_TIMEOUT_NONE)
-		return NOTIFY_OK;
+		return NMI_DONE;
 
 	if (preaction_val != WDOG_PRETIMEOUT_NMI)
-		return NOTIFY_OK;
+		return NMI_DONE;
 
 	/*
 	 * If no one else handled the NMI, we assume it was the IPMI
@@ -1120,12 +1111,8 @@ ipmi_nmi(struct notifier_block *self, unsigned long val, void *data)
 			panic(PFX "pre-timeout");
 	}
 
-	return NOTIFY_STOP;
+	return NMI_HANDLED;
 }
-
-static struct notifier_block ipmi_nmi_handler = {
-	.notifier_call = ipmi_nmi
-};
 #endif
 
 static int wdog_reboot_handler(struct notifier_block *this,
@@ -1290,7 +1277,8 @@ static void check_parms(void)
 		}
 	}
 	if (do_nmi && !nmi_handler_registered) {
-		rv = register_die_notifier(&ipmi_nmi_handler);
+		rv = register_nmi_handler(NMI_UNKNOWN, ipmi_nmi, 0,
+						"ipmi");
 		if (rv) {
 			printk(KERN_WARNING PFX
 			       "Can't register nmi handler\n");
@@ -1298,7 +1286,7 @@ static void check_parms(void)
 		} else
 			nmi_handler_registered = 1;
 	} else if (!do_nmi && nmi_handler_registered) {
-		unregister_die_notifier(&ipmi_nmi_handler);
+		unregister_nmi_handler(NMI_UNKNOWN, "ipmi");
 		nmi_handler_registered = 0;
 	}
 #endif
@@ -1336,7 +1324,7 @@ static int __init ipmi_wdog_init(void)
 	if (rv) {
 #ifdef HAVE_DIE_NMI
 		if (nmi_handler_registered)
-			unregister_die_notifier(&ipmi_nmi_handler);
+			unregister_nmi_handler(NMI_UNKNOWN, "ipmi");
 #endif
 		atomic_notifier_chain_unregister(&panic_notifier_list,
 						 &wdog_panic_notifier);
@@ -1357,7 +1345,7 @@ static void __exit ipmi_wdog_exit(void)
 
 #ifdef HAVE_DIE_NMI
 	if (nmi_handler_registered)
-		unregister_die_notifier(&ipmi_nmi_handler);
+		unregister_nmi_handler(NMI_UNKNOWN, "ipmi");
 #endif
 
 	atomic_notifier_chain_unregister(&panic_notifier_list,
diff --git a/drivers/watchdog/hpwdt.c b/drivers/watchdog/hpwdt.c
index 410fba4..e0f6202 100644
--- a/drivers/watchdog/hpwdt.c
+++ b/drivers/watchdog/hpwdt.c
@@ -477,15 +477,12 @@ static int hpwdt_time_left(void)
 /*
  *	NMI Handler
  */
-static int hpwdt_pretimeout(struct notifier_block *nb, unsigned long ulReason,
+static int hpwdt_pretimeout(unsigned int ulReason, struct pt_regs *regs,
 				void *data)
 {
 	unsigned long rom_pl;
 	static int die_nmi_called;
 
-	if (ulReason != DIE_NMIUNKNOWN)
-		goto out;
-
 	if (!hpwdt_nmi_decoding)
 		goto out;
 
@@ -507,7 +504,7 @@ static int hpwdt_pretimeout(struct notifier_block *nb, unsigned long ulReason,
 		"Management Log for details.\n");
 
 out:
-	return NOTIFY_OK;
+	return NMI_DONE;
 }
 #endif /* CONFIG_HPWDT_NMI_DECODING */
 
@@ -647,13 +644,6 @@ static struct miscdevice hpwdt_miscdev = {
 	.fops = &hpwdt_fops,
 };
 
-#ifdef CONFIG_HPWDT_NMI_DECODING
-static struct notifier_block die_notifier = {
-	.notifier_call = hpwdt_pretimeout,
-	.priority = 0,
-};
-#endif /* CONFIG_HPWDT_NMI_DECODING */
-
 /*
  *	Init & Exit
  */
@@ -739,10 +729,9 @@ static int __devinit hpwdt_init_nmi_decoding(struct pci_dev *dev)
 	 * die notify list to handle a critical NMI. The default is to
 	 * be last so other users of the NMI signal can function.
 	 */
-	if (priority)
-		die_notifier.priority = 0x7FFFFFFF;
-
-	retval = register_die_notifier(&die_notifier);
+	retval = register_nmi_handler(NMI_UNKNOWN, hpwdt_pretimeout,
+					(priority) ? NMI_HANDLE_FIRST : 0,
+					"hpwdt");
 	if (retval != 0) {
 		dev_warn(&dev->dev,
 			"Unable to register a die notifier (err=%d).\n",
@@ -762,7 +751,7 @@ static int __devinit hpwdt_init_nmi_decoding(struct pci_dev *dev)
 
 static void hpwdt_exit_nmi_decoding(void)
 {
-	unregister_die_notifier(&die_notifier);
+	unregister_nmi_handler(NMI_UNKNOWN, "hpwdt");
 	if (cru_rom_addr)
 		iounmap(cru_rom_addr);
 }
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [V3][PATCH 4/6] x86, nmi:  add in logic to handle multiple events and unknown NMIs
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
                   ` (2 preceding siblings ...)
  2011-08-25 16:45 ` [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
@ 2011-08-25 16:45 ` Don Zickus
  2011-09-06 16:18   ` Robert Richter
  2011-08-25 16:45 ` [V3][PATCH 5/6] x86, nmi: track NMI usage stats Don Zickus
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Don Zickus @ 2011-08-25 16:45 UTC (permalink / raw)
  To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
  Cc: LKML, paulmck, Don Zickus

Previous patches allow the NMI subsystem to process multiple NMI events
in one NMI.  As previously discussed, this can cause issues when an event
triggers another NMI but is processed in the current NMI.  This causes the
next NMI to go unprocessed and become an 'unknown' NMI.

Printing these as 'unknown' NMIs on the console would be inaccurate and
would scare users.  As a result I have copied the 'skip unknown' NMI logic
developed by Robert Richter (and simplified a little, because we can
track _all_ NMIs better instead of just the perf ones) to the main
NMI handling routine.

It is fairly simple: when processing an NMI, if the nmi_handle routine
reports more than one event handled, set a flag for future use.  The flag
just says there might be a possible unknown NMI.  If an unknown NMI does
come in, it is skipped (swallowed).  Otherwise the flag is cleared on the
next NMI that has events processed.

The algorithm isn't 100% accurate, but for most loads we have seen with perf
it captures a large majority of unknown NMIs.  Under high workloads there is
still a chance that unknown NMIs can trigger, because the timing can work out
such that NMIs are generated as fast as they can be processed, and four or
five NMIs can go by before the unknown NMI shows up.

Without involving the concept of time when tracking these 'possible' NMIs,
we may never be 100% reliable.  The idea is that back-to-back NMIs
immediately follow each other; anything more than a microsecond or so on
modern machines between when the first NMI finishes and when the second one
starts probably indicates a completely new event.

V2:
  - forgot to add the 'read' code for swallow_nmi (went into next patch)

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/nmi.c |   29 +++++++++++++++++++++++++++--
 1 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 45fcd82..435dc71 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -270,6 +270,8 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
 	pr_emerg("Dazed and confused, but trying to continue\n");
 }
 
+DEFINE_PER_CPU(bool, swallow_nmi);
+
 static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 {
 	unsigned char reason = 0;
@@ -281,8 +283,28 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 	 * NMI can not be detected/processed on other CPUs.
 	 */
 	handled = nmi_handle(NMI_LOCAL, regs);
-	if (handled)
+	if (handled) {
+		/*
+		 * When handling multiple NMI events, we are not
+		 * sure if the second NMI was dropped (because of
+		 * too many NMIs), piggy-backed on the same NMI
+		 * (perf) or is queued right behind this NMI.
+		 * In the last case, we may accidentally get an
+		 * unknown NMI because the event is already handled.
+		 * Flag for this condition and swallow it later.
+		 *
+		 * FIXME: This detection has holes in it mainly
+		 * because we can't tell _when_ the next NMI comes
+		 * in.  A multi-handled NMI event followed by an 
+		 * unknown NMI a second later, clearly should not
+		 * be swallowed.
+		 */
+		if (handled > 1)
+			__this_cpu_write(swallow_nmi, true);
+		else
+			__this_cpu_write(swallow_nmi, false);
 		return;
+	}
 
 	/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
 	raw_spin_lock(&nmi_reason_lock);
@@ -305,7 +327,10 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 	}
 	raw_spin_unlock(&nmi_reason_lock);
 
-	unknown_nmi_error(reason, regs);
+	if (!__this_cpu_read(swallow_nmi))
+		unknown_nmi_error(reason, regs);
+
+	__this_cpu_write(swallow_nmi, false);
 }
 
 dotraplinkage notrace __kprobes void
-- 
1.7.6



* [V3][PATCH 5/6] x86, nmi: track NMI usage stats
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
                   ` (3 preceding siblings ...)
  2011-08-25 16:45 ` [V3][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
@ 2011-08-25 16:45 ` Don Zickus
  2011-08-25 16:45 ` [V3][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts Don Zickus
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Don Zickus @ 2011-08-25 16:45 UTC (permalink / raw)
  To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
  Cc: LKML, paulmck, Don Zickus

Now that the NMI handlers are broken into lists, increment the appropriate
stats for each list.  This allows us to see what is going on when the stats
get printed out in the next patch.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/nmi.c |   20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 435dc71..4b32216 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -48,6 +48,15 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
 
 };
 
+struct nmi_stats {
+	unsigned int normal;
+	unsigned int unknown;
+	unsigned int external;
+	unsigned int swallow;
+};
+
+static DEFINE_PER_CPU(struct nmi_stats, nmi_stats);
+
 static int ignore_nmis;
 
 int unknown_nmi_panic;
@@ -248,8 +257,13 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
 	int handled;
 
 	handled = nmi_handle(NMI_UNKNOWN, regs);
-	if (handled) 
+	if (handled) {
+		__this_cpu_add(nmi_stats.unknown, handled);
 		return;
+	}
+
+	__this_cpu_add(nmi_stats.unknown, 1);
+
 #ifdef CONFIG_MCA
 	/*
 	 * Might actually be able to figure out what the guilty party
@@ -283,6 +297,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 	 * NMI can not be detected/processed on other CPUs.
 	 */
 	handled = nmi_handle(NMI_LOCAL, regs);
+	__this_cpu_add(nmi_stats.normal, handled);
 	if (handled) {
 		/*
 		 * When handling multiple NMI events, we are not
@@ -322,6 +337,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 		 */
 		reassert_nmi();
 #endif
+		__this_cpu_add(nmi_stats.external, 1);
 		raw_spin_unlock(&nmi_reason_lock);
 		return;
 	}
@@ -329,6 +345,8 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
 
 	if (!__this_cpu_read(swallow_nmi))
 		unknown_nmi_error(reason, regs);
+	else
+		__this_cpu_add(nmi_stats.swallow, 1);
 
 	__this_cpu_write(swallow_nmi, false);
 }
-- 
1.7.6



* [V3][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
                   ` (4 preceding siblings ...)
  2011-08-25 16:45 ` [V3][PATCH 5/6] x86, nmi: track NMI usage stats Don Zickus
@ 2011-08-25 16:45 ` Don Zickus
  2011-09-06 16:39   ` Robert Richter
  2011-08-26  9:44 ` [V3][PATCH 0/6] x86, nmi: new NMI handling routines Peter Zijlstra
  2011-09-06 16:43 ` Robert Richter
  7 siblings, 1 reply; 24+ messages in thread
From: Don Zickus @ 2011-08-25 16:45 UTC (permalink / raw)
  To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
  Cc: LKML, paulmck, Don Zickus

This is a cheap hack to add the stats to the middle of /proc/interrupts.
It is more of a conversation starter than anything, as I am not sure of
the right labels and placement for this output.

The benefit of these stats is a better breakdown of which list the NMIs
get handled in: a normal handler, unknown, or external.  The output also
lists the number of unknown NMIs swallowed, to help check for false
positives.  Another benefit is the ability to actually see which
NMI handlers are currently registered in the system.

The output of 'cat /proc/interrupts' will look like this:

<snip>
 58:        275          0        864          0   PCI-MSI-edge      eth0
NMI:       4161       4155        158       4194   Non-maskable interrupts
SWA:          0          0          0          0   Unknown NMIs swallowed
  0:       4161       4155        158       4194   NMI  PMI, arch_bt
UNK:          0          0          0          0   NMI
EXT:          0          0          0          0   NMI
LOC:      12653      13304      13974      12926   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          6          6          5          6   Performance monitoring interrupts
IWI:          0          0          0          0   IRQ work interrupts
RES:       1839       1897       1821       1854   Rescheduling interrupts
CAL:        524       2714        392        331   Function call interrupts
TLB:        217        146        593        576   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:          1          1          1          1   Machine check polls
ERR:          0
MIS:          0

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/include/asm/nmi.h |    2 +
 arch/x86/kernel/irq.c      |    2 +
 arch/x86/kernel/nmi.c      |   47 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index fc74547..a4f1945 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -24,6 +24,8 @@ void arch_trigger_all_cpu_backtrace(void);
 
 #define NMI_FLAG_FIRST	1
 
+void arch_show_nmi(struct seq_file *p, int prec);
+
 enum {
 	NMI_LOCAL=0,
 	NMI_UNKNOWN,
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 6c0802e..44d1cac 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -16,6 +16,7 @@
 #include <asm/idle.h>
 #include <asm/mce.h>
 #include <asm/hw_irq.h>
+#include <asm/nmi.h>
 
 atomic_t irq_err_count;
 
@@ -55,6 +56,7 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
 	seq_printf(p, "  Non-maskable interrupts\n");
+	arch_show_nmi(p, prec);
 #ifdef CONFIG_X86_LOCAL_APIC
 	seq_printf(p, "%*s: ", prec, "LOC");
 	for_each_online_cpu(j)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4b32216..a3f9d0b 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -373,3 +373,50 @@ void restart_nmi(void)
 {
 	ignore_nmis--;
 }
+
+void arch_show_nmi(struct seq_file *p, int prec)
+{
+	int j;
+	struct nmiaction *action;
+
+	seq_printf(p, "%*s: ", prec, "SWA");
+        for_each_online_cpu(j)
+                seq_printf(p, "%10u ", per_cpu(nmi_stats.swallow, j));
+        seq_printf(p, "  Unknown NMIs swallowed\n");
+
+	seq_printf(p, "%*s: ", prec, "  0");
+        for_each_online_cpu(j)
+                seq_printf(p, "%10u ", per_cpu(nmi_stats.normal, j));
+        seq_printf(p, "  NMI");
+	action = (nmi_to_desc(NMI_LOCAL))->head;
+	if (action) {
+		seq_printf(p, "\t%s", action->name);
+		while ((action = action->next) != NULL)
+			seq_printf(p, ", %s", action->name);
+	}
+	seq_putc(p, '\n');
+
+	seq_printf(p, "%*s: ", prec, "UNK");
+        for_each_online_cpu(j)
+                seq_printf(p, "%10u ", per_cpu(nmi_stats.unknown, j));
+        seq_printf(p, "  NMI");
+	action = (nmi_to_desc(NMI_UNKNOWN))->head;
+	if (action) {
+		seq_printf(p, "\t%s", action->name);
+		while ((action = action->next) != NULL)
+			seq_printf(p, ", %s", action->name);
+	}
+	seq_putc(p, '\n');
+
+	seq_printf(p, "%*s: ", prec, "EXT");
+        for_each_online_cpu(j)
+                seq_printf(p, "%10u ", per_cpu(nmi_stats.external, j));
+        seq_printf(p, "  NMI");
+	action = (nmi_to_desc(NMI_EXTERNAL))->head;
+	if (action) {
+		seq_printf(p, "\t%s", action->name);
+		while ((action = action->next) != NULL)
+			seq_printf(p, ", %s", action->name);
+	}
+	seq_putc(p, '\n');
+}
-- 
1.7.6



* Re: [V3][PATCH 2/6] x86, nmi: create new NMI handler routines
  2011-08-25 16:45 ` [V3][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
@ 2011-08-26  9:44   ` Peter Zijlstra
  2011-08-26 14:21     ` Don Zickus
  2011-09-06 10:08   ` Robert Richter
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2011-08-26  9:44 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, Robert Richter, ying.huang, LKML, paulmck

On Thu, 2011-08-25 at 12:45 -0400, Don Zickus wrote:
> +static int __setup_nmi(unsigned int type, struct nmiaction *action)
> +{
> +       struct nmi_desc *desc = nmi_to_desc(type);
> +       struct nmiaction **a = &(desc->head);
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&desc->lock, flags);
> +
> +       /*
> +        * some handlers need to be executed first otherwise a fake
> +        * event confuses some handlers (kdump uses this flag)
> +        */
> +       if (!(action->flags & NMI_FLAG_FIRST))
> +               while ((*a) != NULL)
> +                       a = &((*a)->next);
> +       
> +       action->next = *a;
> +       rcu_assign_pointer(*a, action);
> +
> +       spin_unlock_irqrestore(&desc->lock, flags);
> +       return 0;
> +}
> +
> +static struct nmiaction *__free_nmi(unsigned int type, const char *name)
> +{
> +       struct nmi_desc *desc = nmi_to_desc(type);
> +       struct nmiaction *n, **np = &(desc->head);
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&desc->lock, flags);
> +
> +       while ((*np) != NULL) {
> +               n = *np;
> +
> +               /*
> +                * the name passed in to describe the nmi handler
> +                * is used as the lookup key
> +                */
> +               if (!strcmp(n->name, name)) {
> +                       WARN(in_nmi(),
> +                               "Trying to free NMI (%s) from NMI context!\n", n->name);
> +                       rcu_assign_pointer(*np, n->next);
> +                       break;
> +               }
> +               np = &(n->next);
> +       }
> +
> +       spin_unlock_irqrestore(&desc->lock, flags);
> +       synchronize_rcu();
> +       return (*np);
> +} 

Probably not worth fixing, but if someone was silly enough to register
with an already existing name we're up shit creek, right? :-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 0/6] x86, nmi: new NMI handling routines
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
                   ` (5 preceding siblings ...)
  2011-08-25 16:45 ` [V3][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts Don Zickus
@ 2011-08-26  9:44 ` Peter Zijlstra
  2011-08-26 14:39   ` Don Zickus
  2011-09-06 16:43 ` Robert Richter
  7 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2011-08-26  9:44 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, Robert Richter, ying.huang, LKML, paulmck

On Thu, 2011-08-25 at 12:45 -0400, Don Zickus wrote:
> I spent some time hacking and came up with this patch.  I tested it on my
> core2quad machine trying to enable all the NMI handler I could, mainly
> perf and kgdb (and oprofile too when perf was disabled).  Everything seems
> to work correctly.  If people are ok with this approach, I'll try and test
> this on more machines. 

Right, code looks OK, the only worry that remains is overhead, always
running all handlers must cost..

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 2/6] x86, nmi: create new NMI handler routines
  2011-08-26  9:44   ` Peter Zijlstra
@ 2011-08-26 14:21     ` Don Zickus
  0 siblings, 0 replies; 24+ messages in thread
From: Don Zickus @ 2011-08-26 14:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: x86, Andi Kleen, Robert Richter, ying.huang, LKML, paulmck

On Fri, Aug 26, 2011 at 11:44:12AM +0200, Peter Zijlstra wrote:
> On Thu, 2011-08-25 at 12:45 -0400, Don Zickus wrote:
> > +static int __setup_nmi(unsigned int type, struct nmiaction *action)
> > +{
> > +       struct nmi_desc *desc = nmi_to_desc(type);
> > +       struct nmiaction **a = &(desc->head);
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&desc->lock, flags);
> > +
> > +       /*
> > +        * some handlers need to be executed first otherwise a fake
> > +        * event confuses some handlers (kdump uses this flag)
> > +        */
> > +       if (!(action->flags & NMI_FLAG_FIRST))
> > +               while ((*a) != NULL)
> > +                       a = &((*a)->next);
> > +       
> > +       action->next = *a;
> > +       rcu_assign_pointer(*a, action);
> > +
> > +       spin_unlock_irqrestore(&desc->lock, flags);
> > +       return 0;
> > +}
> > +
> > +static struct nmiaction *__free_nmi(unsigned int type, const char *name)
> > +{
> > +       struct nmi_desc *desc = nmi_to_desc(type);
> > +       struct nmiaction *n, **np = &(desc->head);
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&desc->lock, flags);
> > +
> > +       while ((*np) != NULL) {
> > +               n = *np;
> > +
> > +               /*
> > +                * the name passed in to describe the nmi handler
> > +                * is used as the lookup key
> > +                */
> > +               if (!strcmp(n->name, name)) {
> > +                       WARN(in_nmi(),
> > +                               "Trying to free NMI (%s) from NMI context!\n", n->name);
> > +                       rcu_assign_pointer(*np, n->next);
> > +                       break;
> > +               }
> > +               np = &(n->next);
> > +       }
> > +
> > +       spin_unlock_irqrestore(&desc->lock, flags);
> > +       synchronize_rcu();
> > +       return (*np);
> > +} 
> 
> Probably not worth fixing, but if someone was silly enough to register
> with an already existing name we're up shit creek, right? :-)

Hmm, I thought I put a check in there during registration to prevent that.
Guess not.  But yeah, I thought about doing that just to have one less
thing bite me later.

Cheers,
Don
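The registration-time duplicate check being discussed could look roughly like this; a userspace sketch that mirrors __setup_nmi's list walk (struct fields trimmed, locking and RCU elided, so this is an illustration rather than the kernel code):

```c
#include <string.h>
#include <errno.h>

struct nmiaction {
	struct nmiaction *next;
	const char *name;
};

static struct nmiaction *head;

/* Illustrative analogue of __setup_nmi(): walk the chain looking for a
 * duplicate name before linking the new action at the tail.  Since the
 * name is the lookup key in __free_nmi(), a second registration under
 * the same name has to be rejected. */
static int setup_nmi(struct nmiaction *action)
{
	struct nmiaction **a = &head;

	while (*a != NULL) {
		if (!strcmp((*a)->name, action->name))
			return -EINVAL;	/* name already taken */
		a = &(*a)->next;
	}

	action->next = NULL;
	*a = action;
	return 0;
}
```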

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 0/6] x86, nmi: new NMI handling routines
  2011-08-26  9:44 ` [V3][PATCH 0/6] x86, nmi: new NMI handling routines Peter Zijlstra
@ 2011-08-26 14:39   ` Don Zickus
  2011-08-26 14:51     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Don Zickus @ 2011-08-26 14:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: x86, Andi Kleen, Robert Richter, ying.huang, LKML, paulmck

On Fri, Aug 26, 2011 at 11:44:57AM +0200, Peter Zijlstra wrote:
> On Thu, 2011-08-25 at 12:45 -0400, Don Zickus wrote:
> > I spent some time hacking and came up with this patch.  I tested it on my
> > core2quad machine trying to enable all the NMI handler I could, mainly
> > perf and kgdb (and oprofile too when perf was disabled).  Everything seems
> > to work correctly.  If people are ok with this approach, I'll try and test
> > this on more machines. 
> 
> Right, code looks OK, the only worry that remains is overhead, always
> running all handlers must cost..

Yeah, nothing is free.  My only counter-argument is that I removed the
case statements in the handlers, so it speeds things up a tiny bit.  Also,
most machines only seem to have perf and the arch_backtrace handler
registered, with modern Intel boxes probably registering the ghes handler too.

There really isn't much there, at least currently.  I would break up the
handler more if I knew a quicker way to distinguish between something like
a self-IPI NMI vs. an on-chip NMI like perf.  Then again those NMIs
probably aren't latched differently unlike the external one sitting in the
IOAPIC(??).

Cheers,
Don
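The overhead in question comes from the new dispatch loop: every handler on the chain runs on every NMI and the return values are summed, instead of stopping at the first NOTIFY_STOP the way a notifier chain would. A minimal userspace analogue of that loop (the sample handlers below are made up for the demo):

```c
#include <stddef.h>

typedef int (*nmi_handler_t)(unsigned int type, void *regs);

struct nmiaction {
	struct nmiaction *next;
	nmi_handler_t handler;
};

/* Analogue of nmi_handle(): call every handler on the chain and sum up
 * how many events were handled, so the caller can tell whether zero,
 * one, or several events were claimed for this one NMI. */
static int nmi_handle(struct nmiaction *head, unsigned int type, void *regs)
{
	struct nmiaction *a;
	int handled = 0;

	for (a = head; a != NULL; a = a->next)
		handled += a->handler(type, regs);

	return handled;
}

/* hypothetical sample handlers for the demo below */
static int perf_like(unsigned int type, void *regs)      { return 1; }
static int quiet_handler(unsigned int type, void *regs)  { return 0; }
```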

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 0/6] x86, nmi: new NMI handling routines
  2011-08-26 14:39   ` Don Zickus
@ 2011-08-26 14:51     ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2011-08-26 14:51 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, Robert Richter, ying.huang, LKML, paulmck

On Fri, 2011-08-26 at 10:39 -0400, Don Zickus wrote:
> On Fri, Aug 26, 2011 at 11:44:57AM +0200, Peter Zijlstra wrote:
> > On Thu, 2011-08-25 at 12:45 -0400, Don Zickus wrote:
> > > I spent some time hacking and came up with this patch.  I tested it on my
> > > core2quad machine trying to enable all the NMI handler I could, mainly
> > > perf and kgdb (and oprofile too when perf was disabled).  Everything seems
> > > to work correctly.  If people are ok with this approach, I'll try and test
> > > this on more machines. 
> > 
> > Right, code looks OK, the only worry that remains is overhead, always
> > running all handlers must cost..
> 
> Yeah nothing is free.  My only counter argument is I removed the case
> statements in the handlers, so it speeds things up a tiny bit.  Also most
> machines only seem to have perf and the arch_backtrace handler registered,
> with modern intel boxes probably registering the ghes handler too.
> 
> There really isn't much there, at least currently.  I would break up the
> handler more if I knew a quicker way to distinguish between something like
> a self-IPI NMI vs. an on-chip NMI like perf.  Then again those NMIs
> probably aren't latched differently unlike the external one sitting in the
> IOAPIC(??).

Yeah, no clue really.. I still need to read up on those hardware specs
(scarce as they are). 

As it stands I think we don't have much choice in this and your proposed
solution is pretty much it, I mean we have a shared edge interrupt and
no sane way to tell who all triggered stuff. 

Short of locking all the hardware dudes in a room and not letting them
out until they fix that is ;-)

Anyway, aside from 6/6 which wants more comments (personally I think I
like your /proc/nmis suggestion best, leaving the single NMI line
in /proc/interrupts) I'm fine with these patches.

Thomas any opinions?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 2/6] x86, nmi: create new NMI handler routines
  2011-08-25 16:45 ` [V3][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
  2011-08-26  9:44   ` Peter Zijlstra
@ 2011-09-06 10:08   ` Robert Richter
  2011-09-06 16:53     ` Don Zickus
  1 sibling, 1 reply; 24+ messages in thread
From: Robert Richter @ 2011-09-06 10:08 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck

On 25.08.11 12:45:44, Don Zickus wrote:
> +int register_nmi_handler(unsigned int type, nmi_handler_t handler,
> +			unsigned long nmiflags, const char *devname)
> +{
> +	struct nmiaction *action;
> +	int retval;
> +
> +	if (!handler)
> +		return -EINVAL;
> +
> +	action = kzalloc(sizeof(struct nmiaction), GFP_KERNEL);
> +	if (!action)
> +		return -ENOMEM;
> +
> +	action->handler = handler;
> +	action->flags = nmiflags;
> +	action->name = kstrndup(devname, NMI_MAX_NAMELEN, GFP_KERNEL);

Null pointer check is missing here.

> +
> +	retval = __setup_nmi(type, action);
> +
> +	if (retval) {
> +		kfree(action->name);
> +		kfree(action);
> +	}
> +	
> +	return retval;
> +}
> +EXPORT_SYMBOL_GPL(register_nmi_handler);

-- 
Advanced Micro Devices, Inc.
Operating System Research Center


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
  2011-08-25 16:45 ` [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
@ 2011-09-06 16:15   ` Robert Richter
  2011-09-06 16:52     ` Don Zickus
  2011-09-06 17:20   ` Corey Minyard
  1 sibling, 1 reply; 24+ messages in thread
From: Robert Richter @ 2011-09-06 16:15 UTC (permalink / raw)
  To: Don Zickus
  Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck,
	Jason Wessel, Andi Kleen, Corey Minyard, Jack Steiner,
	Borislav Petkov, Tony Luck

On 25.08.11 12:45:45, Don Zickus wrote:
> Just convert all the files that have an nmi handler to the new routines.
> Most of it is straightforward conversion.  A couple of places needed some
> tweaking like kgdb which separates the debug notifier from the nmi handler
> and mce removes a call to notify_die (as I couldn't figure out why it was
> there).

For mce, see my comment below.

> 
> The things that get converted are the registration/unregistration routines
> and the nmi handler itself has its args changed along with code removal
> to check which list it is on (most are on one NMI list except for kgdb
> which has both an NMI routine and an NMI Unknown routine).
> 
> Cc: Jason Wessel <jason.wessel@windriver.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Robert Richter <robert.richter@amd.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Corey Minyard <minyard@acm.org>
> Cc: Jack Steiner <steiner@sgi.com>
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
>  arch/x86/include/asm/nmi.h              |   20 ----------
>  arch/x86/include/asm/reboot.h           |    2 +-
>  arch/x86/kernel/apic/hw_nmi.c           |   27 +++-----------
>  arch/x86/kernel/apic/x2apic_uv_x.c      |   20 ++--------
>  arch/x86/kernel/cpu/mcheck/mce-inject.c |   20 ++++-------
>  arch/x86/kernel/cpu/mcheck/mce.c        |    3 --
>  arch/x86/kernel/cpu/perf_event.c        |   60 +++----------------------------
>  arch/x86/kernel/crash.c                 |    5 +--
>  arch/x86/kernel/kgdb.c                  |   60 +++++++++++++++++++++++--------
>  arch/x86/kernel/nmi.c                   |   11 ++++--
>  arch/x86/kernel/reboot.c                |   23 ++++--------
>  arch/x86/oprofile/nmi_int.c             |   40 ++++++--------------
>  arch/x86/oprofile/nmi_timer_int.c       |   28 +++-----------
>  drivers/acpi/apei/ghes.c                |   22 ++++-------
>  drivers/char/ipmi/ipmi_watchdog.c       |   32 +++++-----------
>  drivers/watchdog/hpwdt.c                |   23 +++---------
>  16 files changed, 125 insertions(+), 271 deletions(-)


> diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c
> index adc66c3..88b0dbb 100644
> --- a/arch/x86/kernel/apic/x2apic_uv_x.c
> +++ b/arch/x86/kernel/apic/x2apic_uv_x.c
> @@ -673,18 +673,11 @@ void __cpuinit uv_cpu_init(void)
>  /*
>   * When NMI is received, print a stack trace.
>   */
> -int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
> +int uv_handle_nmi(unsigned int reason, struct pt_regs *regs)
>  {
>         unsigned long real_uv_nmi;
>         int bid;
> 
> -       if (reason != DIE_NMIUNKNOWN)
> -               return NOTIFY_OK;
> -
> -       if (in_crash_kexec)
> -               /* do nothing if entering the crash kernel */
> -               return NOTIFY_OK;

Isn't this removed without a replacement, so the check is missing now?
The code will now be executed even in the in_crash_kexec case.

> -
>         /*
>          * Each blade has an MMR that indicates when an NMI has been sent
>          * to cpus on the blade. If an NMI is detected, atomically


> diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
> index 0ed633c..6199232 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
> @@ -78,27 +78,20 @@ static void raise_exception(struct mce *m, struct pt_regs *pregs)
> 
>  static cpumask_var_t mce_inject_cpumask;
> 
> -static int mce_raise_notify(struct notifier_block *self,
> -                           unsigned long val, void *data)
> +static int mce_raise_notify(unsigned int cmd, struct pt_regs *regs)
>  {
> -       struct die_args *args = (struct die_args *)data;
>         int cpu = smp_processor_id();
>         struct mce *m = &__get_cpu_var(injectm);
> -       if (val != DIE_NMI || !cpumask_test_cpu(cpu, mce_inject_cpumask))
> -               return NOTIFY_DONE;
> +       if (!cpumask_test_cpu(cpu, mce_inject_cpumask))
> +               return NMI_DONE;
>         cpumask_clear_cpu(cpu, mce_inject_cpumask);
>         if (m->inject_flags & MCJ_EXCEPTION)
> -               raise_exception(m, args->regs);
> +               raise_exception(m, regs);
>         else if (m->status)
>                 raise_poll(m);
> -       return NOTIFY_STOP;
> +       return NMI_HANDLED;
>  }
> 
> -static struct notifier_block mce_raise_nb = {
> -       .notifier_call = mce_raise_notify,
> -       .priority = NMI_LOCAL_NORMAL_PRIOR,
> -};
> -
>  /* Inject mce on current CPU */
>  static int raise_local(void)
>  {
> @@ -216,7 +209,8 @@ static int inject_init(void)
>                 return -ENOMEM;
>         printk(KERN_INFO "Machine check injector initialized\n");
>         mce_chrdev_ops.write = mce_write;
> -       register_die_notifier(&mce_raise_nb);
> +       register_nmi_handler(NMI_LOCAL, mce_raise_notify, 0,
> +                               "mce_notify");
>         return 0;
>  }
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 08363b0..3fc65b6 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -908,9 +908,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> 
>         percpu_inc(mce_exception_count);
> 
> -       if (notify_die(DIE_NMI, "machine check", regs, error_code,
> -                          18, SIGKILL) == NOTIFY_STOP)
> -               goto out;

Yes, this code is strange.  I checked all the nmi handlers but couldn't
find one that is directly related to this call.  But it could be there to
handle IPIs even in the case of an mce, to let backtrace and reboot
work.  CC'ing the mce guys.

I would rather add an nmi_handle() call here.

>         if (!banks)
>                 goto out;
> 
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 4ee3abf..767371f 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1375,68 +1375,18 @@ struct pmu_nmi_state {
>  static DEFINE_PER_CPU(struct pmu_nmi_state, pmu_nmi);
> 
>  static int __kprobes
> -perf_event_nmi_handler(struct notifier_block *self,
> -                        unsigned long cmd, void *__args)
> +perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>  {
> -       struct die_args *args = __args;
> -       unsigned int this_nmi;
>         int handled;
> 
>         if (!atomic_read(&active_events))
> -               return NOTIFY_DONE;
> +               return NMI_DONE;
> 
> -       switch (cmd) {
> -       case DIE_NMI:
> -               break;
> -       case DIE_NMIUNKNOWN:
> -               this_nmi = percpu_read(irq_stat.__nmi_count);
> -               if (this_nmi != __this_cpu_read(pmu_nmi.marked))
> -                       /* let the kernel handle the unknown nmi */
> -                       return NOTIFY_DONE;
> -               /*
> -                * This one is a PMU back-to-back nmi. Two events
> -                * trigger 'simultaneously' raising two back-to-back
> -                * NMIs. If the first NMI handles both, the latter
> -                * will be empty and daze the CPU. So, we drop it to
> -                * avoid false-positive 'unknown nmi' messages.
> -                */
> -               return NOTIFY_STOP;
> -       default:
> -               return NOTIFY_DONE;
> -       }
> -
> -       handled = x86_pmu.handle_irq(args->regs);
> -       if (!handled)
> -               return NOTIFY_DONE;
> -
> -       this_nmi = percpu_read(irq_stat.__nmi_count);
> -       if ((handled > 1) ||
> -               /* the next nmi could be a back-to-back nmi */
> -           ((__this_cpu_read(pmu_nmi.marked) == this_nmi) &&
> -            (__this_cpu_read(pmu_nmi.handled) > 1))) {
> -               /*
> -                * We could have two subsequent back-to-back nmis: The
> -                * first handles more than one counter, the 2nd
> -                * handles only one counter and the 3rd handles no
> -                * counter.
> -                *
> -                * This is the 2nd nmi because the previous was
> -                * handling more than one counter. We will mark the
> -                * next (3rd) and then drop it if unhandled.
> -                */
> -               __this_cpu_write(pmu_nmi.marked, this_nmi + 1);
> -               __this_cpu_write(pmu_nmi.handled, handled);

You replace the back-to-back logic by using swallow_nmi later. The
case above is not covered. Suppose a 2-1-0 sequence of handled nmis.
Your new code only covers the 2-0 case and will trigger an unknown nmi
error for 2-1-0.

We should reimplement the logic above in nmi.c.

Also, struct pmu_nmi_state and friends can be removed here.

-Robert

> -       }
> +       handled = x86_pmu.handle_irq(regs);
> 
> -       return NOTIFY_STOP;
> +       return handled;
>  }

-- 
Advanced Micro Devices, Inc.
Operating System Research Center


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 4/6] x86, nmi:  add in logic to handle multiple events and unknown NMIs
  2011-08-25 16:45 ` [V3][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
@ 2011-09-06 16:18   ` Robert Richter
  0 siblings, 0 replies; 24+ messages in thread
From: Robert Richter @ 2011-09-06 16:18 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck

On 25.08.11 12:45:46, Don Zickus wrote:
> Previous patches allow the NMI subsystem to process multiple NMI events
> in one NMI.  As previously discussed this can cause issues when an event
> triggered another NMI but is processed in the current NMI.  This causes the
> next NMI to go unprocessed and become an 'unknown' NMI.
> 
> Having this print 'unknown' NMI to the console would be inaccurate and
> scare users.  As a result I have copied the 'skip unknown' NMI logic
> developed by Robert Richter (and simplified a little because we can
> track _all_ NMIs better instead of just the perf ones) to the main
> NMI handling routine.
> 
> It is fairly simple: if, when processing an NMI, the nmi_handle routine returns
> more than one event handled, then a flag is set for future use.  This flag just
> says there might be a possible unknown NMI.  If an unknown NMI does come in,
> then it is skipped (swallowed).  Otherwise just clear the flag on the next NMI
> if it has events processed.
> 
> The algorithm isn't 100% accurate but for most loads we have seen in perf it
> captures a large majority of unknown NMIs.  Under high workloads there still
> is the chance that unknown NMIs can trigger because you can time it just right
> such that you are generating NMIs as fast as you can process them and go four
> or five NMIs before seeing the unknown NMI.
> 
> Without involving the concept of time when tracking these 'possible' NMIs,
> we may never be 100% reliable.  The idea with time being that back-to-back
> NMIs immediately follow each other.  Anything more than a micro second or so
> on modern machines between when the first NMI finished to when the second one
> starts, probably indicates a completely new event.
> 
> V2:
>   - forgot to add the 'read' code for swallow_nmi (went into next patch)
> 
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
>  arch/x86/kernel/nmi.c |   29 +++++++++++++++++++++++++++--
>  1 files changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 45fcd82..435dc71 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -270,6 +270,8 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
>  	pr_emerg("Dazed and confused, but trying to continue\n");
>  }
>  
> +DEFINE_PER_CPU(bool, swallow_nmi);
> +
>  static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
>  {
>  	unsigned char reason = 0;
> @@ -281,8 +283,28 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
>  	 * NMI can not be detected/processed on other CPUs.
>  	 */
>  	handled = nmi_handle(NMI_LOCAL, regs);
> -	if (handled)
> +	if (handled) {
> +		/*
> +		 * When handling multiple NMI events, we are not
> +		 * sure if the second NMI was dropped (because of
> +		 * too many NMIs), piggy-backed on the same NMI
> +		 * (perf) or is queued right behind this NMI.
> +		 * In the last case, we may accidentally get an
> +		 * unknown NMI because the event is already handled.
> +		 * Flag for this condition and swallow it later.
> +		 *
> +		 * FIXME: This detection has holes in it mainly
> +		 * because we can't tell _when_ the next NMI comes
> +		 * in.  A multi-handled NMI event followed by an 
> +		 * unknown NMI a second later, clearly should not
> +		 * be swallowed.
> +		 */
> +		if (handled > 1)
> +			__this_cpu_write(swallow_nmi, true);
> +		else
> +			__this_cpu_write(swallow_nmi, false);

See my comment for patch 3/6.

>  		return;
> +	}
>  
>  	/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
>  	raw_spin_lock(&nmi_reason_lock);
> @@ -305,7 +327,10 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
>  	}
>  	raw_spin_unlock(&nmi_reason_lock);
>  
> -	unknown_nmi_error(reason, regs);
> +	if (!__this_cpu_read(swallow_nmi))
> +		unknown_nmi_error(reason, regs);
> +
> +	__this_cpu_write(swallow_nmi, false);
>  }
>  
>  dotraplinkage notrace __kprobes void
> -- 
> 1.7.6
> 
> 

-- 
Advanced Micro Devices, Inc.
Operating System Research Center


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts
  2011-08-25 16:45 ` [V3][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts Don Zickus
@ 2011-09-06 16:39   ` Robert Richter
  2011-09-06 17:40     ` Don Zickus
  0 siblings, 1 reply; 24+ messages in thread
From: Robert Richter @ 2011-09-06 16:39 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck

On 25.08.11 12:45:48, Don Zickus wrote:
> This is a cheap hack to add the stats to the middle of /proc/interrupts.
> It is more of a conversation starter than anything as I am not sure
> the right letters and place to put this stuff.
> 
> The benefit of these stats is a better breakdown of which list the NMIs
> get handled in: either a normal handler, unknown, or external.  It also
> lists the number of unknown NMIs swallowed, to help check for false
> positives.  Another benefit is the ability to actually see which
> NMI handlers are currently registered in the system.
> 
> The output of 'cat /proc/interrupts' will look like this:
> 
> <snip>
>  58:        275          0        864          0   PCI-MSI-edge      eth0
> NMI:       4161       4155        158       4194   Non-maskable interrupts
> SWA:          0          0          0          0   Unknown NMIs swallowed
>   0:       4161       4155        158       4194   NMI  PMI, arch_bt

What does the "0" stand for?

We could easily provide a statistic for each NMI handler, which would
be more useful.

The syntax of the NMI printout is not yet perfect, so before adding it
as a new interface I would rather wait a bit to discuss this more.

People could also get confused because the handled count may differ from
the NMI count.  This should be documented more clearly, maybe as an
event count instead of an NMI count or so.

-Robert

> UNK:          0          0          0          0   NMI
> EXT:          0          0          0          0   NMI
> LOC:      12653      13304      13974      12926   Local timer interrupts
> SPU:          0          0          0          0   Spurious interrupts
> PMI:          6          6          5          6   Performance monitoring interrupts
> IWI:          0          0          0          0   IRQ work interrupts
> RES:       1839       1897       1821       1854   Rescheduling interrupts
> CAL:        524       2714        392        331   Function call interrupts
> TLB:        217        146        593        576   TLB shootdowns
> TRM:          0          0          0          0   Thermal event interrupts
> THR:          0          0          0          0   Threshold APIC interrupts
> MCE:          0          0          0          0   Machine check exceptions
> MCP:          1          1          1          1   Machine check polls
> ERR:          0
> MIS:          0

-- 
Advanced Micro Devices, Inc.
Operating System Research Center


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 0/6] x86, nmi: new NMI handling routines
  2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
                   ` (6 preceding siblings ...)
  2011-08-26  9:44 ` [V3][PATCH 0/6] x86, nmi: new NMI handling routines Peter Zijlstra
@ 2011-09-06 16:43 ` Robert Richter
  7 siblings, 0 replies; 24+ messages in thread
From: Robert Richter @ 2011-09-06 16:43 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck

On 25.08.11 12:45:42, Don Zickus wrote:
> I had the pleasure of hosting Robert Richter this week at Red Hat.  One of
> the issues he wanted to talk with me about was having the NMI handling 
> routines execute all the NMI handlers for each NMI mainly for his AMD IBS
> work.  But he also brought up another good point that because of the way NMIs
> work, it is possible to lose them if multiple NMIs happen at the same time.
> 
> As a result, we sat around and discussed how we could go about executing
> all the nmi handlers for each NMI to ensure that we would not lose any events.
> 
> We decided the best way to do this would be to have the NMI handlers break
> away from the notifier routines and create our own.  This would allow us to
> execute all the handlers without hacking up the notifier stuff and easily
> track the number of events processed at a higher level to deal with the new
> problem of extra NMIs.
> 
> I spent some time hacking and came up with this patch.  I tested it on my
> core2quad machine trying to enable all the NMI handler I could, mainly
> perf and kgdb (and oprofile too when perf was disabled).  Everything seems
> to work correctly.  If people are ok with this approach, I'll try and test
> this on more machines.
> 
> More details about the patch are in the individual changelogs.

Don,

thanks for this patch set.  So far it looks good to me.  See the
comments I posted.  I don't think they necessarily require a new patch
version; instead, we could implement changes on top of it.

I will rebase my current perf IBS implementation on top of these
patches.  It would be good to have the patches in tip once kernel.org
is back again.

Thanks,

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
  2011-09-06 16:15   ` Robert Richter
@ 2011-09-06 16:52     ` Don Zickus
  2011-09-07  9:03       ` Borislav Petkov
  0 siblings, 1 reply; 24+ messages in thread
From: Don Zickus @ 2011-09-06 16:52 UTC (permalink / raw)
  To: Robert Richter
  Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck,
	Jason Wessel, Andi Kleen, Corey Minyard, Jack Steiner,
	Borislav Petkov, Tony Luck

On Tue, Sep 06, 2011 at 06:15:45PM +0200, Robert Richter wrote:
> On 25.08.11 12:45:45, Don Zickus wrote:
> > Just convert all the files that have an nmi handler to the new routines.
> > Most of it is straightforward conversion.  A couple of places needed some
> > tweaking like kgdb which separates the debug notifier from the nmi handler
> > and mce removes a call to notify_die (as I couldn't figure out why it was
> > there).
> 
> For mce, see my comment below.
> 
> > 
> > The things that get converted are the registration/unregistration routines
> > and the nmi handler itself has its args changed along with code removal
> > to check which list it is on (most are on one NMI list except for kgdb
> > which has both an NMI routine and an NMI Unknown routine).
> > 
> > Cc: Jason Wessel <jason.wessel@windriver.com>
> > Cc: Andi Kleen <ak@linux.intel.com>
> > Cc: Robert Richter <robert.richter@amd.com>
> > Cc: Huang Ying <ying.huang@intel.com>
> > Cc: Corey Minyard <minyard@acm.org>
> > Cc: Jack Steiner <steiner@sgi.com>
> > Signed-off-by: Don Zickus <dzickus@redhat.com>
> > ---
> >  arch/x86/include/asm/nmi.h              |   20 ----------
> >  arch/x86/include/asm/reboot.h           |    2 +-
> >  arch/x86/kernel/apic/hw_nmi.c           |   27 +++-----------
> >  arch/x86/kernel/apic/x2apic_uv_x.c      |   20 ++--------
> >  arch/x86/kernel/cpu/mcheck/mce-inject.c |   20 ++++-------
> >  arch/x86/kernel/cpu/mcheck/mce.c        |    3 --
> >  arch/x86/kernel/cpu/perf_event.c        |   60 +++----------------------------
> >  arch/x86/kernel/crash.c                 |    5 +--
> >  arch/x86/kernel/kgdb.c                  |   60 +++++++++++++++++++++++--------
> >  arch/x86/kernel/nmi.c                   |   11 ++++--
> >  arch/x86/kernel/reboot.c                |   23 ++++--------
> >  arch/x86/oprofile/nmi_int.c             |   40 ++++++--------------
> >  arch/x86/oprofile/nmi_timer_int.c       |   28 +++-----------
> >  drivers/acpi/apei/ghes.c                |   22 ++++-------
> >  drivers/char/ipmi/ipmi_watchdog.c       |   32 +++++-----------
> >  drivers/watchdog/hpwdt.c                |   23 +++---------
> >  16 files changed, 125 insertions(+), 271 deletions(-)
> 
> 
> > diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c
> > index adc66c3..88b0dbb 100644
> > --- a/arch/x86/kernel/apic/x2apic_uv_x.c
> > +++ b/arch/x86/kernel/apic/x2apic_uv_x.c
> > @@ -673,18 +673,11 @@ void __cpuinit uv_cpu_init(void)
> >  /*
> >   * When NMI is received, print a stack trace.
> >   */
> > -int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
> > +int uv_handle_nmi(unsigned int reason, struct pt_regs *regs)
> >  {
> >         unsigned long real_uv_nmi;
> >         int bid;
> > 
> > -       if (reason != DIE_NMIUNKNOWN)
> > -               return NOTIFY_OK;
> > -
> > -       if (in_crash_kexec)
> > -               /* do nothing if entering the crash kernel */
> > -               return NOTIFY_OK;
> 
> Isn't this removed without a replacement, so this check is missing now?
> The code will now be executed in the in_crash_kexec case.

This code was bogus to begin with.  The reason it was executed was that
the kexec stuff had a lower priority.  I fixed this by using a higher
priority in the old code and the flag NMI_FLAG_FIRST for the kexec stuff
in this change.

SGI was trying to test the higher-priority fix when they ran into
issues with 2048 cpus.  The 1024-cpu case worked.  I don't think they ever
got around to re-running the test case to verify that changing the
priorities obsoletes the above change.

> 
> > -
> >         /*
> >          * Each blade has an MMR that indicates when an NMI has been sent
> >          * to cpus on the blade. If an NMI is detected, atomically
> 
> 
> > diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
> > index 0ed633c..6199232 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
> > @@ -78,27 +78,20 @@ static void raise_exception(struct mce *m, struct pt_regs *pregs)
> > 
> >  static cpumask_var_t mce_inject_cpumask;
> > 
> > -static int mce_raise_notify(struct notifier_block *self,
> > -                           unsigned long val, void *data)
> > +static int mce_raise_notify(unsigned int cmd, struct pt_regs *regs)
> >  {
> > -       struct die_args *args = (struct die_args *)data;
> >         int cpu = smp_processor_id();
> >         struct mce *m = &__get_cpu_var(injectm);
> > -       if (val != DIE_NMI || !cpumask_test_cpu(cpu, mce_inject_cpumask))
> > -               return NOTIFY_DONE;
> > +       if (!cpumask_test_cpu(cpu, mce_inject_cpumask))
> > +               return NMI_DONE;
> >         cpumask_clear_cpu(cpu, mce_inject_cpumask);
> >         if (m->inject_flags & MCJ_EXCEPTION)
> > -               raise_exception(m, args->regs);
> > +               raise_exception(m, regs);
> >         else if (m->status)
> >                 raise_poll(m);
> > -       return NOTIFY_STOP;
> > +       return NMI_HANDLED;
> >  }
> > 
> > -static struct notifier_block mce_raise_nb = {
> > -       .notifier_call = mce_raise_notify,
> > -       .priority = NMI_LOCAL_NORMAL_PRIOR,
> > -};
> > -
> >  /* Inject mce on current CPU */
> >  static int raise_local(void)
> >  {
> > @@ -216,7 +209,8 @@ static int inject_init(void)
> >                 return -ENOMEM;
> >         printk(KERN_INFO "Machine check injector initialized\n");
> >         mce_chrdev_ops.write = mce_write;
> > -       register_die_notifier(&mce_raise_nb);
> > +       register_nmi_handler(NMI_LOCAL, mce_raise_notify, 0,
> > +                               "mce_notify");
> >         return 0;
> >  }
> > 
> > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> > index 08363b0..3fc65b6 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -908,9 +908,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> > 
> >         percpu_inc(mce_exception_count);
> > 
> > -       if (notify_die(DIE_NMI, "machine check", regs, error_code,
> > -                          18, SIGKILL) == NOTIFY_STOP)
> > -               goto out;
> 
> Yes, this code is strange. I checked all the nmi handlers but couldn't
> find one that is direct related to this call. But it could be to
> handle IPIs even in the case of an mce to let backtrace and reboot
> work. CC'ing mce guys.
> 
> I would rather add an nmi_handle() call here.

I checked too, and the code predates 2.6.12, so I have no idea why it was
there.  One of the reasons I wanted to remove it was to keep all the users
internal to the nmi.c file.  I also removed most of the parameters from
notify_die as they were not being used.  I would hate to add them back in
because of an mce hack.

I'm sure after 4-5 years (whenever this was added), we can find a better
way to do whatever it is doing, no?

But if I have to support this call, it complicates all the changes I made
unnecessarily. :-(

> 
> >         if (!banks)
> >                 goto out;
> > 
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 4ee3abf..767371f 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -1375,68 +1375,18 @@ struct pmu_nmi_state {
> >  static DEFINE_PER_CPU(struct pmu_nmi_state, pmu_nmi);
> > 
> >  static int __kprobes
> > -perf_event_nmi_handler(struct notifier_block *self,
> > -                        unsigned long cmd, void *__args)
> > +perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
> >  {
> > -       struct die_args *args = __args;
> > -       unsigned int this_nmi;
> >         int handled;
> > 
> >         if (!atomic_read(&active_events))
> > -               return NOTIFY_DONE;
> > +               return NMI_DONE;
> > 
> > -       switch (cmd) {
> > -       case DIE_NMI:
> > -               break;
> > -       case DIE_NMIUNKNOWN:
> > -               this_nmi = percpu_read(irq_stat.__nmi_count);
> > -               if (this_nmi != __this_cpu_read(pmu_nmi.marked))
> > -                       /* let the kernel handle the unknown nmi */
> > -                       return NOTIFY_DONE;
> > -               /*
> > -                * This one is a PMU back-to-back nmi. Two events
> > -                * trigger 'simultaneously' raising two back-to-back
> > -                * NMIs. If the first NMI handles both, the latter
> > -                * will be empty and daze the CPU. So, we drop it to
> > -                * avoid false-positive 'unknown nmi' messages.
> > -                */
> > -               return NOTIFY_STOP;
> > -       default:
> > -               return NOTIFY_DONE;
> > -       }
> > -
> > -       handled = x86_pmu.handle_irq(args->regs);
> > -       if (!handled)
> > -               return NOTIFY_DONE;
> > -
> > -       this_nmi = percpu_read(irq_stat.__nmi_count);
> > -       if ((handled > 1) ||
> > -               /* the next nmi could be a back-to-back nmi */
> > -           ((__this_cpu_read(pmu_nmi.marked) == this_nmi) &&
> > -            (__this_cpu_read(pmu_nmi.handled) > 1))) {
> > -               /*
> > -                * We could have two subsequent back-to-back nmis: The
> > -                * first handles more than one counter, the 2nd
> > -                * handles only one counter and the 3rd handles no
> > -                * counter.
> > -                *
> > -                * This is the 2nd nmi because the previous was
> > -                * handling more than one counter. We will mark the
> > -                * next (3rd) and then drop it if unhandled.
> > -                */
> > -               __this_cpu_write(pmu_nmi.marked, this_nmi + 1);
> > -               __this_cpu_write(pmu_nmi.handled, handled);
> 
> You replace the back-to-back logic by using swallow_nmi later. The
> case above is not covered. Suppose a 2-1-0 sequence of handled nmis.
> Your new code only covers the 2-0 case and will trigger an unknown nmi
> error for 2-1-0.

Ah yes.  It looks like I rushed that change (but that is why I broke it
out as a separate patch :-) ).

> 
> We should reimplement the logic above in nmi.c.
> 
> Also, struct pmu_nmi_state and friends can be removed here.

Ah yes.

Cheers,
Don
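The 2-1-0 sequence discussed above can be sketched in userspace as follows.  This is a hedged stand-in for the per-cpu marking logic in the quoted perf code, not the kernel implementation; `nmi_count`, `marked_nmi` and `last_handled` play the roles of `irq_stat.__nmi_count`, `pmu_nmi.marked` and `pmu_nmi.handled`:

```c
#include <stdbool.h>

/* Userspace sketch (not kernel code) of the back-to-back NMI marking
 * logic quoted above.  An NMI that handles more than one event -- or
 * that immediately follows one which did -- marks the *next* NMI, and
 * a marked NMI that handles nothing is swallowed instead of being
 * reported as unknown.  This covers the 2-1-0 sequence: the second
 * NMI (handled == 1) re-marks the third, which is then dropped. */

static unsigned int nmi_count;    /* stand-in for irq_stat.__nmi_count */
static unsigned int marked_nmi;   /* the next NMI allowed to be empty  */
static unsigned int last_handled; /* events handled by the marking NMI */

/* Returns true if this (unhandled) NMI should be swallowed. */
static bool process_nmi(int handled)
{
    nmi_count++;

    if (!handled)
        return marked_nmi == nmi_count; /* swallow only if marked */

    if (handled > 1 ||
        (marked_nmi == nmi_count && last_handled > 1)) {
        /* The next NMI may be a spurious back-to-back one: mark it. */
        marked_nmi = nmi_count + 1;
        last_handled = handled;
    }
    return false;
}
```

A plain 2-0 pair is swallowed the same way, while a 1-0 pair correctly leaves the empty NMI to be reported as unknown.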

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [V3][PATCH 2/6] x86, nmi: create new NMI handler routines
  2011-09-06 10:08   ` Robert Richter
@ 2011-09-06 16:53     ` Don Zickus
  0 siblings, 0 replies; 24+ messages in thread
From: Don Zickus @ 2011-09-06 16:53 UTC (permalink / raw)
  To: Robert Richter; +Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck

On Tue, Sep 06, 2011 at 12:08:57PM +0200, Robert Richter wrote:
> On 25.08.11 12:45:44, Don Zickus wrote:
> > +int register_nmi_handler(unsigned int type, nmi_handler_t handler,
> > +			unsigned long nmiflags, const char *devname)
> > +{
> > +	struct nmiaction *action;
> > +	int retval;
> > +
> > +	if (!handler)
> > +		return -EINVAL;
> > +
> > +	action = kzalloc(sizeof(struct nmiaction), GFP_KERNEL);
> > +	if (!action)
> > +		return -ENOMEM;
> > +
> > +	action->handler = handler;
> > +	action->flags = nmiflags;
> > +	action->name = kstrndup(devname, NMI_MAX_NAMELEN, GFP_KERNEL);
> 
> Null pointer check is missing here.

Thanks.

Cheers,
Don


* Re: [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
  2011-08-25 16:45 ` [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
  2011-09-06 16:15   ` Robert Richter
@ 2011-09-06 17:20   ` Corey Minyard
  2011-09-06 17:49     ` Don Zickus
  1 sibling, 1 reply; 24+ messages in thread
From: Corey Minyard @ 2011-09-06 17:20 UTC (permalink / raw)
  To: Don Zickus
  Cc: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang,
	LKML, paulmck, Jason Wessel, Andi Kleen, Jack Steiner

On 08/25/2011 11:45 AM, Don Zickus wrote:
> diff --git a/drivers/char/ipmi/ipmi_watchdog.c b/drivers/char/ipmi/ipmi_watchdog.c
> index 3302586..3dcab56 100644
> --- a/drivers/char/ipmi/ipmi_watchdog.c
> +++ b/drivers/char/ipmi/ipmi_watchdog.c
> @@ -1077,17 +1077,8 @@ static void ipmi_unregister_watchdog(int ipmi_intf)
>
>   #ifdef HAVE_DIE_NMI
>   static int
> -ipmi_nmi(struct notifier_block *self, unsigned long val, void *data)
> +ipmi_nmi(unsigned int val, struct pt_regs *regs)
>   {
> -	struct die_args *args = data;
> -
> -	if (val != DIE_NMIUNKNOWN)
> -		return NOTIFY_OK;
> -
> -	/* Hack, if it's a memory or I/O error, ignore it. */
> > -	if (args->err & 0xc0)
> -		return NOTIFY_OK;
> -
>   	/*
>   	 * If we get here, it's an NMI that's not a memory or I/O
>   	 * error.  We can't truly tell if it's from IPMI or not
From the comment above, this will not actually work for the IPMI 
watchdog.  The only way to tell if an NMI is from the IPMI controller is 
to send a message to the IPMI controller and wait for the response.  Not 
an option in an NMI handler.  So the IPMI watchdog driver relied on it 
being last, and if nothing else handled the NMI, then it must have been 
from the IPMI controller.

This is stupid, I know, but that's the way it works in IPMI.  If we 
decide to do what you are suggesting we will have to disable this 
function in the IPMI watchdog driver or we need some way to say "if 
nothing else handled the NMI, call this."

-corey



* Re: [V3][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts
  2011-09-06 16:39   ` Robert Richter
@ 2011-09-06 17:40     ` Don Zickus
  0 siblings, 0 replies; 24+ messages in thread
From: Don Zickus @ 2011-09-06 17:40 UTC (permalink / raw)
  To: Robert Richter; +Cc: x86, Andi Kleen, Peter Zijlstra, ying.huang, LKML, paulmck

On Tue, Sep 06, 2011 at 06:39:04PM +0200, Robert Richter wrote:
> On 25.08.11 12:45:48, Don Zickus wrote:
> > This is a cheap hack to add the stats to the middle of /proc/interrupts.
> > It is more of a conversation starter than anything as I am not sure
> > the right letters and place to put this stuff.
> > 
> > The benefit of these stats is a better breakdown of which list the NMIs
> > get handled in either a normal handler, unknown, or external.  It also
> > list the number of unknown NMIs swallowed to help check for false
> > positives or not.  Another benefit is the ability to actually see which
> > NMI handlers are currently registered in the system.
> > 
> > The output of 'cat /proc/interrupts' will look like this:
> > 
> > <snip>
> >  58:        275          0        864          0   PCI-MSI-edge      eth0
> > NMI:       4161       4155        158       4194   Non-maskable interrupts
> > SWA:          0          0          0          0   Unknown NMIs swallowed
> >   0:       4161       4155        158       4194   NMI  PMI, arch_bt
> 
> What does the "0" stand for?

Just a placeholder meaning NMI0 for the normal NMIs, but as you stated
below, it is more of an event counter.  I wanted to use LOC for local, but
that was already taken.

> 
> We could easily provide a statistic for each NMI handler, which would
> be more useful.

Of course, but I wasn't sure how to print that without making
/proc/interrupts look messy.  I suggested to Peter earlier that a separate
/proc/nmi file could print it.

> 
> The syntax of the NMI printout is not yet perfect, so before adding it
> as a new interface I would rather wait a bit to discuss this more.
> 
> People also could get confused because the handled count may be
> different to nmi count. This should be documented more clearly, maybe
> as event count instead of nmi count or so.

That is why I put this patch last, with a big blurb on top stating it
needs more feedback. :-)  The hacky hook into /proc/interrupts probably
isn't reasonable either.

Cheers,
Don

> 
> -Robert
> 
> > UNK:          0          0          0          0   NMI
> > EXT:          0          0          0          0   NMI
> > LOC:      12653      13304      13974      12926   Local timer interrupts
> > SPU:          0          0          0          0   Spurious interrupts
> > PMI:          6          6          5          6   Performance monitoring interrupts
> > IWI:          0          0          0          0   IRQ work interrupts
> > RES:       1839       1897       1821       1854   Rescheduling interrupts
> > CAL:        524       2714        392        331   Function call interrupts
> > TLB:        217        146        593        576   TLB shootdowns
> > TRM:          0          0          0          0   Thermal event interrupts
> > THR:          0          0          0          0   Threshold APIC interrupts
> > MCE:          0          0          0          0   Machine check exceptions
> > MCP:          1          1          1          1   Machine check polls
> > ERR:          0
> > MIS:          0
> 
> -- 
> Advanced Micro Devices, Inc.
> Operating System Research Center
> 
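The per-type rows quoted above could be formatted along these lines.  The kernel version would use `seq_printf()` from the /proc/interrupts hook; this userspace stand-in only builds the same kind of row, and the struct and field names are assumptions for illustration, not taken from the patch:

```c
#include <stdio.h>
#include <string.h>

#define NR_CPUS 4

struct nmi_stats {
    unsigned int normal[NR_CPUS];    /* handled on the NMI_LOCAL list */
    unsigned int unknown[NR_CPUS];   /* fell through to unknown       */
    unsigned int external[NR_CPUS];  /* external NMIs                 */
    unsigned int swallowed[NR_CPUS]; /* unknown NMIs silently dropped */
};

/* Format one row like "SWA:     0     0     0     0   Unknown NMIs..." */
static int format_row(char *buf, size_t len, const char *tag,
                      const unsigned int *per_cpu, const char *desc)
{
    int cpu, n = snprintf(buf, len, "%s:", tag);

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        n += snprintf(buf + n, len - n, " %10u", per_cpu[cpu]);
    n += snprintf(buf + n, len - n, "   %s", desc);
    return n;
}

void show_nmi_stats(const struct nmi_stats *s)
{
    char buf[128];

    format_row(buf, sizeof(buf), "SWA", s->swallowed, "Unknown NMIs swallowed");
    puts(buf);
    format_row(buf, sizeof(buf), "  0", s->normal, "NMI  PMI, arch_bt");
    puts(buf);
    format_row(buf, sizeof(buf), "UNK", s->unknown, "NMI");
    puts(buf);
    format_row(buf, sizeof(buf), "EXT", s->external, "NMI");
    puts(buf);
}
```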


* Re: [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
  2011-09-06 17:20   ` Corey Minyard
@ 2011-09-06 17:49     ` Don Zickus
  2011-09-06 17:59       ` Corey Minyard
  0 siblings, 1 reply; 24+ messages in thread
From: Don Zickus @ 2011-09-06 17:49 UTC (permalink / raw)
  To: Corey Minyard
  Cc: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang,
	LKML, paulmck, Jason Wessel, Andi Kleen, Jack Steiner

On Tue, Sep 06, 2011 at 12:20:19PM -0500, Corey Minyard wrote:
> >+ipmi_nmi(unsigned int val, struct pt_regs *regs)
> >  {
> >-	struct die_args *args = data;
> >-
> >-	if (val != DIE_NMIUNKNOWN)
> >-		return NOTIFY_OK;
> >-
> >-	/* Hack, if it's a memory or I/O error, ignore it. */
> >-	if (args->err & 0xc0)
> >-		return NOTIFY_OK;
> >-
> >  	/*
> >  	 * If we get here, it's an NMI that's not a memory or I/O
> >  	 * error.  We can't truly tell if it's from IPMI or not
> From the comment above, this will not actually work for the IPMI
> watchdog.  The only way to tell if an NMI is from the IPMI
> controller is to send a message to the IPMI controller and wait for
> the response.  Not an option in an NMI handler.  So the IPMI
> watchdog driver relied on it being last, and if nothing else handled
> the NMI, then it must have been from the IPMI controller.
> 
> This is stupid, I know, but that's the way it works in IPMI.  If we
> decide to do what you are suggesting we will have to disable this
> function in the IPMI watchdog driver or we need some way to say "if
> nothing else handled the NMI, call this."

Hi Corey,

Thanks for the review.

I know, and ipmi isn't the only driver with an issue like this; hpwdt
is in the same boat.  This is why I registered the driver against
NMI_UNKNOWN instead of NMI_LOCAL.  The ipmi driver should only be called
when no one else claims the NMI (including a possible external one).

The only drivers I know that have to do it this way are hpwdt, sgi and
ipmi.  By design, none of those drivers should be loaded at the same time
(as they are all uniquely hardware specific).

BTW, this is also why I removed the 'Hack, if it's a memory or I/O error',
because the ipmi handler should not be called for an external NMI now.

I hope that clears things up.

Cheers,
Don
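The ordering described above can be sketched like this.  Handlers on the NMI_LOCAL list run first, and the NMI_UNKNOWN list (where ipmi, hpwdt and the sgi driver would register) is consulted only when no local handler claimed the NMI.  The names and values below are modeled on the patch but simplified; this is not the kernel's actual dispatch code:

```c
#include <stddef.h>

enum nmi_list { NMI_LOCAL, NMI_UNKNOWN, NMI_MAX };
enum { NMI_DONE = 0, NMI_HANDLED = 1 };

typedef int (*nmi_handler_t)(unsigned int type, void *regs);

#define MAX_HANDLERS 4
static nmi_handler_t handlers[NMI_MAX][MAX_HANDLERS];
static int nr_handlers[NMI_MAX];

void register_handler(enum nmi_list type, nmi_handler_t h)
{
    handlers[type][nr_handlers[type]++] = h;
}

static int nmi_handle(enum nmi_list type, void *regs)
{
    int i, handled = 0;

    for (i = 0; i < nr_handlers[type]; i++)
        handled += handlers[type][i](type, regs);
    return handled;
}

/* Top-level NMI entry: only fall back to the unknown list when
 * nothing on the local list claimed the event. */
int do_nmi(void *regs)
{
    if (nmi_handle(NMI_LOCAL, regs))
        return NMI_HANDLED;
    return nmi_handle(NMI_UNKNOWN, regs) ? NMI_HANDLED : NMI_DONE;
}

int ipmi_calls; /* counts how often the ipmi-style handler ran */

int perf_like(unsigned int type, void *regs)
{
    (void)type; (void)regs;
    return NMI_HANDLED;     /* always claims the NMI */
}

int ipmi_like(unsigned int type, void *regs)
{
    (void)type; (void)regs;
    ipmi_calls++;
    return NMI_HANDLED;     /* "if nothing else handled it, it's mine" */
}
```

With a perf-style handler on the local list, the ipmi-style handler never runs, which is exactly the "last resort" behavior Corey needs.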


* Re: [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
  2011-09-06 17:49     ` Don Zickus
@ 2011-09-06 17:59       ` Corey Minyard
  0 siblings, 0 replies; 24+ messages in thread
From: Corey Minyard @ 2011-09-06 17:59 UTC (permalink / raw)
  To: Don Zickus
  Cc: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang,
	LKML, paulmck, Jason Wessel, Andi Kleen, Jack Steiner

On 09/06/2011 12:49 PM, Don Zickus wrote:
> On Tue, Sep 06, 2011 at 12:20:19PM -0500, Corey Minyard wrote:
>>> +ipmi_nmi(unsigned int val, struct pt_regs *regs)
>>>   {
>>> -	struct die_args *args = data;
>>> -
>>> -	if (val != DIE_NMIUNKNOWN)
>>> -		return NOTIFY_OK;
>>> -
>>> -	/* Hack, if it's a memory or I/O error, ignore it. */
>>> -	if (args->err & 0xc0)
>>> -		return NOTIFY_OK;
>>> -
>>>   	/*
>>>   	 * If we get here, it's an NMI that's not a memory or I/O
>>>   	 * error.  We can't truly tell if it's from IPMI or not
>>  From the comment above, this will not actually work for the IPMI
>> watchdog.  The only way to tell if an NMI is from the IPMI
>> controller is to send a message to the IPMI controller and wait for
>> the response.  Not an option in an NMI handler.  So the IPMI
>> watchdog driver relied on it being last, and if nothing else handled
>> the NMI, then it must have been from the IPMI controller.
>>
>> This is stupid, I know, but that's the way it works in IPMI.  If we
>> decide to do what you are suggesting we will have to disable this
>> function in the IPMI watchdog driver or we need some way to say "if
>> nothing else handled the NMI, call this."
> Hi Corey,
>
> Thanks for the review.
>
> I know and you aren't the only driver that has an issue like this, hpwdt
> is in the same boat.  This is why I registered the driver against
> NMI_UNKNOWN instead of NMI_LOCAL.  The ipmi driver should only be called
> when no one else claims the NMI (including a possible external one).
>
> The only drivers I know that have to do it this way are hpwdt, sgi and
> ipmi.  By design, none of those drivers should be loaded at the same time
> (as they are all uniquely hardware specific).
>
> BTW, this is also why I removed the 'Hack, if it's a memory or I/O error',
> because the ipmi handler should not be called for an external NMI now.
>
> I hope that clears things up.
Ok, that should work.  I didn't read through the whole set of patches 
and I missed that.  This is good for me.  Thanks.

-corey


* Re: [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
  2011-09-06 16:52     ` Don Zickus
@ 2011-09-07  9:03       ` Borislav Petkov
  0 siblings, 0 replies; 24+ messages in thread
From: Borislav Petkov @ 2011-09-07  9:03 UTC (permalink / raw)
  To: Don Zickus
  Cc: Richter, Robert, x86, Andi Kleen, Peter Zijlstra, ying.huang,
	LKML, paulmck, Jason Wessel, Andi Kleen, Corey Minyard,
	Jack Steiner, Tony Luck

On Tue, Sep 06, 2011 at 12:52:53PM -0400, Don Zickus wrote:

[..]

> > > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> > > index 08363b0..3fc65b6 100644
> > > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > > @@ -908,9 +908,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> > >
> > >         percpu_inc(mce_exception_count);
> > >
> > > -       if (notify_die(DIE_NMI, "machine check", regs, error_code,
> > > -                          18, SIGKILL) == NOTIFY_STOP)
> > > -               goto out;
> >
> > Yes, this code is strange. I checked all the nmi handlers but couldn't
> > find one that is direct related to this call. But it could be to
> > handle IPIs even in the case of an mce to let backtrace and reboot
> > work. CC'ing mce guys.
> >
> > I would rather add an nmi_handle() call here.
> 
> I checked too, and the code predates 2.6.12, so I have no idea why it was
> there.  One of the reasons I wanted to remove it was to keep all the users
> internal to the nmi.c file.  I also removed most of the parameters from
> notify_die as they were not being used.  I would hate to add them back in
> because of an mce hack.
> 
> I'm sure after 4-5 years (whenever this was added), we can find a better
> way to do whatever it is doing, no?
> 
> But if I have to support this call, it complicates all the changes I made
> unnecessarily. :-(

This code comes from a combined x86_64 update commit from 2003, AFAICT:

commit 3d71dbc9afbd7eecdc71e0329d6f16f2dcd48e39
Author: Andi Kleen <ak@suse.de>
Date:   Mon Mar 24 19:54:54 2003 -0800

    [PATCH] x86-64 updates

    Lots of x86-64 updates. Merge with 2.4 and NUMA works now. Also reenabled
    the preemptive kernel. And some other bug fixes.
    IOMMU disabled by default now because it has problems.

     - Add more CONFIG options for device driver debugging and iommu
       force/debug.  (don't enable iommu force currently)
     - Some S3/ACPI fixes/cleanups from Pavel.
     ....


and the file was called arch/x86_64/kernel/bluesmoke.c back then.
Unfortunately, nothing in the commit message hints at why it was added.
I guess it was some sort of notification mechanism to warn the rest of
the kernel that we might die soon because we received an MCE, so that
it could take some cleanup action before going down.

So I don't think we use it anywhere - originally Robert and I thought
that mce-inject.c relies on it indirectly but it does its own NMI
injection when the MCE needs to be broadcast and injected on all cores
and it also registers its own NMI notifier mce_raise_notify() which
you've already converted.

Tony, anything I'm missing?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551


end of thread, other threads:[~2011-09-07 16:49 UTC | newest]

2011-08-25 16:45 [V3][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
2011-08-25 16:45 ` [V3][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
2011-08-25 16:45 ` [V3][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
2011-08-26  9:44   ` Peter Zijlstra
2011-08-26 14:21     ` Don Zickus
2011-09-06 10:08   ` Robert Richter
2011-09-06 16:53     ` Don Zickus
2011-08-25 16:45 ` [V3][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
2011-09-06 16:15   ` Robert Richter
2011-09-06 16:52     ` Don Zickus
2011-09-07  9:03       ` Borislav Petkov
2011-09-06 17:20   ` Corey Minyard
2011-09-06 17:49     ` Don Zickus
2011-09-06 17:59       ` Corey Minyard
2011-08-25 16:45 ` [V3][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
2011-09-06 16:18   ` Robert Richter
2011-08-25 16:45 ` [V3][PATCH 5/6] x86, nmi: track NMI usage stats Don Zickus
2011-08-25 16:45 ` [V3][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts Don Zickus
2011-09-06 16:39   ` Robert Richter
2011-09-06 17:40     ` Don Zickus
2011-08-26  9:44 ` [V3][PATCH 0/6] x86, nmi: new NMI handling routines Peter Zijlstra
2011-08-26 14:39   ` Don Zickus
2011-08-26 14:51     ` Peter Zijlstra
2011-09-06 16:43 ` Robert Richter
