* [PATCH 0/5] Yet another pass at machine check recovery
@ 2011-08-31 22:21 Luck, Tony
  2011-08-31 22:25 ` [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY Luck, Tony
                   ` (10 more replies)
  0 siblings, 11 replies; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

Scaled down ambitions ... this version just goes for recovery of
data errors detected in user mode.

-Tony

 arch/x86/include/asm/thread_info.h        |    4 -
 arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 ++++
 arch/x86/kernel/cpu/mcheck/mce.c          |   91 ++++++++++++++++--------------
 arch/x86/kernel/signal.c                  |    6 -
 mm/memory-failure.c                       |   39 +++++++-----
 5 files changed, 86 insertions(+), 68 deletions(-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
@ 2011-08-31 22:25 ` Luck, Tony
  2011-09-07  9:11   ` Borislav Petkov
  2011-08-31 22:25 ` Luck, Tony
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

The basic flow of MCE handler is summarized as follows:
 1) from NMI context:
	check hardware error registers, determine error severity,
	and then panic or request non-NMI context by irq_work() to
	continue the system.
 2) from (irq) context:
	call non-NMI safe functions,
	wake up loggers and schedule work if required
 3) from worker thread:
	process some time-consuming works like memory poisoning.

The TIF_MCE_NOTIFY flag is a legacy mechanism that has been used to do
the tasks of 2) and 3) in the context of the thread interrupted by the
MCE.  However, irq_work() and the work queue are now sufficient for
these tasks, so this patch removes the duplicated work from
mce_notify_process().

As a result there is no longer any work to be done in the interrupted
context.  However, once SRAR (software recoverable action required)
errors are supported there will be some thread-specific handling for
the action-required case, so keep the flag for such possible future use
until a better mechanism is introduced.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

This is the combination of two patches by Seto-san - with Boris'
suggestion to not create a 2-line function mce_memory_failure_process()
but to leave those inline.
Message-ID: <4DFB1476.40804@jp.fujitsu.com>
[PATCH 6/8] x86, mce: introduce mce_memory_failure_process()
Message-ID: <4DFB1509.7020402@jp.fujitsu.com>
[PATCH 7/8] x86, mce: rework use of TIF_MCE_NOTIFY
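
[Editor's note - illustration, not part of the patch.]  The staged
hand-off described above (NMI context queues an irq_work, whose handler
runs in ordinary interrupt context and defers the slow path to a
workqueue) follows the usual kernel pattern sketched below.  Only
mce_irq_work and mce_process_work are names taken from mce.c; the rest
of the glue is simplified and hypothetical.

#include <linux/irq_work.h>
#include <linux/workqueue.h>
#include <linux/percpu.h>

/* 3) worker thread: allowed to sleep, so expensive work goes here */
static void mce_process_work(struct work_struct *dummy)
{
	/* e.g. pull poisoned pfns off a ring and call memory_failure() */
}

static DECLARE_WORK(mce_slow_work, mce_process_work);

/* 2) irq_work callback: ordinary interrupt context, non-NMI-safe calls OK */
static void mce_irq_work_cb(struct irq_work *entry)
{
	/* wake processes polling /dev/mcelog, then defer the slow path */
	schedule_work(&mce_slow_work);
}

static DEFINE_PER_CPU(struct irq_work, mce_irq_work);

/* 1) NMI context: decide severity, then do nothing but queue the irq_work */
static void mce_report_event_sketch(void)
{
	irq_work_queue(&__get_cpu_var(mce_irq_work));
}

/* per-cpu init (elsewhere):
 *	init_irq_work(&per_cpu(mce_irq_work, cpu), mce_irq_work_cb);
 */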

 arch/x86/kernel/cpu/mcheck/mce.c |   35 ++++++++++++++++-------------------
 1 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 08363b0..91bb983 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1037,8 +1037,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	if (kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
-	/* notify userspace ASAP */
-	set_thread_flag(TIF_MCE_NOTIFY);
+	/* Trap this thread before returning to user, for action required */
+	if (worst == MCE_AR_SEVERITY)
+		set_thread_flag(TIF_MCE_NOTIFY);
 
 	if (worst > 0)
 		mce_report_event(regs);
@@ -1052,31 +1053,29 @@ EXPORT_SYMBOL_GPL(do_machine_check);
 /* dummy to break dependency. actual code is in mm/memory-failure.c */
 void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
 {
-	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
+	pr_err("Action optional memory failure at %lx ignored\n", pfn);
 }
 
 /*
- * Called after mce notification in process context. This code
- * is allowed to sleep. Call the high level VM handler to process
- * any corrupted pages.
- * Assume that the work queue code only calls this one at a time
- * per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
+ * Called in process context that interrupted by MCE and marked with
+ * TIF_MCE_NOTFY, just before returning to errorneous userland.
+ * This code is allowed to sleep.
+ * Attempt possible recovery such as calling the high level VM handler to
+ * process any corrupted pages, and kill/signal current process if required.
  */
 void mce_notify_process(void)
 {
-	unsigned long pfn;
-	mce_notify_irq();
-	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR);
+	clear_thread_flag(TIF_MCE_NOTIFY);
+
+	/* TBD: do recovery for action required event */
 }
 
 static void mce_process_work(struct work_struct *dummy)
 {
-	mce_notify_process();
+	unsigned long pfn;
+
+	while (mce_ring_get(&pfn))
+		memory_failure(pfn, MCE_VECTOR);
 }
 
 #ifdef CONFIG_X86_MCE_INTEL
@@ -1157,8 +1156,6 @@ int mce_notify_irq(void)
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
 
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
 	if (test_and_clear_bit(0, &mce_need_notify)) {
 		/* wake processes polling /dev/mcelog */
 		wake_up_interruptible(&mce_chrdev_wait);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
  2011-08-31 22:25 ` [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY Luck, Tony
@ 2011-08-31 22:25 ` Luck, Tony
  2011-09-09  2:23   ` huang ying
  2011-08-31 22:25 ` [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR Luck, Tony
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From:	"Luck, Tony" <tony.luck@intel.com>

From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

The basic flow of MCE handler is summarized as follows:
 1) from NMI context:
	check hardware error registers, determine error severity,
	and then panic or request non-NMI context by irq_work() to
	continue the system.
 2) from (irq) context:
	call non-NMI safe functions,
	wake up loggers and schedule work if required
 3) from worker thread:
	process some time-consuming works like memory poisoning.

The TIF_MCE_NOTIFY flag is a legacy mechanism that has been used to do
the tasks of 2) and 3) in the context of the thread interrupted by the
MCE.  However, irq_work() and the work queue are now sufficient for
these tasks, so this patch removes the duplicated work from
mce_notify_process().

As a result there is no longer any work to be done in the interrupted
context.  However, once SRAR (software recoverable action required)
errors are supported there will be some thread-specific handling for
the action-required case, so keep the flag for such possible future use
until a better mechanism is introduced.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

This is the combination of two patches by Seto-san - with Boris'
suggestion to not create a 2-line function mce_memory_failure_process()
but to leave those inline.
Message-ID: <4DFB1476.40804@jp.fujitsu.com>
[PATCH 6/8] x86, mce: introduce mce_memory_failure_process()
Message-ID: <4DFB1509.7020402@jp.fujitsu.com>
[PATCH 7/8] x86, mce: rework use of TIF_MCE_NOTIFY

 arch/x86/kernel/cpu/mcheck/mce.c |   35 ++++++++++++++++-------------------
 1 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 08363b0..91bb983 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1037,8 +1037,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	if (kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
-	/* notify userspace ASAP */
-	set_thread_flag(TIF_MCE_NOTIFY);
+	/* Trap this thread before returning to user, for action required */
+	if (worst == MCE_AR_SEVERITY)
+		set_thread_flag(TIF_MCE_NOTIFY);
 
 	if (worst > 0)
 		mce_report_event(regs);
@@ -1052,31 +1053,29 @@ EXPORT_SYMBOL_GPL(do_machine_check);
 /* dummy to break dependency. actual code is in mm/memory-failure.c */
 void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
 {
-	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
+	pr_err("Action optional memory failure at %lx ignored\n", pfn);
 }
 
 /*
- * Called after mce notification in process context. This code
- * is allowed to sleep. Call the high level VM handler to process
- * any corrupted pages.
- * Assume that the work queue code only calls this one at a time
- * per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
+ * Called in process context that interrupted by MCE and marked with
+ * TIF_MCE_NOTFY, just before returning to errorneous userland.
+ * This code is allowed to sleep.
+ * Attempt possible recovery such as calling the high level VM handler to
+ * process any corrupted pages, and kill/signal current process if required.
  */
 void mce_notify_process(void)
 {
-	unsigned long pfn;
-	mce_notify_irq();
-	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR);
+	clear_thread_flag(TIF_MCE_NOTIFY);
+
+	/* TBD: do recovery for action required event */
 }
 
 static void mce_process_work(struct work_struct *dummy)
 {
-	mce_notify_process();
+	unsigned long pfn;
+
+	while (mce_ring_get(&pfn))
+		memory_failure(pfn, MCE_VECTOR);
 }
 
 #ifdef CONFIG_X86_MCE_INTEL
@@ -1157,8 +1156,6 @@ int mce_notify_irq(void)
 	/* Not more than two messages every minute */
 	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
 
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
 	if (test_and_clear_bit(0, &mce_need_notify)) {
 		/* wake processes polling /dev/mcelog */
 		wake_up_interruptible(&mce_chrdev_wait);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
  2011-08-31 22:25 ` [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY Luck, Tony
  2011-08-31 22:25 ` Luck, Tony
@ 2011-08-31 22:25 ` Luck, Tony
  2011-09-05  9:19   ` Chen Gong
  2011-08-31 22:25 ` Luck, Tony
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From: Tony Luck <tony.luck@intel.com>

Move duplicate copies of the code that reads ADDR/MISC registers
to a function. Add masking code for systems that have undefined
low-order bits in the MCi_ADDR register.

Based on original code by Andi Kleen

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Andi originally posted this as two patches - one to move the common
code to the new function "mce_read_aux()", the second to add the
masking.
Seto-san objected to the masking on the grounds that the bits might
contain something useful - but after some thought, I agree with Andi
that it is better to drop undefined bits.
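
[Editor's note - illustration, not part of the patch.]  The masking in
mce_read_aux() simply rounds the reported address down to the
granularity encoded in the low bits of MCi_MISC: those bits give the
position of the least significant valid address bit, and everything
below it is cleared.  A toy userspace example with a hypothetical
granularity value:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t addr = 0x12345678abcdULL;	/* pretend MCi_ADDR value */
	unsigned int lsb = 12;			/* pretend granularity: 4KiB */

	/* same shift-down/shift-up trick as in mce_read_aux() */
	addr >>= lsb;
	addr <<= lsb;

	/* prints 0x12345678a000 - bits below bit 12 are dropped */
	printf("masked address: %#llx\n", (unsigned long long)addr);
	return 0;
}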

 arch/x86/kernel/cpu/mcheck/mce.c |   31 +++++++++++++++++++++++--------
 1 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 91bb983..1ce64c3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -490,6 +490,27 @@ static void mce_report_event(struct pt_regs *regs)
 	irq_work_queue(&__get_cpu_var(mce_irq_work));
 }
 
+/*
+ * Read ADDR and MISC registers.
+ */
+static void mce_read_aux(struct mce *m, int i)
+{
+	if (m->status & MCI_STATUS_MISCV)
+		m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
+	if (m->status & MCI_STATUS_ADDRV) {
+		m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+
+		/*
+		 * Mask the reported address by the reported granuality.
+		 */
+		if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
+			u8 shift = m->misc & 0x1f;
+			m->addr >>= shift;
+			m->addr <<= shift;
+		}
+	}
+}
+
 DEFINE_PER_CPU(unsigned, mce_poll_count);
 
 /*
@@ -540,10 +561,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		    (m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
 			continue;
 
-		if (m.status & MCI_STATUS_MISCV)
-			m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
-		if (m.status & MCI_STATUS_ADDRV)
-			m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+		mce_read_aux(&m, i);
 
 		if (!(flags & MCP_TIMESTAMP))
 			m.tsc = 0;
@@ -984,10 +1002,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (severity == MCE_AR_SEVERITY)
 			kill_it = 1;
 
-		if (m.status & MCI_STATUS_MISCV)
-			m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
-		if (m.status & MCI_STATUS_ADDRV)
-			m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+		mce_read_aux(&m, i);
 
 		/*
 		 * Action optional error. Queue address for later processing.
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (2 preceding siblings ...)
  2011-08-31 22:25 ` [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR Luck, Tony
@ 2011-08-31 22:25 ` Luck, Tony
  2011-08-31 22:25 ` [PATCH 3/5] HWPOISON: Handle hwpoison in current process Luck, Tony
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From:	"Luck, Tony" <tony.luck@intel.com>

From: Tony Luck <tony.luck@intel.com>

Move duplicate copies of the code that reads ADDR/MISC registers
to a function. Add masking code for systems that have undefined
low-order bits in the MCi_ADDR register.

Based on original code by Andi Kleen

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Andi originally posted this as two patches - one to move the common
code to the new function "mce_read_aux()", the second to add the
masking.
Seto-san objected to the masking on the grounds that the bits might
contain something useful - but after some thought, I agree with Andi
that it is better to drop undefined bits.

 arch/x86/kernel/cpu/mcheck/mce.c |   31 +++++++++++++++++++++++--------
 1 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 91bb983..1ce64c3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -490,6 +490,27 @@ static void mce_report_event(struct pt_regs *regs)
 	irq_work_queue(&__get_cpu_var(mce_irq_work));
 }
 
+/*
+ * Read ADDR and MISC registers.
+ */
+static void mce_read_aux(struct mce *m, int i)
+{
+	if (m->status & MCI_STATUS_MISCV)
+		m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
+	if (m->status & MCI_STATUS_ADDRV) {
+		m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+
+		/*
+		 * Mask the reported address by the reported granuality.
+		 */
+		if (mce_ser && (m->status & MCI_STATUS_MISCV)) {
+			u8 shift = m->misc & 0x1f;
+			m->addr >>= shift;
+			m->addr <<= shift;
+		}
+	}
+}
+
 DEFINE_PER_CPU(unsigned, mce_poll_count);
 
 /*
@@ -540,10 +561,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		    (m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
 			continue;
 
-		if (m.status & MCI_STATUS_MISCV)
-			m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
-		if (m.status & MCI_STATUS_ADDRV)
-			m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+		mce_read_aux(&m, i);
 
 		if (!(flags & MCP_TIMESTAMP))
 			m.tsc = 0;
@@ -984,10 +1002,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (severity == MCE_AR_SEVERITY)
 			kill_it = 1;
 
-		if (m.status & MCI_STATUS_MISCV)
-			m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
-		if (m.status & MCI_STATUS_ADDRV)
-			m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
+		mce_read_aux(&m, i);
 
 		/*
 		 * Action optional error. Queue address for later processing.
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/5] HWPOISON: Handle hwpoison in current process
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (3 preceding siblings ...)
  2011-08-31 22:25 ` Luck, Tony
@ 2011-08-31 22:25 ` Luck, Tony
  2011-09-07  5:47   ` Chen Gong
  2011-08-31 22:26 ` Luck, Tony
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From: Andi Kleen <andi@firstfloor.org>

When hardware poison handling targets the current process, use a
forced signal with _AR severity.

[Tony: changed some function names .. the "_ao" suffix was no longer meaningful]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
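
[Editor's note - illustration, not part of the patch.]  The "task level
recovery" this enables relies on the process catching SIGBUS with
SA_SIGINFO and looking at si_code.  A minimal userspace sketch, assuming
a libc that exposes BUS_MCEERR_AR and si_addr_lsb:

#define _GNU_SOURCE		/* may be needed for BUS_MCEERR_* on some libcs */
#include <signal.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *si, void *uctx)
{
	if (si->si_code == BUS_MCEERR_AR) {
		/*
		 * Poison consumed by this thread's own access: si_addr is
		 * the faulting address, 1 << si_addr_lsb the affected range.
		 * Recover (drop/reload the data) or exit; just returning
		 * would retry the same access and fault again.
		 */
		_exit(1);
	}
	/* BUS_MCEERR_AO: advisory - page is poisoned but not yet consumed */
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = sigbus_handler,
				.sa_flags = SA_SIGINFO };

	sigaction(SIGBUS, &sa, NULL);
	/* ... application work ... */
	return 0;
}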
 mm/memory-failure.c |   39 ++++++++++++++++++++++-----------------
 1 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2b43ba0..6ccb8a6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -186,33 +186,38 @@ int hwpoison_filter(struct page *p)
 EXPORT_SYMBOL_GPL(hwpoison_filter);
 
 /*
- * Send all the processes who have the page mapped an ``action optional''
- * signal.
+ * Send one process who has the page mapped a SIGBUS. It might
+ * be able to catch it and initiate its own task level recovery.
  */
-static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+static int user_recovery(struct task_struct *t, unsigned long addr, int trapno,
 			unsigned long pfn, struct page *page)
 {
 	struct siginfo si;
 	int ret;
 
 	printk(KERN_ERR
-		"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
-		pfn, t->comm, t->pid);
+		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+	       pfn, t->comm, t->pid);
 	si.si_signo = SIGBUS;
 	si.si_errno = 0;
-	si.si_code = BUS_MCEERR_AO;
 	si.si_addr = (void *)addr;
 #ifdef __ARCH_SI_TRAPNO
 	si.si_trapno = trapno;
 #endif
 	si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;
-	/*
-	 * Don't use force here, it's convenient if the signal
-	 * can be temporarily blocked.
-	 * This could cause a loop when the user sets SIGBUS
-	 * to SIG_IGN, but hopefully no one will do that?
-	 */
-	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (t == current) {
+		si.si_code = BUS_MCEERR_AR;
+		ret = force_sig_info(SIGBUS, &si, t);
+	} else {
+		/*
+		 * Don't use force here, it's convenient if the signal
+		 * can be temporarily blocked.
+		 * This could cause a loop when the user sets SIGBUS
+		 * to SIG_IGN, but hopefully noone will do that?
+		 */
+		si.si_code = BUS_MCEERR_AO;
+		ret = send_sig_info(SIGBUS, &si, t);
+	}
 	if (ret < 0)
 		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
 		       t->comm, t->pid, ret);
@@ -330,14 +335,14 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 }
 
 /*
- * Kill the processes that have been collected earlier.
+ * Signal the processes that have been collected earlier.
  *
  * Only do anything when DOIT is set, otherwise just free the list
  * (this is used for clean pages which do not need killing)
  * Also when FAIL is set do a force kill because something went
  * wrong earlier.
  */
-static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+static void kill_procs(struct list_head *to_kill, int doit, int trapno,
 			  int fail, struct page *page, unsigned long pfn)
 {
 	struct to_kill *tk, *next;
@@ -362,7 +367,7 @@ static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
 			 * check for that, but we need to tell the
 			 * process anyways.
 			 */
-			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+			else if (user_recovery(tk->tsk, tk->addr, trapno,
 					      pfn, page) < 0)
 				printk(KERN_ERR
 		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
@@ -961,7 +966,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * use a more force-full uncatchable kill to prevent
 	 * any accesses to the poisoned memory.
 	 */
-	kill_procs_ao(&tokill, !!PageDirty(ppage), trapno,
+	kill_procs(&tokill, !!PageDirty(ppage), trapno,
 		      ret != SWAP_SUCCESS, p, pfn);
 
 	return ret;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 3/5] HWPOISON: Handle hwpoison in current process
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (4 preceding siblings ...)
  2011-08-31 22:25 ` [PATCH 3/5] HWPOISON: Handle hwpoison in current process Luck, Tony
@ 2011-08-31 22:26 ` Luck, Tony
  2011-08-31 22:26 ` [PATCH 4/5] mce: remove TIF_MCE_NOTIFY Luck, Tony
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From:	"Luck, Tony" <tony.luck@intel.com>

From: Andi Kleen <andi@firstfloor.org>

When hardware poison handling targets the current process, use a
forced signal with _AR severity.

[Tony: changed some function names .. the "_ao" suffix was no longer meaningful]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 mm/memory-failure.c |   39 ++++++++++++++++++++++-----------------
 1 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2b43ba0..6ccb8a6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -186,33 +186,38 @@ int hwpoison_filter(struct page *p)
 EXPORT_SYMBOL_GPL(hwpoison_filter);
 
 /*
- * Send all the processes who have the page mapped an ``action optional''
- * signal.
+ * Send one process who has the page mapped a SIGBUS. It might
+ * be able to catch it and initiate its own task level recovery.
  */
-static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+static int user_recovery(struct task_struct *t, unsigned long addr, int trapno,
 			unsigned long pfn, struct page *page)
 {
 	struct siginfo si;
 	int ret;
 
 	printk(KERN_ERR
-		"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
-		pfn, t->comm, t->pid);
+		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+	       pfn, t->comm, t->pid);
 	si.si_signo = SIGBUS;
 	si.si_errno = 0;
-	si.si_code = BUS_MCEERR_AO;
 	si.si_addr = (void *)addr;
 #ifdef __ARCH_SI_TRAPNO
 	si.si_trapno = trapno;
 #endif
 	si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;
-	/*
-	 * Don't use force here, it's convenient if the signal
-	 * can be temporarily blocked.
-	 * This could cause a loop when the user sets SIGBUS
-	 * to SIG_IGN, but hopefully no one will do that?
-	 */
-	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (t == current) {
+		si.si_code = BUS_MCEERR_AR;
+		ret = force_sig_info(SIGBUS, &si, t);
+	} else {
+		/*
+		 * Don't use force here, it's convenient if the signal
+		 * can be temporarily blocked.
+		 * This could cause a loop when the user sets SIGBUS
+		 * to SIG_IGN, but hopefully noone will do that?
+		 */
+		si.si_code = BUS_MCEERR_AO;
+		ret = send_sig_info(SIGBUS, &si, t);
+	}
 	if (ret < 0)
 		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
 		       t->comm, t->pid, ret);
@@ -330,14 +335,14 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 }
 
 /*
- * Kill the processes that have been collected earlier.
+ * Signal the processes that have been collected earlier.
  *
  * Only do anything when DOIT is set, otherwise just free the list
  * (this is used for clean pages which do not need killing)
  * Also when FAIL is set do a force kill because something went
  * wrong earlier.
  */
-static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+static void kill_procs(struct list_head *to_kill, int doit, int trapno,
 			  int fail, struct page *page, unsigned long pfn)
 {
 	struct to_kill *tk, *next;
@@ -362,7 +367,7 @@ static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
 			 * check for that, but we need to tell the
 			 * process anyways.
 			 */
-			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+			else if (user_recovery(tk->tsk, tk->addr, trapno,
 					      pfn, page) < 0)
 				printk(KERN_ERR
 		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
@@ -961,7 +966,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * use a more force-full uncatchable kill to prevent
 	 * any accesses to the poisoned memory.
 	 */
-	kill_procs_ao(&tokill, !!PageDirty(ppage), trapno,
+	kill_procs(&tokill, !!PageDirty(ppage), trapno,
 		      ret != SWAP_SUCCESS, p, pfn);
 
 	return ret;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/5] mce: remove TIF_MCE_NOTIFY
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (5 preceding siblings ...)
  2011-08-31 22:26 ` Luck, Tony
@ 2011-08-31 22:26 ` Luck, Tony
  2011-09-07  9:23   ` Borislav Petkov
  2011-08-31 22:26 ` Luck, Tony
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From: Tony Luck <tony.luck@intel.com>

Hidetoshi Seto has re-worked the notification code to use a
worker thread for propagating notifications. So TIF_MCE_NOTIFY
is unused. Previous plans to use this to somehow switch from
"machine check context" to a more friendly environment to
deal with immediate errors look like they will never pan out.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/thread_info.h |    4 +---
 arch/x86/kernel/cpu/mcheck/mce.c   |   18 ------------------
 arch/x86/kernel/signal.c           |    6 ------
 3 files changed, 1 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a1fe5c1..9cfde9f 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -82,7 +82,6 @@ struct thread_info {
 #define TIF_SYSCALL_EMU		6	/* syscall emulation active */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
-#define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
@@ -105,7 +104,6 @@ struct thread_info {
 #define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
-#define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
@@ -140,7 +138,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |				\
 	 _TIF_USER_RETURN_NOTIFY)
 
 /* flags to check in __switch_to() */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1ce64c3..135e12d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1052,10 +1052,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	if (kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
-	/* Trap this thread before returning to user, for action required */
-	if (worst == MCE_AR_SEVERITY)
-		set_thread_flag(TIF_MCE_NOTIFY);
-
 	if (worst > 0)
 		mce_report_event(regs);
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
@@ -1071,20 +1067,6 @@ void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
 	pr_err("Action optional memory failure at %lx ignored\n", pfn);
 }
 
-/*
- * Called in process context that interrupted by MCE and marked with
- * TIF_MCE_NOTFY, just before returning to errorneous userland.
- * This code is allowed to sleep.
- * Attempt possible recovery such as calling the high level VM handler to
- * process any corrupted pages, and kill/signal current process if required.
- */
-void mce_notify_process(void)
-{
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
-	/* TBD: do recovery for action required event */
-}
-
 static void mce_process_work(struct work_struct *dummy)
 {
 	unsigned long pfn;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb2..306fd6d 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -814,12 +814,6 @@ static void do_signal(struct pt_regs *regs)
 void
 do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
-#ifdef CONFIG_X86_MCE
-	/* notify userspace of pending MCEs */
-	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_process();
-#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
-
 	/* deal with pending signal delivery */
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 4/5] mce: remove TIF_MCE_NOTIFY
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (6 preceding siblings ...)
  2011-08-31 22:26 ` [PATCH 4/5] mce: remove TIF_MCE_NOTIFY Luck, Tony
@ 2011-08-31 22:26 ` Luck, Tony
  2011-08-31 22:26 ` [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode Luck, Tony
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From:	"Luck, Tony" <tony.luck@intel.com>

From: Tony Luck <tony.luck@intel.com>

Hidetoshi Seto has re-worked the notification code to use a
worker thread for propagating notifications. So TIF_MCE_NOTIFY
is unused. Previous plans to use this to somehow switch from
"machine check context" to a more friendly environment to
deal with immediate errors look like they will never pan out.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/include/asm/thread_info.h |    4 +---
 arch/x86/kernel/cpu/mcheck/mce.c   |   18 ------------------
 arch/x86/kernel/signal.c           |    6 ------
 3 files changed, 1 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a1fe5c1..9cfde9f 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -82,7 +82,6 @@ struct thread_info {
 #define TIF_SYSCALL_EMU		6	/* syscall emulation active */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_SECCOMP		8	/* secure computing */
-#define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
@@ -105,7 +104,6 @@ struct thread_info {
 #define _TIF_SYSCALL_EMU	(1 << TIF_SYSCALL_EMU)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
-#define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
@@ -140,7 +138,7 @@ struct thread_info {
 
 /* Only used for 64 bit */
 #define _TIF_DO_NOTIFY_MASK						\
-	(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |				\
 	 _TIF_USER_RETURN_NOTIFY)
 
 /* flags to check in __switch_to() */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1ce64c3..135e12d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1052,10 +1052,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	if (kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
-	/* Trap this thread before returning to user, for action required */
-	if (worst == MCE_AR_SEVERITY)
-		set_thread_flag(TIF_MCE_NOTIFY);
-
 	if (worst > 0)
 		mce_report_event(regs);
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
@@ -1071,20 +1067,6 @@ void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
 	pr_err("Action optional memory failure at %lx ignored\n", pfn);
 }
 
-/*
- * Called in process context that interrupted by MCE and marked with
- * TIF_MCE_NOTFY, just before returning to errorneous userland.
- * This code is allowed to sleep.
- * Attempt possible recovery such as calling the high level VM handler to
- * process any corrupted pages, and kill/signal current process if required.
- */
-void mce_notify_process(void)
-{
-	clear_thread_flag(TIF_MCE_NOTIFY);
-
-	/* TBD: do recovery for action required event */
-}
-
 static void mce_process_work(struct work_struct *dummy)
 {
 	unsigned long pfn;
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb2..306fd6d 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -814,12 +814,6 @@ static void do_signal(struct pt_regs *regs)
 void
 do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
-#ifdef CONFIG_X86_MCE
-	/* notify userspace of pending MCEs */
-	if (thread_info_flags & _TIF_MCE_NOTIFY)
-		mce_notify_process();
-#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
-
 	/* deal with pending signal delivery */
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (7 preceding siblings ...)
  2011-08-31 22:26 ` Luck, Tony
@ 2011-08-31 22:26 ` Luck, Tony
  2011-09-07  6:05   ` Chen Gong
  2011-08-31 22:26 ` Luck, Tony
  2011-08-31 22:41 ` [PATCH 0/5] Yet another pass at machine check recovery Valdis.Kletnieks
  10 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From: Tony Luck <tony.luck@intel.com>

Add two new entries to the mce severity table. One notes that data
errors observed by innocent bystanders (who happen to share a machine
check bank with the cpu experiencing the error) should be left alone,
using the "KEEP" severity. The other marks user-mode data load errors
with MCE_AR_SEVERITY.

Then inline in the do_machine_check() handler we process the user-mode
data error that was marked with MCE_AR_SEVERITY.  Even though we are in
"machine check context" it is almost safe to do so. We have already
released all the other cpus from rendezvous and we know that the cpu
with the error was executing user code - so it cannot have interrupts
locked out, or hold any locks. I.e. this is almost equivalent to a
page fault. Only difference (and risk) is that on x86_64 we are still
on the machine check stack - so if another machine check arrives, we
are toast (we didn't clear MCG_STATUS - yet, so cpu will reset rather
than taking a nested machine check on the same stack).

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Using the "KEEP" state avoids the complexity of my earlier solution
that sorted the cpus by severity and ran the more serious ones first.
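
[Editor's note - illustration, not part of the patch.]  For readers
unfamiliar with the severity-table macros used below: MASK(x, y) sets
.mask and .result, BITSET(x) is MASK(x, x), and MCGMASK does the same
for MCG_STATUS; an entry matches when the masked register bits equal
the expected value.  A simplified stand-in for the matching rule (the
real mce_severity() also checks SER capability, user/kernel context and
the tolerant level):

#include <linux/types.h>

struct sev_entry {		/* minimal stand-in for struct severity */
	u64 mask, result;	/* compared against MCi_STATUS */
	u64 mcgmask, mcgres;	/* compared against MCG_STATUS */
	int sev;		/* severity returned on a match */
};

/* First table entry whose masked bits match wins; its .sev is used. */
static int sev_entry_matches(const struct sev_entry *s,
			     u64 status, u64 mcgstatus)
{
	return (status & s->mask) == s->result &&
	       (mcgstatus & s->mcgmask) == s->mcgres;
}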

 arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 ++++++++++-
 arch/x86/kernel/cpu/mcheck/mce.c          |   35 ++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 7395d5f..c4d8b24 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -54,6 +54,7 @@ static struct severity {
 #define  MASK(x, y)	.mask = x, .result = y
 #define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
 #define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
+#define	MCI_ADDR (MCI_STATUS_ADDRV|MCI_STATUS_MISCV)
 #define MCACOD 0xffff
 
 	MCESEV(
@@ -102,11 +103,22 @@ static struct severity {
 		SER, BITCLR(MCI_STATUS_S)
 		),
 
-	/* AR add known MCACODs here */
 	MCESEV(
 		PANIC, "Action required with lost events",
 		SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
 		),
+
+	/* known AR MCACODs: */
+	MCESEV(
+		KEEP, "HT thread notices Action required: data load error",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+		MCGMASK(MCG_STATUS_EIPV, 0)
+		),
+	MCESEV(
+		AR, "Action required: data load error",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+		USER
+		),
 	MCESEV(
 		PANIC, "Action required: unknown MCACOD",
 		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 135e12d..2c59a34 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -996,12 +996,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			continue;
 		}
 
-		/*
-		 * Kill on action required.
-		 */
-		if (severity == MCE_AR_SEVERITY)
-			kill_it = 1;
-
 		mce_read_aux(&m, i);
 
 		/*
@@ -1022,6 +1016,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		}
 	}
 
+	m = *final;
+
 	if (!no_way_out)
 		mce_clear_state(toclear);
 
@@ -1040,7 +1036,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * support MCE broadcasting or it has been disabled.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Fatal machine check on current CPU", final, msg);
+		mce_panic("Fatal machine check on current CPU", &m, msg);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -1049,11 +1045,24 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * high, don't try to do anything at all.
 	 */
 
-	if (kill_it && tolerant < 3)
+	if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
 	if (worst > 0)
 		mce_report_event(regs);
+
+	if (worst == MCE_AR_SEVERITY) {
+		unsigned long pfn = m.addr >> PAGE_SHIFT;
+
+		pr_err("Uncorrected hardware memory error in user-access at %llx",
+			m.addr);
+		if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
+			pr_err("Memory error not recovered");
+			force_sig(SIGBUS, current);
+		} else
+			pr_err("Memory error recovered");
+	}
+
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
 	atomic_dec(&mce_entry);
@@ -1061,12 +1070,18 @@ out:
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
-/* dummy to break dependency. actual code is in mm/memory-failure.c */
-void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
+#ifndef CONFIG_MEMORY_FAILURE
+void memory_failure(unsigned long pfn, int vector)
 {
 	pr_err("Action optional memory failure at %lx ignored\n", pfn);
 }
 
+int __memory_failure(unsigned long pfn, int trapno, int flags)
+{
+	return -ENXIO;
+}
+#endif
+
 static void mce_process_work(struct work_struct *dummy)
 {
 	unsigned long pfn;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (8 preceding siblings ...)
  2011-08-31 22:26 ` [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode Luck, Tony
@ 2011-08-31 22:26 ` Luck, Tony
  2011-08-31 22:41 ` [PATCH 0/5] Yet another pass at machine check recovery Valdis.Kletnieks
  10 siblings, 0 replies; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:26 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ingo Molnar, Borislav Petkov, Hidetoshi Seto

From:	"Luck, Tony" <tony.luck@intel.com>

From: Tony Luck <tony.luck@intel.com>

Add two new entries to the mce severity table. One notes that data
errors observed by innocent bystanders (who happen to share a machine
check bank with the cpu experiencing the error) should be left alone,
using the "KEEP" severity. The other marks user-mode data load errors
with MCE_AR_SEVERITY.

Then inline in the do_machine_check() handler we process the user-mode
data error that was marked with MCE_AR_SEVERITY.  Even though we are in
"machine check context" it is almost safe to do so. We have already
released all the other cpus from rendezvous and we know that the cpu
with the error was executing user code - so it cannot have interrupts
locked out, or hold any locks. I.e. this is almost equivalent to a
page fault. Only difference (and risk) is that on x86_64 we are still
on the machine check stack - so if another machine check arrives, we
are toast (we didn't clear MCG_STATUS - yet, so cpu will reset rather
than taking a nested machine check on the same stack).

Signed-off-by: Tony Luck <tony.luck@intel.com>
---

Using the "KEEP" state avoids the complexity of my earlier solution
that sorted the cpus by severity and ran the more serious ones first.

 arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 ++++++++++-
 arch/x86/kernel/cpu/mcheck/mce.c          |   35 ++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 7395d5f..c4d8b24 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -54,6 +54,7 @@ static struct severity {
 #define  MASK(x, y)	.mask = x, .result = y
 #define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
 #define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
+#define	MCI_ADDR (MCI_STATUS_ADDRV|MCI_STATUS_MISCV)
 #define MCACOD 0xffff
 
 	MCESEV(
@@ -102,11 +103,22 @@ static struct severity {
 		SER, BITCLR(MCI_STATUS_S)
 		),
 
-	/* AR add known MCACODs here */
 	MCESEV(
 		PANIC, "Action required with lost events",
 		SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
 		),
+
+	/* known AR MCACODs: */
+	MCESEV(
+		KEEP, "HT thread notices Action required: data load error",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+		MCGMASK(MCG_STATUS_EIPV, 0)
+		),
+	MCESEV(
+		AR, "Action required: data load error",
+		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
+		USER
+		),
 	MCESEV(
 		PANIC, "Action required: unknown MCACOD",
 		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 135e12d..2c59a34 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -996,12 +996,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			continue;
 		}
 
-		/*
-		 * Kill on action required.
-		 */
-		if (severity == MCE_AR_SEVERITY)
-			kill_it = 1;
-
 		mce_read_aux(&m, i);
 
 		/*
@@ -1022,6 +1016,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		}
 	}
 
+	m = *final;
+
 	if (!no_way_out)
 		mce_clear_state(toclear);
 
@@ -1040,7 +1036,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * support MCE broadcasting or it has been disabled.
 	 */
 	if (no_way_out && tolerant < 3)
-		mce_panic("Fatal machine check on current CPU", final, msg);
+		mce_panic("Fatal machine check on current CPU", &m, msg);
 
 	/*
 	 * If the error seems to be unrecoverable, something should be
@@ -1049,11 +1045,24 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	 * high, don't try to do anything at all.
 	 */
 
-	if (kill_it && tolerant < 3)
+	if (worst != MCE_AR_SEVERITY && kill_it && tolerant < 3)
 		force_sig(SIGBUS, current);
 
 	if (worst > 0)
 		mce_report_event(regs);
+
+	if (worst == MCE_AR_SEVERITY) {
+		unsigned long pfn = m.addr >> PAGE_SHIFT;
+
+		pr_err("Uncorrected hardware memory error in user-access at %llx",
+			m.addr);
+		if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
+			pr_err("Memory error not recovered");
+			force_sig(SIGBUS, current);
+		} else
+			pr_err("Memory error recovered");
+	}
+
 	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
 out:
 	atomic_dec(&mce_entry);
@@ -1061,12 +1070,18 @@ out:
 }
 EXPORT_SYMBOL_GPL(do_machine_check);
 
-/* dummy to break dependency. actual code is in mm/memory-failure.c */
-void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
+#ifndef CONFIG_MEMORY_FAILURE
+void memory_failure(unsigned long pfn, int vector)
 {
 	pr_err("Action optional memory failure at %lx ignored\n", pfn);
 }
 
+int __memory_failure(unsigned long pfn, int trapno, int flags)
+{
+	return -ENXIO;
+}
+#endif
+
 static void mce_process_work(struct work_struct *dummy)
 {
 	unsigned long pfn;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 0/5] Yet another pass at machine check recovery
  2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
                   ` (9 preceding siblings ...)
  2011-08-31 22:26 ` Luck, Tony
@ 2011-08-31 22:41 ` Valdis.Kletnieks
  2011-08-31 22:54   ` Luck, Tony
  10 siblings, 1 reply; 25+ messages in thread
From: Valdis.Kletnieks @ 2011-08-31 22:41 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On Wed, 31 Aug 2011 15:21:37 PDT, "Luck, Tony" said:
> Scaled down ambitions ... this version just goes for recovery of
> data errors detected in user mode.

Two slightly different versions showed up here almost on top of each other.
Which one should we be looking at? Or do you want us to ignore both and you do
another one with REPOST in the subject?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 0/5] Yet another pass at machine check recovery
  2011-08-31 22:41 ` [PATCH 0/5] Yet another pass at machine check recovery Valdis.Kletnieks
@ 2011-08-31 22:54   ` Luck, Tony
  0 siblings, 0 replies; 25+ messages in thread
From: Luck, Tony @ 2011-08-31 22:54 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

> Two slightly different versions showed up here almost on top of each other.
> Which one should we be looking at? Or do you want us to ignore both and you do
> another one with REPOST in the subject?

My script that added "in-reply-to" headers to parts 1-5, to make them
thread from part 0, got a little over-enthusiastic and started
recursing on its own output.

The second version is identical (patchwise) to the first - it just
has a bogus "From: "Luck, Tony" line added.

I don't think I need to clutter the list with a repost.

Sorry for the noise and confusion - the bad script has been
told to go sit in the corner with a paper bag on its head.

-Tony


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR
  2011-08-31 22:25 ` [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR Luck, Tony
@ 2011-09-05  9:19   ` Chen Gong
  2011-09-06 20:15     ` Luck, Tony
  0 siblings, 1 reply; 25+ messages in thread
From: Chen Gong @ 2011-09-05  9:19 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On 2011/9/1 6:25, Luck, Tony wrote:
> From: Tony Luck<tony.luck@intel.com>
>
> Move duplicate copies of the code that reads ADDR/MISC registers
> to a function. Add masking code for systems that have undefined
> low-order bits in the MCi_ADDR register.
>
> Based on original code by Andi Kleen
>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> ---
>
> Andi originally posted this as two patches - one to move the common
> code to the new function "mce_read_aux()", the second to add the
> masking.
> Seto-san objected to the masking on the grounds that the bits might
> contain something useful - but after some thought, I agree with Andi
> that it is better to drop undefined bits.
>
>   arch/x86/kernel/cpu/mcheck/mce.c |   31 +++++++++++++++++++++++--------
>   1 files changed, 23 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 91bb983..1ce64c3 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -490,6 +490,27 @@ static void mce_report_event(struct pt_regs *regs)
>   	irq_work_queue(&__get_cpu_var(mce_irq_work));
>   }
>
> +/*
> + * Read ADDR and MISC registers.
> + */
> +static void mce_read_aux(struct mce *m, int i)
> +{
> +	if (m->status&  MCI_STATUS_MISCV)
> +		m->misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
> +	if (m->status&  MCI_STATUS_ADDRV) {
> +		m->addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
> +
> +		/*
> +		 * Mask the reported address by the reported granuality.
> +		 */
> +		if (mce_ser&&  (m->status&  MCI_STATUS_MISCV)) {
> +			u8 shift = m->misc&  0x1f;

According to the SDM, this should be "m->misc & 0x3f"


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR
  2011-09-05  9:19   ` Chen Gong
@ 2011-09-06 20:15     ` Luck, Tony
  0 siblings, 0 replies; 25+ messages in thread
From: Luck, Tony @ 2011-09-06 20:15 UTC (permalink / raw)
  To: Chen Gong; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

>> +		if (mce_ser&&  (m->status&  MCI_STATUS_MISCV)) {
>> +			u8 shift = m->misc&  0x1f;
>
> According to SDM, here it should be "m->misc & 0x3f"

Correct. You, or somebody pointed this out before ... and I managed to
drop the fix while merging & rebasing ... will try to get it applied
this time.

Thanks

-Tony


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] HWPOISON: Handle hwpoison in current process
  2011-08-31 22:25 ` [PATCH 3/5] HWPOISON: Handle hwpoison in current process Luck, Tony
@ 2011-09-07  5:47   ` Chen Gong
  0 siblings, 0 replies; 25+ messages in thread
From: Chen Gong @ 2011-09-07  5:47 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On 2011/9/1 6:25, Luck, Tony wrote:
> From: Andi Kleen<andi@firstfloor.org>
>
> When hardware poison handles the current process use
> a forced signal with _AR severity.
>
> [Tony: changed some function names .. the "_ao" suffix was no longer meaningful]
>
> Signed-off-by: Andi Kleen<ak@linux.intel.com>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> ---
>   mm/memory-failure.c |   39 ++++++++++++++++++++++-----------------
>   1 files changed, 22 insertions(+), 17 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2b43ba0..6ccb8a6 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -186,33 +186,38 @@ int hwpoison_filter(struct page *p)
>   EXPORT_SYMBOL_GPL(hwpoison_filter);
>
>   /*
> - * Send all the processes who have the page mapped an ``action optional''
> - * signal.
> + * Send one process who has the page mapped a SIGBUS. It might
> + * be able to catch it and initiate its own task level recovery.
>    */
> -static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
> +static int user_recovery(struct task_struct *t, unsigned long addr, int trapno,
>   			unsigned long pfn, struct page *page)
>   {
>   	struct siginfo si;
>   	int ret;
>
>   	printk(KERN_ERR
> -		"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
> -		pfn, t->comm, t->pid);
> +		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
> +	       pfn, t->comm, t->pid);
>   	si.si_signo = SIGBUS;
>   	si.si_errno = 0;
> -	si.si_code = BUS_MCEERR_AO;
>   	si.si_addr = (void *)addr;
>   #ifdef __ARCH_SI_TRAPNO
>   	si.si_trapno = trapno;
>   #endif
>   	si.si_addr_lsb = compound_trans_order(compound_head(page)) + PAGE_SHIFT;
> -	/*
> -	 * Don't use force here, it's convenient if the signal
> -	 * can be temporarily blocked.
> -	 * This could cause a loop when the user sets SIGBUS
> -	 * to SIG_IGN, but hopefully no one will do that?
> -	 */
> -	ret = send_sig_info(SIGBUS,&si, t);  /* synchronous? */
> +	if (t == current) {
> +		si.si_code = BUS_MCEERR_AR;
> +		ret = force_sig_info(SIGBUS,&si, t);
> +	} else {
> +		/*
> +		 * Don't use force here, it's convenient if the signal
> +		 * can be temporarily blocked.
> +		 * This could cause a loop when the user sets SIGBUS
> +		 * to SIG_IGN, but hopefully noone will do that?
                                              ^^^^^
spelling error


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-08-31 22:26 ` [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode Luck, Tony
@ 2011-09-07  6:05   ` Chen Gong
  2011-09-07 13:25     ` Borislav Petkov
  2011-09-08  3:05     ` Minskey Guo
  0 siblings, 2 replies; 25+ messages in thread
From: Chen Gong @ 2011-09-07  6:05 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On 2011/9/1 6:26, Luck, Tony wrote:
> From: Tony Luck<tony.luck@intel.com>
>
> Two new entries in the mce severity table - one notes that data errors
> observed by innocent bystanders (who happen to share a machine check
> bank with the cpu experiencing the error) should be left alone by using
> the "KEEP" severity.
>
> Then inline in the do_machine_check() handler we process the user-mode
> data error that was marked at MCE_AR_SEVERITY.  Even though we are in
> "machine check context" it is almost safe to do so. We have already
> released all the other cpus from rendezvous and we know that the cpu
> with the error was executing user code - so it cannot have interrupts
> locked out, or hold any locks. I.e. this is almost equivalent to a
> page fault. Only difference (and risk) is that on x86_64 we are still
> on the machine check stack - so if another machine check arrives, we
> are toast (we didn't clear MCG_STATUS - yet, so cpu will reset rather
> than taking a nested machine check on the same stack).
>
> Signed-off-by: Tony Luck<tony.luck@intel.com>
> ---
>
> Using the "KEEP" state avoids the complexity of my earlier solution
> that sorted the cpus by severity and ran the more serious ones first.
>
>   arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 ++++++++++-
>   arch/x86/kernel/cpu/mcheck/mce.c          |   35 ++++++++++++++++++++--------
>   2 files changed, 38 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> index 7395d5f..c4d8b24 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> @@ -54,6 +54,7 @@ static struct severity {
>   #define  MASK(x, y)	.mask = x, .result = y
>   #define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
>   #define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
> +#define	MCI_ADDR (MCI_STATUS_ADDRV|MCI_STATUS_MISCV)
>   #define MCACOD 0xffff
>
>   	MCESEV(
> @@ -102,11 +103,22 @@ static struct severity {
>   		SER, BITCLR(MCI_STATUS_S)
>   		),
>
> -	/* AR add known MCACODs here */
>   	MCESEV(
>   		PANIC, "Action required with lost events",
>   		SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
>   		),
> +
> +	/* known AR MCACODs: */
> +	MCESEV(
> +		KEEP, "HT thread notices Action required: data load error",
> +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> +		MCGMASK(MCG_STATUS_EIPV, 0)
> +		),
> +	MCESEV(
> +		AR, "Action required: data load error",
> +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> +		USER
> +		),

I don't think *AR* makes sense here, because the code that follows
assumes it means a *user space* condition. If a new *AR* severity for
kernel use is created in the future, we can't distinguish which one may
call "memory_failure" as below. At least it should have a suffix such
as AR_USER/AR_KERN:

enum severity_level {
         MCE_NO_SEVERITY,
         MCE_KEEP_SEVERITY,
         MCE_SOME_SEVERITY,
         MCE_AO_SEVERITY,
         MCE_UC_SEVERITY,
         MCE_AR_USER_SEVERITY,
	MCE_AR_KERN_SEVERITY,
         MCE_PANIC_SEVERITY,
};
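
Just to illustrate the idea (this is only a sketch; AR_USER would map to
MCE_AR_USER_SEVERITY via the MCESEV() macro, and these entries are made
up, not part of Tony's patch), the severity table could then be split
by context:

	MCESEV(
		AR_USER, "Action required: data load error in user process",
		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
		USER
		),
	MCESEV(
		PANIC, "Action required: data load error in kernel context",
		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
		KERNEL
		),

and do_machine_check() would only call memory_failure for the
MCE_AR_USER_SEVERITY case.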


>   	MCESEV(
>   		PANIC, "Action required: unknown MCACOD",
>   		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR)
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 135e12d..2c59a34 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -996,12 +996,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   			continue;
>   		}
>
> -		/*
> -		 * Kill on action required.
> -		 */
> -		if (severity == MCE_AR_SEVERITY)
> -			kill_it = 1;
> -
>   		mce_read_aux(&m, i);
>
>   		/*
> @@ -1022,6 +1016,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   		}
>   	}
>
> +	m = *final;
> +
>   	if (!no_way_out)
>   		mce_clear_state(toclear);
>
> @@ -1040,7 +1036,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   	 * support MCE broadcasting or it has been disabled.
>   	 */
>   	if (no_way_out&&  tolerant<  3)
> -		mce_panic("Fatal machine check on current CPU", final, msg);
> +		mce_panic("Fatal machine check on current CPU",&m, msg);
>
>   	/*
>   	 * If the error seems to be unrecoverable, something should be
> @@ -1049,11 +1045,24 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>   	 * high, don't try to do anything at all.
>   	 */
>
> -	if (kill_it&&  tolerant<  3)
> +	if (worst != MCE_AR_SEVERITY&&  kill_it&&  tolerant<  3)
>   		force_sig(SIGBUS, current);
>
>   	if (worst>  0)
>   		mce_report_event(regs);
> +
> +	if (worst == MCE_AR_SEVERITY) {
> +		unsigned long pfn = m.addr>>  PAGE_SHIFT;
> +
> +		pr_err("Uncorrected hardware memory error in user-access at %llx",
> +			m.addr);

A printk in the MCE handler may cause a deadlock? Say other CPUs are
printing something when they suddenly receive the MCE broadcast from the
Monarch CPU; when the Monarch CPU then runs the code above, a deadlock
happens? Please correct me if I miss something :-)

> +		if (__memory_failure(pfn, MCE_VECTOR, 0)<  0) {
> +			pr_err("Memory error not recovered");
> +			force_sig(SIGBUS, current);
> +		} else
> +			pr_err("Memory error recovered");
> +	}

As you mentioned in the comment, the biggest concern is that when
__memory_failure runs too long, if another MCE happens at the same time
(assuming this MCE happens on a sibling CPU which shares the same
banks), the 2nd MCE will crash the system. Why not delay the processing
to a safer context, such as using a user_return_notifier?
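
For example, a rough sketch of what I mean (completely untested; the
helper names and per-cpu variables here are invented, and whether
registering a notifier from #MC context is actually safe would need
checking):

	#include <linux/user-return-notifier.h>

	static DEFINE_PER_CPU(struct user_return_notifier, mce_urn);
	static DEFINE_PER_CPU(unsigned long, mce_bad_pfn);

	static void mce_do_recovery(struct user_return_notifier *urn)
	{
		unsigned long pfn = __get_cpu_var(mce_bad_pfn);

		user_return_notifier_unregister(urn);
		if (__memory_failure(pfn, MCE_VECTOR, 0) < 0)
			force_sig(SIGBUS, current);
	}

	/* called from do_machine_check() instead of __memory_failure() */
	static void mce_defer_recovery(unsigned long pfn)
	{
		struct user_return_notifier *urn = &__get_cpu_var(mce_urn);

		__get_cpu_var(mce_bad_pfn) = pfn;
		urn->on_user_return = mce_do_recovery;
		user_return_notifier_register(urn);
	}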

> +
>   	mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
>   out:
>   	atomic_dec(&mce_entry);
> @@ -1061,12 +1070,18 @@ out:
>   }
>   EXPORT_SYMBOL_GPL(do_machine_check);
>
> -/* dummy to break dependency. actual code is in mm/memory-failure.c */
> -void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
> +#ifndef CONFIG_MEMORY_FAILURE
> +void memory_failure(unsigned long pfn, int vector)
>   {
>   	pr_err("Action optional memory failure at %lx ignored\n", pfn);
>   }
>
> +int __memory_failure(unsigned long pfn, int trapno, int flags)
> +{
> +	return -ENXIO;
> +}
> +#endif
> +
>   static void mce_process_work(struct work_struct *dummy)
>   {
>   	unsigned long pfn;


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY
  2011-08-31 22:25 ` [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY Luck, Tony
@ 2011-09-07  9:11   ` Borislav Petkov
  0 siblings, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2011-09-07  9:11 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On Wed, Aug 31, 2011 at 06:25:34PM -0400, Luck, Tony wrote:
> From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

Looks ok, just minor nitpicks and spelling fixes below.

> The basic flow of MCE handler is summarized as follows:
>  1) from NMI context:
> 	check hardware error registers, determine error severity,
> 	and then panic or request non-NMI context by irq_work() to
> 	continue the system.
>  2) from (irq) context:
> 	call non-NMI safe functions,
> 	wake up loggers and schedule work if required
>  3) from worker thread:

	from process context:

> 	process some time-consuming works like memory poisoning.

Scrub paragraph style:

> TIF_MCE_NOTIFY flag is relatively legacy and have used to do tasks of

TIF_MCE_NOTIFY is a legacy flag and has been used to do 2) and 3) in

> 2) and 3) on the thread context that interrupted by MCE.  However now

the context of the process which got interrupted by an MCE.

> use of irq_work() and work-queue is enough for these tasks, so this
> patch removes duplicated tasks in mce_notify_process().

Now that irq_work()/workqueues are used for those tasks, remove
duplicated work from mce_notify_process().

> As the result there is no task to be done in the interrupted context,

	       , there is no work to be done in the interrupted task's context

> but soon if SRAR is supported there would be some thread-specific thing

								    handling of Action Required errors.

> for action required.  So keep the flag for such possible future use,
> until better mechanism is introduced.
> 
> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> This is the combination of two patches by Seto-san - with Boris'
> suggestion to not create a 2-line function mce_memory_failure_process()
> but to leave those inline.
> Message-ID: <4DFB1476.40804@jp.fujitsu.com>
> [PATCH 6/8] x86, mce: introduce mce_memory_failure_process()
> Message-ID: <4DFB1509.7020402@jp.fujitsu.com>
> [PATCH 7/8] x86, mce: rework use of TIF_MCE_NOTIFY

You need a double "Link: http://lkml.kernel.org/r/..." tag here.

> 
>  arch/x86/kernel/cpu/mcheck/mce.c |   35 ++++++++++++++++-------------------
>  1 files changed, 16 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 08363b0..91bb983 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1037,8 +1037,9 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	if (kill_it && tolerant < 3)
>  		force_sig(SIGBUS, current);
>  
> -	/* notify userspace ASAP */
> -	set_thread_flag(TIF_MCE_NOTIFY);
> +	/* Trap this thread before returning to user, for action required */
> +	if (worst == MCE_AR_SEVERITY)
> +		set_thread_flag(TIF_MCE_NOTIFY);
>  
>  	if (worst > 0)
>  		mce_report_event(regs);
> @@ -1052,31 +1053,29 @@ EXPORT_SYMBOL_GPL(do_machine_check);
>  /* dummy to break dependency. actual code is in mm/memory-failure.c */
>  void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
>  {
> -	printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
> +	pr_err("Action optional memory failure at %lx ignored\n", pfn);
>  }
>  
>  /*
> - * Called after mce notification in process context. This code
> - * is allowed to sleep. Call the high level VM handler to process
> - * any corrupted pages.
> - * Assume that the work queue code only calls this one at a time
> - * per CPU.
> - * Note we don't disable preemption, so this code might run on the wrong
> - * CPU. In this case the event is picked up by the scheduled work queue.
> - * This is merely a fast path to expedite processing in some common
> - * cases.
> + * Called in process context that interrupted by MCE and marked with
> + * TIF_MCE_NOTFY, just before returning to errorneous userland.

Polish expression:

"Called in process context which got interrupted by an MCE and marked
with TIF_MCE_NOTIFY, just before..."

> + * This code is allowed to sleep.
> + * Attempt possible recovery such as calling the high level VM handler to
> + * process any corrupted pages, and kill/signal current process if required.

This last sentence needs to go over mce_process_work() below.

>   */
>  void mce_notify_process(void)
>  {
> -	unsigned long pfn;
> -	mce_notify_irq();
> -	while (mce_ring_get(&pfn))
> -		memory_failure(pfn, MCE_VECTOR);
> +	clear_thread_flag(TIF_MCE_NOTIFY);
> +
> +	/* TBD: do recovery for action required event */
>  }
>  
>  static void mce_process_work(struct work_struct *dummy)
>  {
> -	mce_notify_process();
> +	unsigned long pfn;
> +
> +	while (mce_ring_get(&pfn))
> +		memory_failure(pfn, MCE_VECTOR);
>  }
>  
>  #ifdef CONFIG_X86_MCE_INTEL
> @@ -1157,8 +1156,6 @@ int mce_notify_irq(void)
>  	/* Not more than two messages every minute */
>  	static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
>  
> -	clear_thread_flag(TIF_MCE_NOTIFY);
> -
>  	if (test_and_clear_bit(0, &mce_need_notify)) {
>  		/* wake processes polling /dev/mcelog */
>  		wake_up_interruptible(&mce_chrdev_wait);

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] mce: remove TIF_MCE_NOTIFY
  2011-08-31 22:26 ` [PATCH 4/5] mce: remove TIF_MCE_NOTIFY Luck, Tony
@ 2011-09-07  9:23   ` Borislav Petkov
  0 siblings, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2011-09-07  9:23 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On Wed, Aug 31, 2011 at 06:26:05PM -0400, Luck, Tony wrote:
> From: Tony Luck <tony.luck@intel.com>
> 
> Hidetoshi Seto has re-worked the notification code to use a
> worker thread for propagating notifications. So TIF_MCE_NOTIFY
> is unused. Previous plans to use this to somehow switch from
> "machine check context" to a more friendly environment to
> deal with immediate errors look like they will never pan out.

Err, I don't get it: patch 1/5 says:

"So keep the [TIF_MCE_NOTIFY] flag for such possible future use, until
better mechanism is introduced."

and here you remove it. AFAIR, last time we agreed on the irq_work ->
realtime recovery thread approach so why not drop this flag in patch 1/5
and forget this one.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-09-07  6:05   ` Chen Gong
@ 2011-09-07 13:25     ` Borislav Petkov
  2011-09-07 13:50       ` Chen Gong
  2011-09-08  3:05     ` Minskey Guo
  1 sibling, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2011-09-07 13:25 UTC (permalink / raw)
  To: Chen Gong
  Cc: Luck, Tony, linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On Wed, Sep 07, 2011 at 02:05:38AM -0400, Chen Gong wrote:

[..]

> > +	/* known AR MCACODs: */
> > +	MCESEV(
> > +		KEEP, "HT thread notices Action required: data load error",
> > +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> > +		MCGMASK(MCG_STATUS_EIPV, 0)
> > +		),
> > +	MCESEV(
> > +		AR, "Action required: data load error",
> > +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
> > +		USER
> > +		),
> 
> I don't think *AR* makes sense here because the following code has an
> assumption that it means a *user space* condition. If, in the future, a
> new *AR* severity for kernel usage is created, we can't distinguish
> which one can call "memory_failure" as below. At least, it should have a
> suffix such as AR_USER/AR_KERN:
> 
> enum severity_level {
>          MCE_NO_SEVERITY,
>          MCE_KEEP_SEVERITY,
>          MCE_SOME_SEVERITY,
>          MCE_AO_SEVERITY,
>          MCE_UC_SEVERITY,
>          MCE_AR_USER_SEVERITY,
> 	MCE_AR_KERN_SEVERITY,
>          MCE_PANIC_SEVERITY,
> };

Are you saying you need action required handling for when the data load
error happens in kernel space? If so, I don't see how you can replay the
data load (assuming this is a data load from DRAM). In that case, the
error is fatal and we need to panic. If it is a different type of data
load coming from a lower cache level, then we might be able to recover...?

[..]

> > +	if (worst == MCE_AR_SEVERITY) {
> > +		unsigned long pfn = m.addr>>  PAGE_SHIFT;
> > +
> > +		pr_err("Uncorrected hardware memory error in user-access at %llx",
> > +			m.addr);
> 
> A printk in the MCE handler may cause a deadlock? Say other CPUs are
> printing something when they suddenly receive the MCE broadcast from the
> Monarch CPU; when the Monarch CPU then runs the code above, a deadlock
> happens? Please correct me if I miss something :-)

sounds like it can happen if the other CPUs have grabbed some console
semaphore/mutex (I don't know what exactly we're using there) and the
monarch tries to grab it.

> > +		if (__memory_failure(pfn, MCE_VECTOR, 0)<  0) {
> > +			pr_err("Memory error not recovered");
> > +			force_sig(SIGBUS, current);
> > +		} else
> > +			pr_err("Memory error recovered");
> > +	}
>
> As you mentioned in the comment, the biggest concern is that when
> __memory_failure runs too long, if another MCE happens at the same time
> (assuming this MCE happens on a sibling CPU which shares the same
> banks), the 2nd MCE will crash the system. Why not delay the processing
> to a safer context, such as using a user_return_notifier?

The user_return_notifier won't work, as we concluded in the last
discussion round: http://marc.info/?l=linux-kernel&m=130765542330349

AFAIR, we want to have a realtime thread dealing with that recovery
so that we exit #MC context as fast as possible. The code then should
be able to deal with a follow-up #MC. Tony, whatever happened to that
approach?
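
Something along these lines is what I have in mind - a totally untested
sketch, all names invented; the existing irq_work callback in mce.c
would simply wake the thread so we leave #MC context immediately:

	#include <linux/kthread.h>
	#include <linux/sched.h>
	#include <linux/err.h>

	static struct task_struct *mce_recovery_task;

	static int mce_recovery_thread(void *unused)
	{
		unsigned long pfn;

		set_current_state(TASK_INTERRUPTIBLE);
		while (!kthread_should_stop()) {
			/* drain the pfn ring filled by the #MC handler */
			while (mce_ring_get(&pfn))
				__memory_failure(pfn, MCE_VECTOR, 0);
			schedule();
			set_current_state(TASK_INTERRUPTIBLE);
		}
		return 0;
	}

	static int __init mce_start_recovery_thread(void)
	{
		struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

		mce_recovery_task = kthread_run(mce_recovery_thread, NULL, "mce_recover");
		if (IS_ERR(mce_recovery_task))
			return PTR_ERR(mce_recovery_task);
		sched_setscheduler(mce_recovery_task, SCHED_FIFO, &param);
		return 0;
	}

The irq_work handler would then just do
wake_up_process(mce_recovery_task).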

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-09-07 13:25     ` Borislav Petkov
@ 2011-09-07 13:50       ` Chen Gong
  0 siblings, 0 replies; 25+ messages in thread
From: Chen Gong @ 2011-09-07 13:50 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-kernel, Ingo Molnar, Hidetoshi Seto

On 9/7/2011 9:25 PM, Borislav Petkov wrote:
> On Wed, Sep 07, 2011 at 02:05:38AM -0400, Chen Gong wrote:
>
> [..]
>
>>> +	/* known AR MCACODs: */
>>> +	MCESEV(
>>> +		KEEP, "HT thread notices Action required: data load error",
>>> +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
>>> +		MCGMASK(MCG_STATUS_EIPV, 0)
>>> +		),
>>> +	MCESEV(
>>> +		AR, "Action required: data load error",
>>> +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|0x0134),
>>> +		USER
>>> +		),
>>
>> I don't think *AR* makes sense here because the following code has an
>> assumption that it means a *user space* condition. If, in the future, a
>> new *AR* severity for kernel usage is created, we can't distinguish
>> which one can call "memory_failure" as below. At least, it should have a
>> suffix such as AR_USER/AR_KERN:
>>
>> enum severity_level {
>>           MCE_NO_SEVERITY,
>>           MCE_KEEP_SEVERITY,
>>           MCE_SOME_SEVERITY,
>>           MCE_AO_SEVERITY,
>>           MCE_UC_SEVERITY,
>>           MCE_AR_USER_SEVERITY,
>> 	MCE_AR_KERN_SEVERITY,
>>           MCE_PANIC_SEVERITY,
>> };
>
> Are you saying you need action required handling for when the data load
> error happens in kernel space? If so, I don't see how you can replay the
> data load (assuming this is a data load from DRAM). In that case, we're
> fatal and need to panic. If it is a different type of data load coming
> from a lower cache level, then we could be able to recover...?
>
> [..]
>

Yep, what I'm talking about is a data load error in kernel space. In
fact, I'm not sure what we can do except panic :-). IIRC, Tony once said
that in some situations the kernel can be recovered. If that is true, we
must distinguish these two different scenarios: in the *user space* case
memory_failure can be called, but in the kernel case it can't.

>>> +	if (worst == MCE_AR_SEVERITY) {
>>> +		unsigned long pfn = m.addr>>   PAGE_SHIFT;
>>> +
>>> +		pr_err("Uncorrected hardware memory error in user-access at %llx",
>>> +			m.addr);
>>
>> A printk in the MCE handler may cause a deadlock? Say other CPUs are
>> printing something when they suddenly receive the MCE broadcast from the
>> Monarch CPU; when the Monarch CPU then runs the code above, a deadlock
>> happens? Please correct me if I miss something :-)
>
> sounds like it can happen if the other CPUs have grabbed some console
> semaphore/mutex (I don't know what exactly we're using there) and the
> monarch tries to grab it.
>
>>> +		if (__memory_failure(pfn, MCE_VECTOR, 0)<   0) {
>>> +			pr_err("Memory error not recovered");
>>> +			force_sig(SIGBUS, current);
>>> +		} else
>>> +			pr_err("Memory error recovered");
>>> +	}
>>
>> As you mentioned in the comment, the biggest concern is that when
>> __memory_failure runs too long, if another MCE happens at the same time
>> (assuming this MCE happens on a sibling CPU which shares the same
>> banks), the 2nd MCE will crash the system. Why not delay the processing
>> to a safer context, such as using a user_return_notifier?
>
> The user_return_notifier won't work, as we concluded in the last
> discussion round: http://marc.info/?l=linux-kernel&m=130765542330349
>
> AFAIR, we want to have a realtime thread dealing with that recovery
> so that we exit #MC context as fast as possible. The code then should
> be able to deal with a follow-up #MC. Tony, whatever happened to that
> approach?
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-09-07  6:05   ` Chen Gong
  2011-09-07 13:25     ` Borislav Petkov
@ 2011-09-08  3:05     ` Minskey Guo
  2011-09-08  5:16       ` Luck, Tony
  1 sibling, 1 reply; 25+ messages in thread
From: Minskey Guo @ 2011-09-08  3:05 UTC (permalink / raw)
  To: Chen Gong
  Cc: Luck, Tony, linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto



On 09/07/2011 02:05 PM, Chen Gong wrote:
On 2011/9/1 6:26, Luck, Tony wrote:
>> From: Tony Luck<tony.luck@intel.com>
>>
>> Two new entries in the mce severity table - one notes that data errors
>> observed by innocent bystanders (who happen to share a machine check
>> bank with the cpu experiencing the error) should be left alone by using
>> the "KEEP" severity.
>>
>> Then inline in the do_machine_check() handler we process the user-mode
>> data error that was marked at MCE_AR_SEVERITY.  Even though we are in
>> "machine check context" it is almost safe to do so. We have already
>> released all the other cpus from rendezvous and we know that the cpu
>> with the error was executing user code - so it cannot have interrupts
>> locked out, or hold any locks. I.e. this is almost equivalent to a
>> page fault. Only difference (and risk) is that on x86_64 we are still
>> on the machine check stack - so if another machine check arrives, we
>> are toast (we didn't clear MCG_STATUS - yet, so cpu will reset rather
>> than taking a nested machine check on the same stack).
>>
>> Signed-off-by: Tony Luck<tony.luck@intel.com>
>> ---
>>
>> Using the "KEEP" state avoids the complexity of my earlier solution
>> that sorted the cpus by severity and ran the more serious ones first.
>>
>>   arch/x86/kernel/cpu/mcheck/mce-severity.c |   14 ++++++++++-
>>   arch/x86/kernel/cpu/mcheck/mce.c          |   35 
>> ++++++++++++++++++++--------
>>   2 files changed, 38 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c 
>> b/arch/x86/kernel/cpu/mcheck/mce-severity.c
>> index 7395d5f..c4d8b24 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
>> +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
>> @@ -54,6 +54,7 @@ static struct severity {
>>   #define  MASK(x, y)    .mask = x, .result = y
>>   #define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
>>   #define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
>> +#define    MCI_ADDR (MCI_STATUS_ADDRV|MCI_STATUS_MISCV)
>>   #define MCACOD 0xffff
>>
>>       MCESEV(
>> @@ -102,11 +103,22 @@ static struct severity {
>>           SER, BITCLR(MCI_STATUS_S)
>>           ),
>>
>> -    /* AR add known MCACODs here */
>>       MCESEV(
>>           PANIC, "Action required with lost events",
>>           SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
>>           ),
>> +
>> +    /* known AR MCACODs: */
>> +    MCESEV(
>> +        KEEP, "HT thread notices Action required: data load error",
>> +        SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, 
>> MCI_UC_SAR|MCI_ADDR|0x0134),
>> +        MCGMASK(MCG_STATUS_EIPV, 0)
>> +        ),
>> +    MCESEV(
>> +        AR, "Action required: data load error",
>> +        SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, 
>> MCI_UC_SAR|MCI_ADDR|0x0134),
>> +        USER
>> +        ),
>
> I don't think *AR* makes sense here because the following code has an
> assumption that it means a *user space* condition. If, in the future, a
> new *AR* severity for kernel usage is created, we can't distinguish
> which one can call "memory_failure" as below. At least, it should have a
> suffix such as AR_USER/AR_KERN:
>
> enum severity_level {
>         MCE_NO_SEVERITY,
>         MCE_KEEP_SEVERITY,
>         MCE_SOME_SEVERITY,
>         MCE_AO_SEVERITY,
>         MCE_UC_SEVERITY,
>         MCE_AR_USER_SEVERITY,
>     MCE_AR_KERN_SEVERITY,
>         MCE_PANIC_SEVERITY,
> };
>
>
>>       MCESEV(
>>           PANIC, "Action required: unknown MCACOD",
>>           SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR)
>> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c 
>> b/arch/x86/kernel/cpu/mcheck/mce.c
>> index 135e12d..2c59a34 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce.c
>> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
>> @@ -996,12 +996,6 @@ void do_machine_check(struct pt_regs *regs, long 
>> error_code)
>>               continue;
>>           }
>>
>> -        /*
>> -         * Kill on action required.
>> -         */
>> -        if (severity == MCE_AR_SEVERITY)
>> -            kill_it = 1;
>> -
>>           mce_read_aux(&m, i);
>>
>>           /*
>> @@ -1022,6 +1016,8 @@ void do_machine_check(struct pt_regs *regs, 
>> long error_code)
>>           }
>>       }
>>
>> +    m = *final;
>> +
>>       if (!no_way_out)
>>           mce_clear_state(toclear);
>>
>> @@ -1040,7 +1036,7 @@ void do_machine_check(struct pt_regs *regs, 
>> long error_code)
>>        * support MCE broadcasting or it has been disabled.
>>        */
>>       if (no_way_out&&  tolerant<  3)
>> -        mce_panic("Fatal machine check on current CPU", final, msg);
>> +        mce_panic("Fatal machine check on current CPU",&m, msg);
>>
>>       /*
>>        * If the error seems to be unrecoverable, something should be
>> @@ -1049,11 +1045,24 @@ void do_machine_check(struct pt_regs *regs, 
>> long error_code)
>>        * high, don't try to do anything at all.
>>        */
>>
>> -    if (kill_it&&  tolerant<  3)
>> +    if (worst != MCE_AR_SEVERITY&&  kill_it&&  tolerant<  3)
>>           force_sig(SIGBUS, current);
>>
>>       if (worst>  0)
>>           mce_report_event(regs);
>> +
>> +    if (worst == MCE_AR_SEVERITY) {
>> +        unsigned long pfn = m.addr>>  PAGE_SHIFT;
>> +
>> +        pr_err("Uncorrected hardware memory error in user-access at 
>> %llx",
>> +            m.addr);
>
> A printk in the MCE handler may cause a deadlock? Say other CPUs are
> printing something when they suddenly receive the MCE broadcast from the
> Monarch CPU; when the Monarch CPU then runs the code above, a deadlock
> happens? Please correct me if I miss something :-)
>
>> +        if (__memory_failure(pfn, MCE_VECTOR, 0)<  0) {
>> +            pr_err("Memory error not recovered");
>> +            force_sig(SIGBUS, current);
>> +        } else
>> +            pr_err("Memory error recovered");
>> +    }
>
> As you mentioned in the comment, the biggest concern is that when
> __memory_failure runs too long, if another MCE happens at the same time
> (assuming this MCE happens on a sibling CPU which shares the same
> banks), the 2nd MCE will crash the system. Why not delay the processing
> to a safer context, such as using a user_return_notifier?
>

Besides, I somewhat suspect that calling __memory_failure()
in do_machine_check() will cause a deadlock.

__memory_failure() calls some routines, such as
is_free_buddy_page(), which need to acquire the spin
lock zone->lock. How can we guarantee that other CPUs
haven't acquired the lock when receiving the #mc broadcast
and entering the #mc handlers?

Moreover, there are many printk calls in __memory_failure()
which can also cause a deadlock.


-minskey


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-09-08  3:05     ` Minskey Guo
@ 2011-09-08  5:16       ` Luck, Tony
  2011-09-08  9:25         ` Minskey Guo
  0 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2011-09-08  5:16 UTC (permalink / raw)
  To: Minskey Guo, Chen Gong
  Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto


> __memory_failure() calls some routines, such as
> is_free_buddy_page(), which need to acquire the spin
> lock zone->lock. How can we guarantee that other CPUs
> haven't acquired the lock when receiving the #mc broadcast
> and entering the #mc handlers?

By the time I call __memory_failure() - the other cpus have
been released from mce handler - so they are back executing
normal code.

But Chen Gong's earlier comments made me look again at the entry_64.S
code - and I realized that I had missed the code in the return
path from do_machine_check() that switches from the MCE stack to the
regular kernel stack before processing TIF_MCE_NOTIFY.

I may go back and revisit an approach that I looked at: change
do_machine_check() from a "void" return to "unsigned long" and have
it return the address for the "AR" case and 0 otherwise.
Then we could switch out of the machine check stack to non-MCE
context to call __memory_failure().  When I looked at this
before, the entry_64.S path looked plausible. The 32-bit
path looked painful (too many macros in entry_32.S).
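
Roughly, the C side could look like this (untested sketch - the helper
name is invented, and the entry_64.S plumbing that switches stacks and
calls it is the hard part):

	/*
	 * do_machine_check() returns m.addr for the user-mode AR case and 0
	 * otherwise; entry_64.S would call this helper after switching back
	 * to the regular kernel stack.
	 */
	void mce_recover_user_address(unsigned long addr)
	{
		unsigned long pfn = addr >> PAGE_SHIFT;

		pr_err("Uncorrected hardware memory error in user-access at %lx", addr);
		if (__memory_failure(pfn, MCE_VECTOR, 0) < 0) {
			pr_err("Memory error not recovered");
			force_sig(SIGBUS, current);
		} else
			pr_err("Memory error recovered");
	}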


-Tony

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode
  2011-09-08  5:16       ` Luck, Tony
@ 2011-09-08  9:25         ` Minskey Guo
  0 siblings, 0 replies; 25+ messages in thread
From: Minskey Guo @ 2011-09-08  9:25 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Chen Gong, linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On 09/08/2011 01:16 PM, Luck, Tony wrote:
>> __memory_failure() calls some routines, such as
>> is_free_buddy_page(), which need to acquire the spin
>> lock zone->lock. How can we guarantee that other CPUs
>> haven't acquired the lock when receiving the #mc broadcast
>> and entering the #mc handlers?
> By the time I call __memory_failure() - the other cpus have
> been released from mce handler - so they are back executing
> normal code.
Oh, yes,  I just realized that mce_end() released other
cpus. So, printk/lock is not an issue here.


> But Chen Gong's earlier comments made me look again at entry_64.S
> code - ane I realized that I missed seeing code in the return
> path from do_machine_check() that switched from MCE stack to
> regular kernel stack before processing TIF_MCE_NOTIFY.
>
> I may go back and re-visit a path that I looked at to change
> do_machine_check from "void" return to "unsigned long" and have
> it return the address for the "AR" case and "0" otherwise.
> Then we could switch out of machine check stack to non-mce
> context to call __memory_failure().  When I looked at this
> before the entry_64.S path looked plausible. The 32-bit
> path looked to be painful (too many macros in entry_32.S)
>
Why do you plan to switch out of the machine check stack when
calling __memory_failure() in do_machine_check()?  What are the
benefits?

thanks
-minskey


> -Tony


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY
  2011-08-31 22:25 ` Luck, Tony
@ 2011-09-09  2:23   ` huang ying
  0 siblings, 0 replies; 25+ messages in thread
From: huang ying @ 2011-09-09  2:23 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, Ingo Molnar, Borislav Petkov, Hidetoshi Seto

On Thu, Sep 1, 2011 at 6:25 AM, Luck, Tony <tony.luck@intel.com> wrote:
> From:   "Luck, Tony" <tony.luck@intel.com>
>
> From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
>
> The basic flow of MCE handler is summarized as follows:
>  1) from NMI context:
>        check hardware error registers, determine error severity,
>        and then panic or request non-NMI context by irq_work() to
>        continue the system.
>  2) from (irq) context:
>        call non-NMI safe functions,
>        wake up loggers and schedule work if required
>  3) from worker thread:
>        process some time-consuming works like memory poisoning.
>
> TIF_MCE_NOTIFY flag is relatively legacy and have used to do tasks of
> 2) and 3) on the thread context that interrupted by MCE.  However now
> use of irq_work() and work-queue is enough for these tasks, so this
> patch removes duplicated tasks in mce_notify_process().

Sorry for the late reply.

Why do you think a work-queue is enough for these tasks?  The work-queue
thread may have low priority, but the hardware error should be
processed as soon as possible for better error containment.  So I
think it is still a good idea to process the error before returning to
user space.  That is like scheduling the work directly before
returning to user space.  We can use the "return to user notifier" to
implement this.

Best Regards,
Huang Ying

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-09-09  2:23 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-31 22:21 [PATCH 0/5] Yet another pass at machine check recovery Luck, Tony
2011-08-31 22:25 ` [PATCH 1/5] x86, mce: rework use of TIF_MCE_NOTIFY Luck, Tony
2011-09-07  9:11   ` Borislav Petkov
2011-08-31 22:25 ` Luck, Tony
2011-09-09  2:23   ` huang ying
2011-08-31 22:25 ` [PATCH 2/5] mce: mask out undefined bits from MCi_ADDR Luck, Tony
2011-09-05  9:19   ` Chen Gong
2011-09-06 20:15     ` Luck, Tony
2011-08-31 22:25 ` Luck, Tony
2011-08-31 22:25 ` [PATCH 3/5] HWPOISON: Handle hwpoison in current process Luck, Tony
2011-09-07  5:47   ` Chen Gong
2011-08-31 22:26 ` Luck, Tony
2011-08-31 22:26 ` [PATCH 4/5] mce: remove TIF_MCE_NOTIFY Luck, Tony
2011-09-07  9:23   ` Borislav Petkov
2011-08-31 22:26 ` Luck, Tony
2011-08-31 22:26 ` [PATCH 5/5] mce: recover from "action required" errors reported in data path in usermode Luck, Tony
2011-09-07  6:05   ` Chen Gong
2011-09-07 13:25     ` Borislav Petkov
2011-09-07 13:50       ` Chen Gong
2011-09-08  3:05     ` Minskey Guo
2011-09-08  5:16       ` Luck, Tony
2011-09-08  9:25         ` Minskey Guo
2011-08-31 22:26 ` Luck, Tony
2011-08-31 22:41 ` [PATCH 0/5] Yet another pass at machine check recovery Valdis.Kletnieks
2011-08-31 22:54   ` Luck, Tony
