linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/4] Revisit MCE handling for UE Errors
@ 2017-09-12  4:38 Balbir Singh
  2017-09-12  4:38 ` [PATCH v1 1/4] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Balbir Singh @ 2017-09-12  4:38 UTC (permalink / raw)
  To: mpe; +Cc: npiggin, mahesh, linuxppc-dev, Balbir Singh

This patch series is designed to hook up memory_failure on
UE errors, this is specially helpful for user_mode UE errors.

The first patch is a cleanup patch, it removes dead code.
I could not find any users of get_mce_fault_addr().
The second patch walks kernel/user mode page tables in
real mode to extract the effective address of the instruction
that caused the UE error and the effective address it was
trying to access (for load/store). The third patch hooks
up the pfn for instruction UE errors (ierror).

The fourth patch hooks up memory_failure to the MCE patch.

TODO:
Log the address in NVRAM, so that we can recover from
bad pages at boot and keep the blacklist persistent.

Changelog v2:
	- address review comments from Nick and Mahesh
	(initialization of pfn and more comments on failure
	when addr_to_pfn() or anaylse_instr() fail)
	- Hookup ierrors to the framework as well
	(comments from Mahesh)

Balbir Singh (4):
  powerpc/mce.c: Remove unused function get_mce_fault_addr()
  powerpc/mce: Hookup derror (load/store) UE errors
  powerpc/mce: Hookup ierror (instruction) UE errors
  powerpc/mce: hookup memory_failure for UE errors

 arch/powerpc/include/asm/mce.h  |   4 +-
 arch/powerpc/kernel/mce.c       | 108 ++++++++++++++++++++++++----------------
 arch/powerpc/kernel/mce_power.c | 105 +++++++++++++++++++++++++++++++++++---
 3 files changed, 163 insertions(+), 54 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v1 1/4] powerpc/mce.c: Remove unused function get_mce_fault_addr()
  2017-09-12  4:38 [PATCH v1 0/4] Revisit MCE handling for UE Errors Balbir Singh
@ 2017-09-12  4:38 ` Balbir Singh
  2017-09-12  4:38 ` [PATCH v1 2/4] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2017-09-12  4:38 UTC (permalink / raw)
  To: mpe; +Cc: npiggin, mahesh, linuxppc-dev, Balbir Singh

There are no users of get_mce_fault_addr()

Fixes: b63a0ff ("powerpc/powernv: Machine check exception handling.")

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/include/asm/mce.h |  2 --
 arch/powerpc/kernel/mce.c      | 39 ---------------------------------------
 2 files changed, 41 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 190d69a..75292c7 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -210,6 +210,4 @@ extern void release_mce_event(void);
 extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
 					   bool user_mode);
-extern uint64_t get_mce_fault_addr(struct machine_check_event *evt);
-
 #endif /* __ASM_PPC64_MCE_H__ */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 9b2ea7e..e254399 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -411,45 +411,6 @@ void machine_check_print_event_info(struct machine_check_event *evt,
 }
 EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 
-uint64_t get_mce_fault_addr(struct machine_check_event *evt)
-{
-	switch (evt->error_type) {
-	case MCE_ERROR_TYPE_UE:
-		if (evt->u.ue_error.effective_address_provided)
-			return evt->u.ue_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_SLB:
-		if (evt->u.slb_error.effective_address_provided)
-			return evt->u.slb_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_ERAT:
-		if (evt->u.erat_error.effective_address_provided)
-			return evt->u.erat_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_TLB:
-		if (evt->u.tlb_error.effective_address_provided)
-			return evt->u.tlb_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_USER:
-		if (evt->u.user_error.effective_address_provided)
-			return evt->u.user_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_RA:
-		if (evt->u.ra_error.effective_address_provided)
-			return evt->u.ra_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_LINK:
-		if (evt->u.link_error.effective_address_provided)
-			return evt->u.link_error.effective_address;
-		break;
-	default:
-	case MCE_ERROR_TYPE_UNKNOWN:
-		break;
-	}
-	return 0;
-}
-EXPORT_SYMBOL(get_mce_fault_addr);
-
 /*
  * This function is called in real mode. Strictly no printk's please.
  *
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v1 2/4] powerpc/mce: Hookup derror (load/store) UE errors
  2017-09-12  4:38 [PATCH v1 0/4] Revisit MCE handling for UE Errors Balbir Singh
  2017-09-12  4:38 ` [PATCH v1 1/4] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
@ 2017-09-12  4:38 ` Balbir Singh
  2017-09-12  4:38 ` [PATCH v1 3/4] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2017-09-12  4:38 UTC (permalink / raw)
  To: mpe; +Cc: npiggin, mahesh, linuxppc-dev, Balbir Singh

Extract physical_address for UE errors by walking the page
tables for the mm and address at the NIP, to extract the
instruction. Then use the instruction to find the effective
address via analyse_instr().

We might have page table walking races, but we expect them to
be rare, the physical address extraction is best effort. The idea
is to then hook up this infrastructure to memory failure eventually.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/include/asm/mce.h  |  2 +-
 arch/powerpc/kernel/mce.c       |  6 ++-
 arch/powerpc/kernel/mce_power.c | 82 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 75292c7..3a1226e 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -204,7 +204,7 @@ struct mce_error_info {
 
 extern void save_mce_event(struct pt_regs *regs, long handled,
 			   struct mce_error_info *mce_err, uint64_t nip,
-			   uint64_t addr);
+			   uint64_t addr, uint64_t phys_addr);
 extern int get_mce_event(struct machine_check_event *mce, bool release);
 extern void release_mce_event(void);
 extern void machine_check_queue_event(void);
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index e254399..f41a75d 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -82,7 +82,7 @@ static void mce_set_error_info(struct machine_check_event *mce,
  */
 void save_mce_event(struct pt_regs *regs, long handled,
 		    struct mce_error_info *mce_err,
-		    uint64_t nip, uint64_t addr)
+		    uint64_t nip, uint64_t addr, uint64_t phys_addr)
 {
 	int index = __this_cpu_inc_return(mce_nest_count) - 1;
 	struct machine_check_event *mce = this_cpu_ptr(&mce_event[index]);
@@ -140,6 +140,10 @@ void save_mce_event(struct pt_regs *regs, long handled,
 	} else if (mce->error_type == MCE_ERROR_TYPE_UE) {
 		mce->u.ue_error.effective_address_provided = true;
 		mce->u.ue_error.effective_address = addr;
+		if (phys_addr != ULONG_MAX) {
+			mce->u.ue_error.physical_address_provided = true;
+			mce->u.ue_error.physical_address = phys_addr;
+		}
 	}
 	return;
 }
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index b76ca19..956dadc 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,29 @@
 #include <asm/mmu.h>
 #include <asm/mce.h>
 #include <asm/machdep.h>
+#include <asm/pgtable.h>
+#include <asm/pte-walk.h>
+#include <asm/sstep.h>
+
+/*
+ * Convert an address related to an mm to a PFN. NOTE: we are in real
+ * mode, we could potentially race with page table updates.
+ */
+static unsigned long addr_to_pfn(struct mm_struct *mm, unsigned long addr)
+{
+	pte_t *ptep;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (mm == current->mm)
+		ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
+	else
+		ptep = find_init_mm_pte(addr, NULL);
+	local_irq_restore(flags);
+	if (!ptep || pte_special(*ptep))
+		return ULONG_MAX;
+	return pte_pfn(*ptep);
+}
 
 static void flush_tlb_206(unsigned int num_sets, unsigned int action)
 {
@@ -421,6 +444,50 @@ static const struct mce_derror_table mce_p9_derror_table[] = {
   MCE_INITIATOR_CPU,   MCE_SEV_ERROR_SYNC, },
 { 0, false, 0, 0, 0, 0 } };
 
+static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
+					uint64_t *phys_addr)
+{
+	/*
+	 * Carefully look at the NIP to determine
+	 * the instruction to analyse. Reading the NIP
+	 * in real-mode is tricky and can lead to recursive
+	 * faults
+	 */
+	int instr;
+	struct mm_struct *mm;
+	unsigned long nip = regs->nip;
+	unsigned long pfn, instr_addr;
+	struct instruction_op op;
+	struct pt_regs tmp = *regs;
+
+	if (user_mode(regs))
+		mm = current->mm;
+	else
+		mm = &init_mm;
+
+	pfn = addr_to_pfn(mm, nip);
+	if (pfn != ULONG_MAX) {
+		instr_addr = (pfn << PAGE_SHIFT) + (nip & ~PAGE_MASK);
+		instr = *(unsigned int *)(instr_addr);
+		if (!analyse_instr(&op, &tmp, instr)) {
+			pfn = addr_to_pfn(mm, op.ea);
+			*addr = op.ea;
+			*phys_addr = pfn;
+			return 0;
+		}
+		/*
+		 * analyse_instr() might fail if the instruction
+		 * is not a load/store, although this is unexpected
+		 * for load/store errors or if we got the NIP
+		 * wrong
+		 */
+		pr_warn("failed to analyse instr %x\n", instr);
+	}
+	pr_warn("failed to get PFN for nip %lx\n", regs->nip);
+	*addr = 0;
+	return -1;
+}
+
 static int mce_handle_ierror(struct pt_regs *regs,
 		const struct mce_ierror_table table[],
 		struct mce_error_info *mce_err, uint64_t *addr)
@@ -489,7 +556,8 @@ static int mce_handle_ierror(struct pt_regs *regs,
 
 static int mce_handle_derror(struct pt_regs *regs,
 		const struct mce_derror_table table[],
-		struct mce_error_info *mce_err, uint64_t *addr)
+		struct mce_error_info *mce_err, uint64_t *addr,
+		uint64_t *phys_addr)
 {
 	uint64_t dsisr = regs->dsisr;
 	int handled = 0;
@@ -555,7 +623,10 @@ static int mce_handle_derror(struct pt_regs *regs,
 		mce_err->initiator = table[i].initiator;
 		if (table[i].dar_valid)
 			*addr = regs->dar;
-
+		else if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
+				table[i].error_type == MCE_ERROR_TYPE_UE) {
+			mce_find_instr_ea_and_pfn(regs, addr, phys_addr);
+		}
 		found = 1;
 	}
 
@@ -592,19 +663,20 @@ static long mce_handle_error(struct pt_regs *regs,
 		const struct mce_ierror_table itable[])
 {
 	struct mce_error_info mce_err = { 0 };
-	uint64_t addr;
+	uint64_t addr, phys_addr;
 	uint64_t srr1 = regs->msr;
 	long handled;
 
 	if (SRR1_MC_LOADSTORE(srr1))
-		handled = mce_handle_derror(regs, dtable, &mce_err, &addr);
+		handled = mce_handle_derror(regs, dtable, &mce_err, &addr,
+				&phys_addr);
 	else
 		handled = mce_handle_ierror(regs, itable, &mce_err, &addr);
 
 	if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
 		handled = mce_handle_ue_error(regs);
 
-	save_mce_event(regs, handled, &mce_err, regs->nip, addr);
+	save_mce_event(regs, handled, &mce_err, regs->nip, addr, phys_addr);
 
 	return handled;
 }
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v1 3/4] powerpc/mce: Hookup ierror (instruction) UE errors
  2017-09-12  4:38 [PATCH v1 0/4] Revisit MCE handling for UE Errors Balbir Singh
  2017-09-12  4:38 ` [PATCH v1 1/4] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
  2017-09-12  4:38 ` [PATCH v1 2/4] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
@ 2017-09-12  4:38 ` Balbir Singh
  2017-09-12  4:38 ` [PATCH v1 4/4] powerpc/mce: hookup memory_failure for " Balbir Singh
  2017-09-12  5:03 ` [PATCH v1 0/4] Revisit MCE handling for UE Errors Nicholas Piggin
  4 siblings, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2017-09-12  4:38 UTC (permalink / raw)
  To: mpe; +Cc: npiggin, mahesh, linuxppc-dev, Balbir Singh

Hookup instruction errors (UE) for memory offling via memory_failure()
in a manner similar to load/store errors (derror). Since we have access
to the NIP, the conversion is a one step process in this case.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/kernel/mce_power.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 956dadc..e0a7b1c 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -490,7 +490,8 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
 
 static int mce_handle_ierror(struct pt_regs *regs,
 		const struct mce_ierror_table table[],
-		struct mce_error_info *mce_err, uint64_t *addr)
+		struct mce_error_info *mce_err, uint64_t *addr,
+		uint64_t *phys_addr)
 {
 	uint64_t srr1 = regs->msr;
 	int handled = 0;
@@ -542,8 +543,23 @@ static int mce_handle_ierror(struct pt_regs *regs,
 		}
 		mce_err->severity = table[i].severity;
 		mce_err->initiator = table[i].initiator;
-		if (table[i].nip_valid)
+		if (table[i].nip_valid) {
 			*addr = regs->nip;
+			if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
+				table[i].error_type == MCE_ERROR_TYPE_UE) {
+				unsigned long pfn;
+				struct mm_struct *mm;
+
+				if (user_mode(regs))
+					mm = current->mm;
+				else
+					mm = &init_mm;
+
+				pfn = addr_to_pfn(mm, regs->nip);
+				if (pfn != ULONG_MAX)
+					*phys_addr = pfn;
+			}
+		}
 		return handled;
 	}
 
@@ -671,7 +687,8 @@ static long mce_handle_error(struct pt_regs *regs,
 		handled = mce_handle_derror(regs, dtable, &mce_err, &addr,
 				&phys_addr);
 	else
-		handled = mce_handle_ierror(regs, itable, &mce_err, &addr);
+		handled = mce_handle_ierror(regs, itable, &mce_err, &addr,
+				&phys_addr);
 
 	if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
 		handled = mce_handle_ue_error(regs);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v1 4/4] powerpc/mce: hookup memory_failure for UE errors
  2017-09-12  4:38 [PATCH v1 0/4] Revisit MCE handling for UE Errors Balbir Singh
                   ` (2 preceding siblings ...)
  2017-09-12  4:38 ` [PATCH v1 3/4] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
@ 2017-09-12  4:38 ` Balbir Singh
  2017-09-12  5:03 ` [PATCH v1 0/4] Revisit MCE handling for UE Errors Nicholas Piggin
  4 siblings, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2017-09-12  4:38 UTC (permalink / raw)
  To: mpe; +Cc: npiggin, mahesh, linuxppc-dev, Balbir Singh

If we are in user space and hit a UE error, we now have the
basic infrastructure to walk the page tables and find out
the effective address that was accessed, since the DAR
is not valid.

We use a work_queue content to hookup the bad pfn, any
other context causes problems, since memory_failure itself
can call into schedule() via lru_drain_ bits.

We could probably poison the struct page to avoid a race
between detection and taking corrective action.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/kernel/mce.c | 63 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index f41a75d..d9be9e4 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -39,11 +39,21 @@ static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT], mce_event);
 static DEFINE_PER_CPU(int, mce_queue_count);
 static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT], mce_event_queue);
 
+/* Queue for delayed MCE UE events. */
+static DEFINE_PER_CPU(int, mce_ue_count);
+static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
+					mce_ue_event_queue);
+
 static void machine_check_process_queued_event(struct irq_work *work);
+void machine_check_ue_event(struct machine_check_event *evt);
+static void machine_process_ue_event(struct work_struct *work);
+
 static struct irq_work mce_event_process_work = {
         .func = machine_check_process_queued_event,
 };
 
+DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
+
 static void mce_set_error_info(struct machine_check_event *mce,
 			       struct mce_error_info *mce_err)
 {
@@ -143,6 +153,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
 		if (phys_addr != ULONG_MAX) {
 			mce->u.ue_error.physical_address_provided = true;
 			mce->u.ue_error.physical_address = phys_addr;
+			machine_check_ue_event(mce);
 		}
 	}
 	return;
@@ -197,6 +208,26 @@ void release_mce_event(void)
 	get_mce_event(NULL, true);
 }
 
+
+/*
+ * Queue up the MCE event which then can be handled later.
+ */
+void machine_check_ue_event(struct machine_check_event *evt)
+{
+	int index;
+
+	index = __this_cpu_inc_return(mce_ue_count) - 1;
+	/* If queue is full, just return for now. */
+	if (index >= MAX_MC_EVT) {
+		__this_cpu_dec(mce_ue_count);
+		return;
+	}
+	memcpy(this_cpu_ptr(&mce_ue_event_queue[index]), evt, sizeof(*evt));
+
+	/* Queue work to process this event later. */
+	schedule_work(&mce_ue_event_work);
+}
+
 /*
  * Queue up the MCE event which then can be handled later.
  */
@@ -219,7 +250,32 @@ void machine_check_queue_event(void)
 	/* Queue irq work to process this event later. */
 	irq_work_queue(&mce_event_process_work);
 }
-
+/*
+ * process pending MCE event from the mce event queue. This function will be
+ * called during syscall exit.
+ */
+static void machine_process_ue_event(struct work_struct *work)
+{
+	int index;
+	struct machine_check_event *evt;
+
+	while (__this_cpu_read(mce_ue_count) > 0) {
+		index = __this_cpu_read(mce_ue_count) - 1;
+		evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+#ifdef CONFIG_MEMORY_FAILURE
+		/*
+		 * This should probably queued elsewhere, but
+		 * oh! well
+		 */
+		if (evt->error_type == MCE_ERROR_TYPE_UE) {
+			if (evt->u.ue_error.physical_address_provided)
+				memory_failure(evt->u.ue_error.physical_address,
+						SIGBUS, 0);
+		}
+#endif
+		__this_cpu_dec(mce_ue_count);
+	}
+}
 /*
  * process pending MCE event from the mce event queue. This function will be
  * called during syscall exit.
@@ -227,6 +283,7 @@ void machine_check_queue_event(void)
 static void machine_check_process_queued_event(struct irq_work *work)
 {
 	int index;
+	struct machine_check_event *evt;
 
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
@@ -236,8 +293,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
 	 */
 	while (__this_cpu_read(mce_queue_count) > 0) {
 		index = __this_cpu_read(mce_queue_count) - 1;
-		machine_check_print_event_info(
-				this_cpu_ptr(&mce_event_queue[index]), false);
+		evt = this_cpu_ptr(&mce_event_queue[index]);
+		machine_check_print_event_info(evt, false);
 		__this_cpu_dec(mce_queue_count);
 	}
 }
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v1 0/4] Revisit MCE handling for UE Errors
  2017-09-12  4:38 [PATCH v1 0/4] Revisit MCE handling for UE Errors Balbir Singh
                   ` (3 preceding siblings ...)
  2017-09-12  4:38 ` [PATCH v1 4/4] powerpc/mce: hookup memory_failure for " Balbir Singh
@ 2017-09-12  5:03 ` Nicholas Piggin
  2017-09-12  7:11   ` Balbir Singh
  4 siblings, 1 reply; 7+ messages in thread
From: Nicholas Piggin @ 2017-09-12  5:03 UTC (permalink / raw)
  To: Balbir Singh; +Cc: mpe, mahesh, linuxppc-dev, skiboot

Hi Balbir,

Very cool. How are you testing it? Is it failing memory pages
and poisoning them out properly?

Looks like you have a printk in the machine_check_early path,
which you shouldn't. I guess because we don't mark that context
as an NMI. Which we could... but I think you want to put as
little as possible in that path, so avoiding the print would
be preferable. Perhaps you could mark the mce event somehow that
the failure can be reported during processing it?

Firmware logging is a good question, I could not really see
where this all gets plumbed through. If this is expected to be
a common problem for some types of attached memory, then we
really need to build up a log of these errors that can be used
to exclude the memory after a reboot too. Do we have anything
like this capability in firmware?

Thanks,
Nick

On Tue, 12 Sep 2017 14:38:55 +1000
Balbir Singh <bsingharora@gmail.com> wrote:

> This patch series is designed to hook up memory_failure on
> UE errors, this is specially helpful for user_mode UE errors.
> 
> The first patch is a cleanup patch, it removes dead code.
> I could not find any users of get_mce_fault_addr().
> The second patch walks kernel/user mode page tables in
> real mode to extract the effective address of the instruction
> that caused the UE error and the effective address it was
> trying to access (for load/store). The third patch hooks
> up the pfn for instruction UE errors (ierror).
> 
> The fourth patch hooks up memory_failure to the MCE patch.
> 
> TODO:
> Log the address in NVRAM, so that we can recover from
> bad pages at boot and keep the blacklist persistent.
> 
> Changelog v2:
> 	- address review comments from Nick and Mahesh
> 	(initialization of pfn and more comments on failure
> 	when addr_to_pfn() or anaylse_instr() fail)
> 	- Hookup ierrors to the framework as well
> 	(comments from Mahesh)
> 
> Balbir Singh (4):
>   powerpc/mce.c: Remove unused function get_mce_fault_addr()
>   powerpc/mce: Hookup derror (load/store) UE errors
>   powerpc/mce: Hookup ierror (instruction) UE errors
>   powerpc/mce: hookup memory_failure for UE errors
> 
>  arch/powerpc/include/asm/mce.h  |   4 +-
>  arch/powerpc/kernel/mce.c       | 108 ++++++++++++++++++++++++----------------
>  arch/powerpc/kernel/mce_power.c | 105 +++++++++++++++++++++++++++++++++++---
>  3 files changed, 163 insertions(+), 54 deletions(-)
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v1 0/4] Revisit MCE handling for UE Errors
  2017-09-12  5:03 ` [PATCH v1 0/4] Revisit MCE handling for UE Errors Nicholas Piggin
@ 2017-09-12  7:11   ` Balbir Singh
  0 siblings, 0 replies; 7+ messages in thread
From: Balbir Singh @ 2017-09-12  7:11 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Michael Ellerman, Mahesh Jagannath Salgaonkar,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	skiboot list

On Tue, Sep 12, 2017 at 3:03 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> Hi Balbir,
>
> Very cool. How are you testing it? Is it failing memory pages
> and poisoning them out properly?
>

Yep, I tested it and it seems to work correctly so far. I am testing this
on a simulator with injected MCE UE errors for both the data and
instruction side.

> Looks like you have a printk in the machine_check_early path,
> which you shouldn't. I guess because we don't mark that context
> as an NMI. Which we could... but I think you want to put as
> little as possible in that path, so avoiding the print would
> be preferable. Perhaps you could mark the mce event somehow that
> the failure can be reported during processing it?
>

Good point, I did see that printk handles stuff via printk_nmi_enter/exit,
but its best avoided. Will spin v2

> Firmware logging is a good question, I could not really see
> where this all gets plumbed through. If this is expected to be
> a common problem for some types of attached memory, then we
> really need to build up a log of these errors that can be used
> to exclude the memory after a reboot too. Do we have anything
> like this capability in firmware?

It's to be built, we should log these to NVRAM and revisit at every
boot to isolate these pages

Balbir Singh.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-09-12  7:11 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-12  4:38 [PATCH v1 0/4] Revisit MCE handling for UE Errors Balbir Singh
2017-09-12  4:38 ` [PATCH v1 1/4] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
2017-09-12  4:38 ` [PATCH v1 2/4] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
2017-09-12  4:38 ` [PATCH v1 3/4] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
2017-09-12  4:38 ` [PATCH v1 4/4] powerpc/mce: hookup memory_failure for " Balbir Singh
2017-09-12  5:03 ` [PATCH v1 0/4] Revisit MCE handling for UE Errors Nicholas Piggin
2017-09-12  7:11   ` Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).