linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/4] Revisit MCE handling for UE Errors
@ 2017-09-13  6:10 Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Balbir Singh @ 2017-09-13  6:10 UTC (permalink / raw)
  To: mpe, npiggin, mahesh; +Cc: linuxppc-dev, Balbir Singh

This patch series is designed to hook up memory_failure on
UE errors, this is specially helpful for user_mode UE errors.

The first two patches cleanup bits, remove dead code.
I could not find any users of get_mce_fault_addr().
The second one improves printing of physical address

The third patch walks kernel/user mode page tables in
real mode to extract the effective address of the instruction
that caused the UE error and the effective address it was
trying to access (for load/store). The fourth patch hooks
up the pfn for instruction UE errors (ierror).

The fifth patch hooks up memory_failure to the MCE patch.

TODO:
Log the address in NVRAM, so that we can recover from
bad pages at boot and keep the blacklist persistent.

Changelog v2:
	- Remove pr_warn from real mode to a more delayed
	context, where its OK to warn.
	- Add a trivial patch to align prints of physical
	and effective address
Changelog v1:
	- address review comments from Nick and Mahesh
	(initialization of pfn and more comments on failure
	when addr_to_pfn() or anaylse_instr() fail)
	- Hookup ierrors to the framework as well
	(comments from Mahesh)

Balbir Singh (4):
  powerpc/mce.c: Remove unused function get_mce_fault_addr()
  powerpc/mce: Hookup derror (load/store) UE errors
  powerpc/mce: Hookup ierror (instruction) UE errors
  powerpc/mce: hookup memory_failure for UE errors

 arch/powerpc/include/asm/mce.h  |   4 +-
 arch/powerpc/kernel/mce.c       | 112 +++++++++++++++++++++++++---------------
 arch/powerpc/kernel/mce_power.c | 103 +++++++++++++++++++++++++++++++++---
 3 files changed, 165 insertions(+), 54 deletions(-)

-- 
2.9.5

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr()
  2017-09-13  6:10 [PATCH v2 0/4] Revisit MCE handling for UE Errors Balbir Singh
@ 2017-09-13  6:10 ` Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 2/5] powerpc/mce: align the print of physical address better Balbir Singh
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Balbir Singh @ 2017-09-13  6:10 UTC (permalink / raw)
  To: mpe, npiggin, mahesh; +Cc: linuxppc-dev, Balbir Singh

There are no users of get_mce_fault_addr()

Fixes: b63a0ff ("powerpc/powernv: Machine check exception handling.")

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/include/asm/mce.h |  2 --
 arch/powerpc/kernel/mce.c      | 39 ---------------------------------------
 2 files changed, 41 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 190d69a..75292c7 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -210,6 +210,4 @@ extern void release_mce_event(void);
 extern void machine_check_queue_event(void);
 extern void machine_check_print_event_info(struct machine_check_event *evt,
 					   bool user_mode);
-extern uint64_t get_mce_fault_addr(struct machine_check_event *evt);
-
 #endif /* __ASM_PPC64_MCE_H__ */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 9b2ea7e..e254399 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -411,45 +411,6 @@ void machine_check_print_event_info(struct machine_check_event *evt,
 }
 EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 
-uint64_t get_mce_fault_addr(struct machine_check_event *evt)
-{
-	switch (evt->error_type) {
-	case MCE_ERROR_TYPE_UE:
-		if (evt->u.ue_error.effective_address_provided)
-			return evt->u.ue_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_SLB:
-		if (evt->u.slb_error.effective_address_provided)
-			return evt->u.slb_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_ERAT:
-		if (evt->u.erat_error.effective_address_provided)
-			return evt->u.erat_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_TLB:
-		if (evt->u.tlb_error.effective_address_provided)
-			return evt->u.tlb_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_USER:
-		if (evt->u.user_error.effective_address_provided)
-			return evt->u.user_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_RA:
-		if (evt->u.ra_error.effective_address_provided)
-			return evt->u.ra_error.effective_address;
-		break;
-	case MCE_ERROR_TYPE_LINK:
-		if (evt->u.link_error.effective_address_provided)
-			return evt->u.link_error.effective_address;
-		break;
-	default:
-	case MCE_ERROR_TYPE_UNKNOWN:
-		break;
-	}
-	return 0;
-}
-EXPORT_SYMBOL(get_mce_fault_addr);
-
 /*
  * This function is called in real mode. Strictly no printk's please.
  *
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/5] powerpc/mce: align the print of physical address better
  2017-09-13  6:10 [PATCH v2 0/4] Revisit MCE handling for UE Errors Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
@ 2017-09-13  6:10 ` Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Balbir Singh @ 2017-09-13  6:10 UTC (permalink / raw)
  To: mpe, npiggin, mahesh; +Cc: linuxppc-dev, Balbir Singh

Use the same alignment as Effective address and rename
phyiscal address to Page Frame Number

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/kernel/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index e254399..f351adf 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -340,7 +340,7 @@ void machine_check_print_event_info(struct machine_check_event *evt,
 			printk("%s    Effective address: %016llx\n",
 			       level, evt->u.ue_error.effective_address);
 		if (evt->u.ue_error.physical_address_provided)
-			printk("%s      Physical address: %016llx\n",
+			printk("%s    Page Frame Number: %016llx\n",
 			       level, evt->u.ue_error.physical_address);
 		break;
 	case MCE_ERROR_TYPE_SLB:
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors
  2017-09-13  6:10 [PATCH v2 0/4] Revisit MCE handling for UE Errors Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 2/5] powerpc/mce: align the print of physical address better Balbir Singh
@ 2017-09-13  6:10 ` Balbir Singh
  2017-09-13  6:21   ` Nicholas Piggin
  2017-09-13  6:10 ` [PATCH v2 4/5] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 5/5] powerpc/mce: hookup memory_failure for " Balbir Singh
  4 siblings, 1 reply; 9+ messages in thread
From: Balbir Singh @ 2017-09-13  6:10 UTC (permalink / raw)
  To: mpe, npiggin, mahesh; +Cc: linuxppc-dev, Balbir Singh

Extract physical_address for UE errors by walking the page
tables for the mm and address at the NIP, to extract the
instruction. Then use the instruction to find the effective
address via analyse_instr().

We might have page table walking races, but we expect them to
be rare, the physical address extraction is best effort. The idea
is to then hook up this infrastructure to memory failure eventually.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/include/asm/mce.h  |  2 +-
 arch/powerpc/kernel/mce.c       |  6 +++-
 arch/powerpc/kernel/mce_power.c | 80 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 75292c7..3a1226e 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -204,7 +204,7 @@ struct mce_error_info {
 
 extern void save_mce_event(struct pt_regs *regs, long handled,
 			   struct mce_error_info *mce_err, uint64_t nip,
-			   uint64_t addr);
+			   uint64_t addr, uint64_t phys_addr);
 extern int get_mce_event(struct machine_check_event *mce, bool release);
 extern void release_mce_event(void);
 extern void machine_check_queue_event(void);
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index f351adf..c7acc33 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -82,7 +82,7 @@ static void mce_set_error_info(struct machine_check_event *mce,
  */
 void save_mce_event(struct pt_regs *regs, long handled,
 		    struct mce_error_info *mce_err,
-		    uint64_t nip, uint64_t addr)
+		    uint64_t nip, uint64_t addr, uint64_t phys_addr)
 {
 	int index = __this_cpu_inc_return(mce_nest_count) - 1;
 	struct machine_check_event *mce = this_cpu_ptr(&mce_event[index]);
@@ -140,6 +140,10 @@ void save_mce_event(struct pt_regs *regs, long handled,
 	} else if (mce->error_type == MCE_ERROR_TYPE_UE) {
 		mce->u.ue_error.effective_address_provided = true;
 		mce->u.ue_error.effective_address = addr;
+		if (phys_addr != ULONG_MAX) {
+			mce->u.ue_error.physical_address_provided = true;
+			mce->u.ue_error.physical_address = phys_addr;
+		}
 	}
 	return;
 }
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index b76ca19..2dfbbe0 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,29 @@
 #include <asm/mmu.h>
 #include <asm/mce.h>
 #include <asm/machdep.h>
+#include <asm/pgtable.h>
+#include <asm/pte-walk.h>
+#include <asm/sstep.h>
+
+/*
+ * Convert an address related to an mm to a PFN. NOTE: we are in real
+ * mode, we could potentially race with page table updates.
+ */
+static unsigned long addr_to_pfn(struct mm_struct *mm, unsigned long addr)
+{
+	pte_t *ptep;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	if (mm == current->mm)
+		ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
+	else
+		ptep = find_init_mm_pte(addr, NULL);
+	local_irq_restore(flags);
+	if (!ptep || pte_special(*ptep))
+		return ULONG_MAX;
+	return pte_pfn(*ptep);
+}
 
 static void flush_tlb_206(unsigned int num_sets, unsigned int action)
 {
@@ -421,6 +444,48 @@ static const struct mce_derror_table mce_p9_derror_table[] = {
   MCE_INITIATOR_CPU,   MCE_SEV_ERROR_SYNC, },
 { 0, false, 0, 0, 0, 0 } };
 
+static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
+					uint64_t *phys_addr)
+{
+	/*
+	 * Carefully look at the NIP to determine
+	 * the instruction to analyse. Reading the NIP
+	 * in real-mode is tricky and can lead to recursive
+	 * faults
+	 */
+	int instr;
+	struct mm_struct *mm;
+	unsigned long nip = regs->nip;
+	unsigned long pfn, instr_addr;
+	struct instruction_op op;
+	struct pt_regs tmp = *regs;
+
+	if (user_mode(regs))
+		mm = current->mm;
+	else
+		mm = &init_mm;
+
+	pfn = addr_to_pfn(mm, nip);
+	if (pfn != ULONG_MAX) {
+		instr_addr = (pfn << PAGE_SHIFT) + (nip & ~PAGE_MASK);
+		instr = *(unsigned int *)(instr_addr);
+		if (!analyse_instr(&op, &tmp, instr)) {
+			pfn = addr_to_pfn(mm, op.ea);
+			*addr = op.ea;
+			*phys_addr = pfn;
+			return 0;
+		}
+		/*
+		 * analyse_instr() might fail if the instruction
+		 * is not a load/store, although this is unexpected
+		 * for load/store errors or if we got the NIP
+		 * wrong
+		 */
+	}
+	*addr = 0;
+	return -1;
+}
+
 static int mce_handle_ierror(struct pt_regs *regs,
 		const struct mce_ierror_table table[],
 		struct mce_error_info *mce_err, uint64_t *addr)
@@ -489,7 +554,8 @@ static int mce_handle_ierror(struct pt_regs *regs,
 
 static int mce_handle_derror(struct pt_regs *regs,
 		const struct mce_derror_table table[],
-		struct mce_error_info *mce_err, uint64_t *addr)
+		struct mce_error_info *mce_err, uint64_t *addr,
+		uint64_t *phys_addr)
 {
 	uint64_t dsisr = regs->dsisr;
 	int handled = 0;
@@ -555,7 +621,10 @@ static int mce_handle_derror(struct pt_regs *regs,
 		mce_err->initiator = table[i].initiator;
 		if (table[i].dar_valid)
 			*addr = regs->dar;
-
+		else if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
+				table[i].error_type == MCE_ERROR_TYPE_UE) {
+			mce_find_instr_ea_and_pfn(regs, addr, phys_addr);
+		}
 		found = 1;
 	}
 
@@ -592,19 +661,20 @@ static long mce_handle_error(struct pt_regs *regs,
 		const struct mce_ierror_table itable[])
 {
 	struct mce_error_info mce_err = { 0 };
-	uint64_t addr;
+	uint64_t addr, phys_addr;
 	uint64_t srr1 = regs->msr;
 	long handled;
 
 	if (SRR1_MC_LOADSTORE(srr1))
-		handled = mce_handle_derror(regs, dtable, &mce_err, &addr);
+		handled = mce_handle_derror(regs, dtable, &mce_err, &addr,
+				&phys_addr);
 	else
 		handled = mce_handle_ierror(regs, itable, &mce_err, &addr);
 
 	if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
 		handled = mce_handle_ue_error(regs);
 
-	save_mce_event(regs, handled, &mce_err, regs->nip, addr);
+	save_mce_event(regs, handled, &mce_err, regs->nip, addr, phys_addr);
 
 	return handled;
 }
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 4/5] powerpc/mce: Hookup ierror (instruction) UE errors
  2017-09-13  6:10 [PATCH v2 0/4] Revisit MCE handling for UE Errors Balbir Singh
                   ` (2 preceding siblings ...)
  2017-09-13  6:10 ` [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
@ 2017-09-13  6:10 ` Balbir Singh
  2017-09-13  6:10 ` [PATCH v2 5/5] powerpc/mce: hookup memory_failure for " Balbir Singh
  4 siblings, 0 replies; 9+ messages in thread
From: Balbir Singh @ 2017-09-13  6:10 UTC (permalink / raw)
  To: mpe, npiggin, mahesh; +Cc: linuxppc-dev, Balbir Singh

Hookup instruction errors (UE) for memory offling via memory_failure()
in a manner similar to load/store errors (derror). Since we have access
to the NIP, the conversion is a one step process in this case.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/kernel/mce_power.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 2dfbbe0..3bc9500 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -488,7 +488,8 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
 
 static int mce_handle_ierror(struct pt_regs *regs,
 		const struct mce_ierror_table table[],
-		struct mce_error_info *mce_err, uint64_t *addr)
+		struct mce_error_info *mce_err, uint64_t *addr,
+		uint64_t *phys_addr)
 {
 	uint64_t srr1 = regs->msr;
 	int handled = 0;
@@ -540,8 +541,23 @@ static int mce_handle_ierror(struct pt_regs *regs,
 		}
 		mce_err->severity = table[i].severity;
 		mce_err->initiator = table[i].initiator;
-		if (table[i].nip_valid)
+		if (table[i].nip_valid) {
 			*addr = regs->nip;
+			if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
+				table[i].error_type == MCE_ERROR_TYPE_UE) {
+				unsigned long pfn;
+				struct mm_struct *mm;
+
+				if (user_mode(regs))
+					mm = current->mm;
+				else
+					mm = &init_mm;
+
+				pfn = addr_to_pfn(mm, regs->nip);
+				if (pfn != ULONG_MAX)
+					*phys_addr = pfn;
+			}
+		}
 		return handled;
 	}
 
@@ -669,7 +685,8 @@ static long mce_handle_error(struct pt_regs *regs,
 		handled = mce_handle_derror(regs, dtable, &mce_err, &addr,
 				&phys_addr);
 	else
-		handled = mce_handle_ierror(regs, itable, &mce_err, &addr);
+		handled = mce_handle_ierror(regs, itable, &mce_err, &addr,
+				&phys_addr);
 
 	if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
 		handled = mce_handle_ue_error(regs);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 5/5] powerpc/mce: hookup memory_failure for UE errors
  2017-09-13  6:10 [PATCH v2 0/4] Revisit MCE handling for UE Errors Balbir Singh
                   ` (3 preceding siblings ...)
  2017-09-13  6:10 ` [PATCH v2 4/5] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
@ 2017-09-13  6:10 ` Balbir Singh
  4 siblings, 0 replies; 9+ messages in thread
From: Balbir Singh @ 2017-09-13  6:10 UTC (permalink / raw)
  To: mpe, npiggin, mahesh; +Cc: linuxppc-dev, Balbir Singh

If we are in user space and hit a UE error, we now have the
basic infrastructure to walk the page tables and find out
the effective address that was accessed, since the DAR
is not valid.

We use a work_queue content to hookup the bad pfn, any
other context causes problems, since memory_failure itself
can call into schedule() via lru_drain_ bits.

We could probably poison the struct page to avoid a race
between detection and taking corrective action.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/kernel/mce.c | 67 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 64 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index c7acc33..f20345a 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -39,11 +39,21 @@ static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT], mce_event);
 static DEFINE_PER_CPU(int, mce_queue_count);
 static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT], mce_event_queue);
 
+/* Queue for delayed MCE UE events. */
+static DEFINE_PER_CPU(int, mce_ue_count);
+static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
+					mce_ue_event_queue);
+
 static void machine_check_process_queued_event(struct irq_work *work);
+void machine_check_ue_event(struct machine_check_event *evt);
+static void machine_process_ue_event(struct work_struct *work);
+
 static struct irq_work mce_event_process_work = {
         .func = machine_check_process_queued_event,
 };
 
+DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
+
 static void mce_set_error_info(struct machine_check_event *mce,
 			       struct mce_error_info *mce_err)
 {
@@ -143,6 +153,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
 		if (phys_addr != ULONG_MAX) {
 			mce->u.ue_error.physical_address_provided = true;
 			mce->u.ue_error.physical_address = phys_addr;
+			machine_check_ue_event(mce);
 		}
 	}
 	return;
@@ -197,6 +208,26 @@ void release_mce_event(void)
 	get_mce_event(NULL, true);
 }
 
+
+/*
+ * Queue up the MCE event which then can be handled later.
+ */
+void machine_check_ue_event(struct machine_check_event *evt)
+{
+	int index;
+
+	index = __this_cpu_inc_return(mce_ue_count) - 1;
+	/* If queue is full, just return for now. */
+	if (index >= MAX_MC_EVT) {
+		__this_cpu_dec(mce_ue_count);
+		return;
+	}
+	memcpy(this_cpu_ptr(&mce_ue_event_queue[index]), evt, sizeof(*evt));
+
+	/* Queue work to process this event later. */
+	schedule_work(&mce_ue_event_work);
+}
+
 /*
  * Queue up the MCE event which then can be handled later.
  */
@@ -219,7 +250,36 @@ void machine_check_queue_event(void)
 	/* Queue irq work to process this event later. */
 	irq_work_queue(&mce_event_process_work);
 }
-
+/*
+ * process pending MCE event from the mce event queue. This function will be
+ * called during syscall exit.
+ */
+static void machine_process_ue_event(struct work_struct *work)
+{
+	int index;
+	struct machine_check_event *evt;
+
+	while (__this_cpu_read(mce_ue_count) > 0) {
+		index = __this_cpu_read(mce_ue_count) - 1;
+		evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+#ifdef CONFIG_MEMORY_FAILURE
+		/*
+		 * This should probably queued elsewhere, but
+		 * oh! well
+		 */
+		if (evt->error_type == MCE_ERROR_TYPE_UE) {
+			if (evt->u.ue_error.physical_address_provided)
+				memory_failure(evt->u.ue_error.physical_address,
+						SIGBUS, 0);
+			else
+				pr_warn("Failed to identify bad address from "
+					"where the uncorrectable error (UE) "
+					"was generated\n");
+		}
+#endif
+		__this_cpu_dec(mce_ue_count);
+	}
+}
 /*
  * process pending MCE event from the mce event queue. This function will be
  * called during syscall exit.
@@ -227,6 +287,7 @@ void machine_check_queue_event(void)
 static void machine_check_process_queued_event(struct irq_work *work)
 {
 	int index;
+	struct machine_check_event *evt;
 
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 
@@ -236,8 +297,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
 	 */
 	while (__this_cpu_read(mce_queue_count) > 0) {
 		index = __this_cpu_read(mce_queue_count) - 1;
-		machine_check_print_event_info(
-				this_cpu_ptr(&mce_event_queue[index]), false);
+		evt = this_cpu_ptr(&mce_event_queue[index]);
+		machine_check_print_event_info(evt, false);
 		__this_cpu_dec(mce_queue_count);
 	}
 }
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors
  2017-09-13  6:10 ` [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
@ 2017-09-13  6:21   ` Nicholas Piggin
  2017-09-13  6:26     ` Balbir Singh
  0 siblings, 1 reply; 9+ messages in thread
From: Nicholas Piggin @ 2017-09-13  6:21 UTC (permalink / raw)
  To: Balbir Singh; +Cc: mpe, mahesh, linuxppc-dev

On Wed, 13 Sep 2017 16:10:47 +1000
Balbir Singh <bsingharora@gmail.com> wrote:

> Extract physical_address for UE errors by walking the page
> tables for the mm and address at the NIP, to extract the
> instruction. Then use the instruction to find the effective
> address via analyse_instr().
> 
> We might have page table walking races, but we expect them to
> be rare, the physical address extraction is best effort. The idea
> is to then hook up this infrastructure to memory failure eventually.

This all looks pretty good to me, you can probably update these
changelogs now because you are hooking it into memory failure.

I wonder if it would be worth skipping the instruction analysis and
page table walk if we've recursed up to the maximum MCE depth, just
in case we're hitting MCEs in part of that code or data.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors
  2017-09-13  6:21   ` Nicholas Piggin
@ 2017-09-13  6:26     ` Balbir Singh
  2017-09-13  8:56       ` Nicholas Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: Balbir Singh @ 2017-09-13  6:26 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Michael Ellerman, Mahesh Jagannath Salgaonkar,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, Sep 13, 2017 at 4:21 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> On Wed, 13 Sep 2017 16:10:47 +1000
> Balbir Singh <bsingharora@gmail.com> wrote:
>
>> Extract physical_address for UE errors by walking the page
>> tables for the mm and address at the NIP, to extract the
>> instruction. Then use the instruction to find the effective
>> address via analyse_instr().
>>
>> We might have page table walking races, but we expect them to
>> be rare, the physical address extraction is best effort. The idea
>> is to then hook up this infrastructure to memory failure eventually.
>
> This all looks pretty good to me, you can probably update these
> changelogs now because you are hooking it into memory failure.

Yep, the eventually can probably go, I meant in the next patch.
The following patch then hooks this up into memory_failure

>
> I wonder if it would be worth skipping the instruction analysis and
> page table walk if we've recursed up to the maximum MCE depth, just
> in case we're hitting MCEs in part of that code or data.

Yep, good idea. Would you be OK if we did this after this small series
got merged? Since that would mean that we got a UE error
while processing the our third machine check exception, I think the
probability of us running into that is low, but I'd definitely like to do that
once these changes are merged.

Balbir Singh.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors
  2017-09-13  6:26     ` Balbir Singh
@ 2017-09-13  8:56       ` Nicholas Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nicholas Piggin @ 2017-09-13  8:56 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Michael Ellerman, Mahesh Jagannath Salgaonkar,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, 13 Sep 2017 16:26:59 +1000
Balbir Singh <bsingharora@gmail.com> wrote:

> On Wed, Sep 13, 2017 at 4:21 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> > On Wed, 13 Sep 2017 16:10:47 +1000
> > Balbir Singh <bsingharora@gmail.com> wrote:
> >  
> >> Extract physical_address for UE errors by walking the page
> >> tables for the mm and address at the NIP, to extract the
> >> instruction. Then use the instruction to find the effective
> >> address via analyse_instr().
> >>
> >> We might have page table walking races, but we expect them to
> >> be rare, the physical address extraction is best effort. The idea
> >> is to then hook up this infrastructure to memory failure eventually.  
> >
> > This all looks pretty good to me, you can probably update these
> > changelogs now because you are hooking it into memory failure.  
> 
> Yep, the eventually can probably go, I meant in the next patch.
> The following patch then hooks this up into memory_failure
> 
> >
> > I wonder if it would be worth skipping the instruction analysis and
> > page table walk if we've recursed up to the maximum MCE depth, just
> > in case we're hitting MCEs in part of that code or data.  
> 
> Yep, good idea. Would you be OK if we did this after this small series
> got merged?

I don't mind much, but I'd have thought being that it's all new code,
adding the check would be pretty easy.

    if (get_paca()->in_mce == 4) {}

(Probably with the '4' appropriately #defined out of here and the
exception-64s.S code)

> Since that would mean that we got a UE error
> while processing the our third machine check exception, I think the
> probability of us running into that is low, but I'd definitely like to do that
> once these changes are merged.

If we're getting UEs in the machine check code or walking kernel
page tables though, it will just keep recurring. Unlikely yes, but
it's still a slight regression.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-09-13  8:56 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-13  6:10 [PATCH v2 0/4] Revisit MCE handling for UE Errors Balbir Singh
2017-09-13  6:10 ` [PATCH v2 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
2017-09-13  6:10 ` [PATCH v2 2/5] powerpc/mce: align the print of physical address better Balbir Singh
2017-09-13  6:10 ` [PATCH v2 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
2017-09-13  6:21   ` Nicholas Piggin
2017-09-13  6:26     ` Balbir Singh
2017-09-13  8:56       ` Nicholas Piggin
2017-09-13  6:10 ` [PATCH v2 4/5] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
2017-09-13  6:10 ` [PATCH v2 5/5] powerpc/mce: hookup memory_failure for " Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).