* [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
@ 2018-04-23  4:59 Mahesh J Salgaonkar
  2018-04-23  6:51 ` Balbir Singh
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Mahesh J Salgaonkar @ 2018-04-23  4:59 UTC
  To: linuxppc-dev

From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

The current code extracts the physical address for UE errors and then
hooks it up into memory failure infrastructure. On successful extraction
of physical address it wrongly sets "handled = 1" which means this UE error
has been recovered. Since MCE handler gets return value as handled = 1, it
assumes that error has been recovered and goes back to same NIP. This causes
MCE interrupt again and again in a loop leading to hard lockup.

Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
undesired page to hwpoison.
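
The sentinel initialization can be illustrated with a small userspace sketch.
It is not the kernel code: sketch_find_phys_addr() and sketch_addr_to_queue()
are hypothetical names, and the 64K page shift is an assumption chosen because
it is consistent with the pfn and physical address in this patch's logs. The
point is only that a lookup which fails without writing its output leaves the
caller holding ULONG_MAX rather than an uninitialized value.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 16 /* assumption: 64K pages */

/*
 * Hypothetical stand-in for the address lookup. On failure it returns
 * non-zero and leaves *phys_addr untouched.
 */
static int sketch_find_phys_addr(bool lookup_ok, uint64_t pfn,
				 uint64_t *phys_addr)
{
	if (!lookup_ok)
		return -1;
	*phys_addr = pfn << SKETCH_PAGE_SHIFT;
	return 0;
}

/* Returns the address to queue for hwpoison, or ULONG_MAX for "none". */
uint64_t sketch_addr_to_queue(bool lookup_ok, uint64_t pfn)
{
	uint64_t phys_addr = ULONG_MAX; /* sentinel instead of stack garbage */

	sketch_find_phys_addr(lookup_ok, pfn, &phys_addr);
	return phys_addr;
}
```

A consumer of this value would, under the same assumption, queue the page only
when the result differs from ULONG_MAX.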

Without this patch we see:
[ 1476.541984] Severe Machine check interrupt [Recovered]
[ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
[ 1476.541986]   Initiator: CPU
[ 1476.541987]   Error type: UE [Load/Store]
[ 1476.541988]     Effective address: 00007fffd2755940
[ 1476.541989]     Physical address:  000020181a080000
[...]
[ 1476.542003] Severe Machine check interrupt [Recovered]
[ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
[ 1476.542005]   Initiator: CPU
[ 1476.542006]   Error type: UE [Load/Store]
[ 1476.542006]     Effective address: 00007fffd2755940
[ 1476.542007]     Physical address:  000020181a080000
[ 1476.542010] Severe Machine check interrupt [Recovered]
[ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
[ 1476.542013]   Initiator: CPU
[ 1476.542014]   Error type: UE [Load/Store]
[ 1476.542015]     Effective address: 00007fffd2755940
[ 1476.542016]     Physical address:  000020181a080000
[ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
[ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
[ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
[ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
[ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
[ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
[ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
[...]
[ 1490.972174] Watchdog CPU:38 Hard LOCKUP

After this patch we see:

[  325.384336] Severe Machine check interrupt [Not recovered]
[  325.384338]   NIP: [00007fffaae585f4] PID: 7168 Comm: find
[  325.384339]   Initiator: CPU
[  325.384341]   Error type: UE [Load/Store]
[  325.384343]     Effective address: 00007fffaafe28ac
[  325.384345]     Physical address:  00002017c0bd0000
[  325.384350] find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
[  325.388574] Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered

Fixes: 01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/mce_power.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index fe6fc63251fe..63b58ae5d601 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -441,7 +441,6 @@ static int mce_handle_ierror(struct pt_regs *regs,
 					if (pfn != ULONG_MAX) {
 						*phys_addr =
 							(pfn << PAGE_SHIFT);
-						handled = 1;
 					}
 				}
 			}
@@ -532,9 +531,8 @@ static int mce_handle_derror(struct pt_regs *regs,
 			 * kernel/exception-64s.h
 			 */
 			if (get_paca()->in_mce < MAX_MCE_DEPTH)
-				if (!mce_find_instr_ea_and_pfn(regs, addr,
-								phys_addr))
-					handled = 1;
+				mce_find_instr_ea_and_pfn(regs, addr,
+								phys_addr);
 		}
 		found = 1;
 	}
@@ -572,7 +570,7 @@ static long mce_handle_error(struct pt_regs *regs,
 		const struct mce_ierror_table itable[])
 {
 	struct mce_error_info mce_err = { 0 };
-	uint64_t addr, phys_addr;
+	uint64_t addr, phys_addr = ULONG_MAX;
 	uint64_t srr1 = regs->msr;
 	long handled;
 


* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23  4:59 [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE Mahesh J Salgaonkar
@ 2018-04-23  6:51 ` Balbir Singh
  2018-04-23  9:23   ` Balbir Singh
  2018-04-23 10:33   ` Mahesh Jagannath Salgaonkar
  2018-04-23 23:41 ` Balbir Singh
  2018-04-25  2:55 ` Michael Ellerman
  2 siblings, 2 replies; 10+ messages in thread
From: Balbir Singh @ 2018-04-23  6:51 UTC
  To: Mahesh J Salgaonkar; +Cc: linuxppc-dev

On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
<mahesh@linux.vnet.ibm.com> wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> The current code extracts the physical address for UE errors and then
> hooks it up into memory failure infrastructure. On successful extraction
> of physical address it wrongly sets "handled = 1" which means this UE error
> has been recovered. Since MCE handler gets return value as handled = 1, it
> assumes that error has been recovered and goes back to same NIP. This causes
> MCE interrupt again and again in a loop leading to hard lockup.
>
> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
> undesired page to hwpoison.
>
> Without this patch we see:
> [ 1476.541984] Severe Machine check interrupt [Recovered]
> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.541986]   Initiator: CPU
> [ 1476.541987]   Error type: UE [Load/Store]
> [ 1476.541988]     Effective address: 00007fffd2755940
> [ 1476.541989]     Physical address:  000020181a080000
> [...]
> [ 1476.542003] Severe Machine check interrupt [Recovered]
> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.542005]   Initiator: CPU
> [ 1476.542006]   Error type: UE [Load/Store]
> [ 1476.542006]     Effective address: 00007fffd2755940
> [ 1476.542007]     Physical address:  000020181a080000
> [ 1476.542010] Severe Machine check interrupt [Recovered]
> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.542013]   Initiator: CPU
> [ 1476.542014]   Error type: UE [Load/Store]
> [ 1476.542015]     Effective address: 00007fffd2755940
> [ 1476.542016]     Physical address:  000020181a080000
> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> [...]
> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>
> After this patch we see:
>
> [  325.384336] Severe Machine check interrupt [Not recovered]

How did you test for this? If the error was recovered, shouldn't the
process have gotten a SIGBUS, and shouldn't we have prevented further
access as part of the handling (memory_failure())? Do we just need
MF_MUST_KILL in the flags?

Why shouldn't we treat it as handled if we isolate the page?

Thanks,
Balbir Singh.


* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23  6:51 ` Balbir Singh
@ 2018-04-23  9:23   ` Balbir Singh
  2018-04-23 10:33   ` Mahesh Jagannath Salgaonkar
  1 sibling, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2018-04-23  9:23 UTC
  To: Mahesh J Salgaonkar; +Cc: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2917 bytes --]

On Mon, Apr 23, 2018 at 4:51 PM, Balbir Singh <bsingharora@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> The current code extracts the physical address for UE errors and then
>> hooks it up into memory failure infrastructure. On successful extraction
>> of physical address it wrongly sets "handled = 1" which means this UE error
>> has been recovered. Since MCE handler gets return value as handled = 1, it
>> assumes that error has been recovered and goes back to same NIP. This causes
>> MCE interrupt again and again in a loop leading to hard lockup.
>>
>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>> undesired page to hwpoison.
>>
>> Without this patch we see:
>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.541986]   Initiator: CPU
>> [ 1476.541987]   Error type: UE [Load/Store]
>> [ 1476.541988]     Effective address: 00007fffd2755940
>> [ 1476.541989]     Physical address:  000020181a080000
>> [...]
>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.542005]   Initiator: CPU
>> [ 1476.542006]   Error type: UE [Load/Store]
>> [ 1476.542006]     Effective address: 00007fffd2755940
>> [ 1476.542007]     Physical address:  000020181a080000
>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.542013]   Initiator: CPU
>> [ 1476.542014]   Error type: UE [Load/Store]
>> [ 1476.542015]     Effective address: 00007fffd2755940
>> [ 1476.542016]     Physical address:  000020181a080000
>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>> [...]
>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>
>> After this patch we see:
>>
>> [  325.384336] Severe Machine check interrupt [Not recovered]
>
> How did you test for this? If the error was recovered, shouldn't the
> process have gotten
> a SIGBUS and we should have prevented further access as a part of the handling
> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?
>
> Why shouldn't we treat it as handled if we isolate the page?

A not-yet-signed-off patch is attached for testing. Mahesh, please check/confirm.

Thanks,
Balbir

>
> Thanks,
> Balbir Singh.

[-- Attachment #2: 0001-powerpc-mce-force-a-KILL-on-user-page-MCE.patch --]
[-- Type: text/x-patch, Size: 977 bytes --]

From b297f2ad8473eea7755bcc239e7de21227438065 Mon Sep 17 00:00:00 2001
From: Balbir Singh <bsingharora@gmail.com>
Date: Mon, 23 Apr 2018 19:21:27 +1000
Subject: [PATCH] powerpc/mce: force a KILL on user page MCE

Experimental fix for MCE error, force a kill
because that's our current recovery mechanism

Not tested/compiled yet

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
---
 arch/powerpc/kernel/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index efdd16a79075..a8441f75861e 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -273,7 +273,7 @@ static void machine_process_ue_event(struct work_struct *work)
 
 				pfn = evt->u.ue_error.physical_address >>
 					PAGE_SHIFT;
-				memory_failure(pfn, 0);
+				memory_failure(pfn, MF_MUST_KILL);
 			} else
 				pr_warn("Failed to identify bad address from "
 					"where the uncorrectable error (UE) "
-- 
2.17.0



* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23  6:51 ` Balbir Singh
  2018-04-23  9:23   ` Balbir Singh
@ 2018-04-23 10:33   ` Mahesh Jagannath Salgaonkar
  2018-04-23 11:14     ` Balbir Singh
  1 sibling, 1 reply; 10+ messages in thread
From: Mahesh Jagannath Salgaonkar @ 2018-04-23 10:33 UTC
  To: Balbir Singh; +Cc: linuxppc-dev

On 04/23/2018 12:21 PM, Balbir Singh wrote:
> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>
>> The current code extracts the physical address for UE errors and then
>> hooks it up into memory failure infrastructure. On successful extraction
>> of physical address it wrongly sets "handled = 1" which means this UE error
>> has been recovered. Since MCE handler gets return value as handled = 1, it
>> assumes that error has been recovered and goes back to same NIP. This causes
>> MCE interrupt again and again in a loop leading to hard lockup.
>>
>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>> undesired page to hwpoison.
>>
>> Without this patch we see:
>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.541986]   Initiator: CPU
>> [ 1476.541987]   Error type: UE [Load/Store]
>> [ 1476.541988]     Effective address: 00007fffd2755940
>> [ 1476.541989]     Physical address:  000020181a080000
>> [...]
>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.542005]   Initiator: CPU
>> [ 1476.542006]   Error type: UE [Load/Store]
>> [ 1476.542006]     Effective address: 00007fffd2755940
>> [ 1476.542007]     Physical address:  000020181a080000
>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
>> [ 1476.542013]   Initiator: CPU
>> [ 1476.542014]   Error type: UE [Load/Store]
>> [ 1476.542015]     Effective address: 00007fffd2755940
>> [ 1476.542016]     Physical address:  000020181a080000
>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>> [...]
>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>
>> After this patch we see:
>>
>> [  325.384336] Severe Machine check interrupt [Not recovered]
> 
> How did you test for this? 

By injecting cache SUE using L2 FIR register (0x1001080c).

> If the error was recovered, shouldn't the
> process have gotten
> a SIGBUS and we should have prevented further access as a part of the handling
> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?

We hook it up to memory_failure() through a work queue, and until the
work queue kicks in, the application keeps restarting and hitting the same
NIP again and again. Every MCE hooks the same address to the memory
failure work queue again and prints another "Recovered" MCE message for
the same address. Once memory_failure() hwpoisons the page, the application
gets SIGBUS and then we are fine.
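
The window described above can be modelled with a toy userspace sketch. It is
only an illustration under stated assumptions: sketch_count_mces() is a
hypothetical name, one "tick" stands for one attempt to execute the
instruction at the faulting NIP, and ticks_until_work_runs stands for the
work queue delay. It counts how many machine checks are taken before the
task stops retrying.

```c
#include <stdbool.h>

/*
 * Hypothetical model: returns how many machine checks the task takes
 * before it stops re-executing the faulting instruction.
 */
int sketch_count_mces(bool early_handled, int ticks_until_work_runs)
{
	int mces = 0;

	for (int tick = 0; ; tick++) {
		/* Deferred memory_failure() has run: page poisoned, SIGBUS. */
		if (tick >= ticks_until_work_runs)
			break;

		mces++; /* the load/store hits the bad memory again */

		/* Not handled: the task is signalled now and never retries. */
		if (!early_handled)
			break;
		/* Handled: return to the same NIP and fault again next tick. */
	}
	return mces;
}
```

In this model, returning "handled" makes the number of machine checks grow
with the work queue delay, while returning "not handled" caps it at one.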

But in case of a UE in kernel space, if the early machine check handler
machine_check_early() returns as recovered, then
machine_check_handle_early() queues up the MCE event and continues from
the NIP assuming it is safe, causing an MCE loop. So, for a UE in the
kernel we end up in a hard lockup.

> Why shouldn't we treat it as handled if we isolate the page?

Yes, we should, but I think not until the page is actually hwpoisoned or
until we send SIGBUS to the process.

> 
> Thanks,
> Balbir Singh.
> 


* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23 10:33   ` Mahesh Jagannath Salgaonkar
@ 2018-04-23 11:14     ` Balbir Singh
  2018-04-23 13:01       ` Nicholas Piggin
  2018-04-23 13:01       ` Mahesh Jagannath Salgaonkar
  0 siblings, 2 replies; 10+ messages in thread
From: Balbir Singh @ 2018-04-23 11:14 UTC
  To: Mahesh Jagannath Salgaonkar; +Cc: linuxppc-dev

On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
<mahesh@linux.vnet.ibm.com> wrote:
> On 04/23/2018 12:21 PM, Balbir Singh wrote:
>> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>> <mahesh@linux.vnet.ibm.com> wrote:
>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>
>>> The current code extracts the physical address for UE errors and then
>>> hooks it up into memory failure infrastructure. On successful extraction
>>> of physical address it wrongly sets "handled = 1" which means this UE error
>>> has been recovered. Since MCE handler gets return value as handled = 1, it
>>> assumes that error has been recovered and goes back to same NIP. This causes
>>> MCE interrupt again and again in a loop leading to hard lockup.
>>>
>>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>>> undesired page to hwpoison.
>>>
>>> Without this patch we see:
>>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>>> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
>>> [ 1476.541986]   Initiator: CPU
>>> [ 1476.541987]   Error type: UE [Load/Store]
>>> [ 1476.541988]     Effective address: 00007fffd2755940
>>> [ 1476.541989]     Physical address:  000020181a080000
>>> [...]
>>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>>> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
>>> [ 1476.542005]   Initiator: CPU
>>> [ 1476.542006]   Error type: UE [Load/Store]
>>> [ 1476.542006]     Effective address: 00007fffd2755940
>>> [ 1476.542007]     Physical address:  000020181a080000
>>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>>> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
>>> [ 1476.542013]   Initiator: CPU
>>> [ 1476.542014]   Error type: UE [Load/Store]
>>> [ 1476.542015]     Effective address: 00007fffd2755940
>>> [ 1476.542016]     Physical address:  000020181a080000
>>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
>>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>>> [...]
>>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>>
>>> After this patch we see:
>>>
>>> [  325.384336] Severe Machine check interrupt [Not recovered]
>>
>> How did you test for this?
>
> By injecting cache SUE using L2 FIR register (0x1001080c).
>
>> If the error was recovered, shouldn't the
>> process have gotten
>> a SIGBUS and we should have prevented further access as a part of the handling
>> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?
>
> We hook it up to memory_failure() through a work queue and by the time
> work queue kicks in, the application continues to restart and hit same
> NIP again and again. Every MCE again hooks the same address to memory
> failure work queue and throws multiple recovered MCE messages for same
> address. Once the memory_failure() hwpoisons the page, application gets
> SIGBUS and then we are fine.
>

That seems quite broken, and "Not recovered" is very confusing. So effectively
we can never recover from an MCE UE. I think we need a notion of delayed
recovery then, where we do recover but mark it as recovered with a delay?
We might want to revisit our recovery process and see if the recovery requires
turning the MMU on, but that is for later, I suppose.

> But in case of UE in kernel space, if early machine_check handler
> "machine_check_early()" returns as recovered then
> machine_check_handle_early() queues up the MCE event and continues from
> NIP assuming it is safe causing a MCE loop. So, for UE in kernel we end
> up in hard lockup.
>

Yeah, for the kernel we definitely need to cause a panic for now. I've got
other patches for things we need to do for pmem that would allow potential
recovery.

Balbir Singh


* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23 11:14     ` Balbir Singh
@ 2018-04-23 13:01       ` Nicholas Piggin
  2018-04-23 23:00         ` Balbir Singh
  2018-04-23 13:01       ` Mahesh Jagannath Salgaonkar
  1 sibling, 1 reply; 10+ messages in thread
From: Nicholas Piggin @ 2018-04-23 13:01 UTC
  To: Balbir Singh; +Cc: Mahesh Jagannath Salgaonkar, linuxppc-dev

On Mon, 23 Apr 2018 21:14:12 +1000
Balbir Singh <bsingharora@gmail.com> wrote:

> On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
> > On 04/23/2018 12:21 PM, Balbir Singh wrote:  
> >> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
> >> <mahesh@linux.vnet.ibm.com> wrote:  
> >>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >>>
> >>> The current code extracts the physical address for UE errors and then
> >>> hooks it up into memory failure infrastructure. On successful extraction
> >>> of physical address it wrongly sets "handled = 1" which means this UE error
> >>> has been recovered. Since MCE handler gets return value as handled = 1, it
> >>> assumes that error has been recovered and goes back to same NIP. This causes
> >>> MCE interrupt again and again in a loop leading to hard lockup.
> >>>
> >>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
> >>> undesired page to hwpoison.
> >>>
> >>> Without this patch we see:
> >>> [ 1476.541984] Severe Machine check interrupt [Recovered]
> >>> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
> >>> [ 1476.541986]   Initiator: CPU
> >>> [ 1476.541987]   Error type: UE [Load/Store]
> >>> [ 1476.541988]     Effective address: 00007fffd2755940
> >>> [ 1476.541989]     Physical address:  000020181a080000
> >>> [...]
> >>> [ 1476.542003] Severe Machine check interrupt [Recovered]
> >>> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
> >>> [ 1476.542005]   Initiator: CPU
> >>> [ 1476.542006]   Error type: UE [Load/Store]
> >>> [ 1476.542006]     Effective address: 00007fffd2755940
> >>> [ 1476.542007]     Physical address:  000020181a080000
> >>> [ 1476.542010] Severe Machine check interrupt [Recovered]
> >>> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
> >>> [ 1476.542013]   Initiator: CPU
> >>> [ 1476.542014]   Error type: UE [Load/Store]
> >>> [ 1476.542015]     Effective address: 00007fffd2755940
> >>> [ 1476.542016]     Physical address:  000020181a080000
> >>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
> >>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> >>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> >>> [...]
> >>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
> >>>
> >>> After this patch we see:
> >>>
> >>> [  325.384336] Severe Machine check interrupt [Not recovered]  
> >>
> >> How did you test for this?  
> >
> > By injecting cache SUE using L2 FIR register (0x1001080c).
> >  
> >> If the error was recovered, shouldn't the
> >> process have gotten
> >> a SIGBUS and we should have prevented further access as a part of the handling
> >> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?  
> >
> > We hook it up to memory_failure() through a work queue and by the time
> > work queue kicks in, the application continues to restart and hit same
> > NIP again and again. Every MCE again hooks the same address to memory
> > failure work queue and throws multiple recovered MCE messages for same
> > address. Once the memory_failure() hwpoisons the page, application gets
> > SIGBUS and then we are fine.
> >  
> 
> That seems quite broken and not recovered is very confusing. So effectively
> we can never recover from a MCE UE. I think we need a notion of delayed
> recovery then? Where we do recover, but mark it as recovered with delays?
> We might want to revisit our recovery process and see if the recovery requires
> to turn the MMU on, but that is for later, I suppose.

The notion of being handled in the machine check return value is not
whether the failing resource is later de-allocated or fixed, but whether
*this* particular exception could be corrected so that processing can
resume as normal without further action.

The MCE UE is not recovered just by finding its address here, so I think
Mahesh's patch is right.

You can still recover it with further action later.
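
One way to picture the distinction is to keep the two answers in separate
fields. This is a sketch only, with hypothetical names (sketch_result,
sketch_classify) that do not come from the kernel: "handled" says whether
this exception may return to the same NIP, and "queue_for_poison" says
whether later work has an address to act on.

```c
#include <stdbool.h>

enum sketch_err { SKETCH_ERR_CORRECTED, SKETCH_ERR_UE };

struct sketch_result {
	bool handled;          /* may this exception return to the same NIP? */
	bool queue_for_poison; /* does later work have a page to isolate? */
};

struct sketch_result sketch_classify(enum sketch_err err, bool addr_found)
{
	struct sketch_result r = { false, false };

	if (err == SKETCH_ERR_CORRECTED) {
		r.handled = true;
		return r;
	}
	/* UE: knowing the address enables later recovery, nothing more. */
	r.queue_for_poison = addr_found;
	return r;
}
```

With this split, finding the address of a UE never changes "handled"; it
only decides whether there is something to queue.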

> 
> > But in case of UE in kernel space, if early machine_check handler
> > "machine_check_early()" returns as recovered then
> > machine_check_handle_early() queues up the MCE event and continues from
> > NIP assuming it is safe causing a MCE loop. So, for UE in kernel we end
> > up in hard lockup.
> >  
> 
> Yeah for the kernel, we need to definitely cause a panic for now, I've got other
> patches for things we need to do for pmem that would allow potential recovery.
> 
> Balbir Singh


* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23 11:14     ` Balbir Singh
  2018-04-23 13:01       ` Nicholas Piggin
@ 2018-04-23 13:01       ` Mahesh Jagannath Salgaonkar
  1 sibling, 0 replies; 10+ messages in thread
From: Mahesh Jagannath Salgaonkar @ 2018-04-23 13:01 UTC
  To: Balbir Singh; +Cc: linuxppc-dev

On 04/23/2018 04:44 PM, Balbir Singh wrote:
> On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
> <mahesh@linux.vnet.ibm.com> wrote:
>> On 04/23/2018 12:21 PM, Balbir Singh wrote:
>>> On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
>>> <mahesh@linux.vnet.ibm.com> wrote:
>>>> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>>>>
>>>> The current code extracts the physical address for UE errors and then
>>>> hooks it up into memory failure infrastructure. On successful extraction
>>>> of physical address it wrongly sets "handled = 1" which means this UE error
>>>> has been recovered. Since MCE handler gets return value as handled = 1, it
>>>> assumes that error has been recovered and goes back to same NIP. This causes
>>>> MCE interrupt again and again in a loop leading to hard lockup.
>>>>
>>>> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
>>>> undesired page to hwpoison.
>>>>
>>>> Without this patch we see:
>>>> [ 1476.541984] Severe Machine check interrupt [Recovered]
>>>> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
>>>> [ 1476.541986]   Initiator: CPU
>>>> [ 1476.541987]   Error type: UE [Load/Store]
>>>> [ 1476.541988]     Effective address: 00007fffd2755940
>>>> [ 1476.541989]     Physical address:  000020181a080000
>>>> [...]
>>>> [ 1476.542003] Severe Machine check interrupt [Recovered]
>>>> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
>>>> [ 1476.542005]   Initiator: CPU
>>>> [ 1476.542006]   Error type: UE [Load/Store]
>>>> [ 1476.542006]     Effective address: 00007fffd2755940
>>>> [ 1476.542007]     Physical address:  000020181a080000
>>>> [ 1476.542010] Severe Machine check interrupt [Recovered]
>>>> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
>>>> [ 1476.542013]   Initiator: CPU
>>>> [ 1476.542014]   Error type: UE [Load/Store]
>>>> [ 1476.542015]     Effective address: 00007fffd2755940
>>>> [ 1476.542016]     Physical address:  000020181a080000
>>>> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
>>>> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
>>>> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
>>>> [...]
>>>> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
>>>>
>>>> After this patch we see:
>>>>
>>>> [  325.384336] Severe Machine check interrupt [Not recovered]
>>>
>>> How did you test for this?
>>
>> By injecting cache SUE using L2 FIR register (0x1001080c).
>>
>>> If the error was recovered, shouldn't the
>>> process have gotten
>>> a SIGBUS and we should have prevented further access as a part of the handling
>>> (memory_failure()). Do we just need a MF_MUST_KILL in the flags?
>>
>> We hook it up to memory_failure() through a work queue and by the time
>> work queue kicks in, the application continues to restart and hit same
>> NIP again and again. Every MCE again hooks the same address to memory
>> failure work queue and throws multiple recovered MCE messages for same
>> address. Once the memory_failure() hwpoisons the page, application gets
>> SIGBUS and then we are fine.
>>
> 
> That seems quite broken and not recovered is very confusing. So effectively
> we can never recover from a MCE UE. 

By not setting handled = 1, the recovery code will fall through
machine_check_exception()->opal_machine_check(), and then SIGBUS is sent
to this process to recover, or we head to the panic path for a kernel UE.
We have already hooked up the physical address to memory_failure(), which
will later hwpoison the page whenever the work queue kicks in. This patch
makes sure this happens.
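
The fall-through described here amounts to a three-way decision, sketched
below in plain C. The names (sketch_action, sketch_next_action) are
hypothetical and the function is a simplification for illustration, not the
actual control flow of machine_check_exception().

```c
#include <stdbool.h>

enum sketch_action { SKETCH_RESUME, SKETCH_SIGBUS, SKETCH_PANIC };

/* Decide what happens after the early handler returns. */
enum sketch_action sketch_next_action(bool handled, bool user_mode)
{
	if (handled)
		return SKETCH_RESUME; /* go back to the same NIP */
	/* Not handled: kill the offending task, or panic for a kernel UE. */
	return user_mode ? SKETCH_SIGBUS : SKETCH_PANIC;
}
```

The "unhandled signal 7" line in the patch's "after" log matches the
user-mode branch: signal 7 is SIGBUS on Linux.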

> I think we need a notion of delayed
> recovery then? Where we do recover, but mark it as recovered with delays?

Yeah, maybe we can set the disposition for a userspace MCE event as recovery
in progress/delayed and then print the MCE event again from the work queue
by looking at the return value from memory_failure(). What do you think?

> We might want to revisit our recovery process and see if the recovery requires
> to turn the MMU on, but that is for later, I suppose.
> 
>> But in case of UE in kernel space, if early machine_check handler
>> "machine_check_early()" returns as recovered then
>> machine_check_handle_early() queues up the MCE event and continues from
>> NIP assuming it is safe causing a MCE loop. So, for UE in kernel we end
>> up in hard lockup.
>>
> 
> Yeah for the kernel, we need to definitely cause a panic for now, I've got other
> patches for things we need to do for pmem that would allow potential recovery.
> 
> Balbir Singh
> 


* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23 13:01       ` Nicholas Piggin
@ 2018-04-23 23:00         ` Balbir Singh
  0 siblings, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2018-04-23 23:00 UTC
  To: Nicholas Piggin; +Cc: Mahesh Jagannath Salgaonkar, linuxppc-dev

On Mon, 2018-04-23 at 23:01 +1000, Nicholas Piggin wrote:
> On Mon, 23 Apr 2018 21:14:12 +1000
> Balbir Singh <bsingharora@gmail.com> wrote:
> 
> > On Mon, Apr 23, 2018 at 8:33 PM, Mahesh Jagannath Salgaonkar
> > <mahesh@linux.vnet.ibm.com> wrote:
> > > On 04/23/2018 12:21 PM, Balbir Singh wrote:  
> > > > On Mon, Apr 23, 2018 at 2:59 PM, Mahesh J Salgaonkar
> > > > <mahesh@linux.vnet.ibm.com> wrote:  
> > > > > From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> > > > > 
> > > > > The current code extracts the physical address for UE errors and then
> > > > > hooks it up into memory failure infrastructure. On successful extraction
> > > > > of physical address it wrongly sets "handled = 1" which means this UE error
> > > > > has been recovered. Since MCE handler gets return value as handled = 1, it
> > > > > assumes that error has been recovered and goes back to same NIP. This causes
> > > > > MCE interrupt again and again in a loop leading to hard lockup.
> > > > > 
> > > > > Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
> > > > > undesired page to hwpoison.
> > > > > 
> > > > > Without this patch we see:
> > > > > [ 1476.541984] Severe Machine check interrupt [Recovered]
> > > > > [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
> > > > > [ 1476.541986]   Initiator: CPU
> > > > > [ 1476.541987]   Error type: UE [Load/Store]
> > > > > [ 1476.541988]     Effective address: 00007fffd2755940
> > > > > [ 1476.541989]     Physical address:  000020181a080000
> > > > > [...]
> > > > > [ 1476.542003] Severe Machine check interrupt [Recovered]
> > > > > [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
> > > > > [ 1476.542005]   Initiator: CPU
> > > > > [ 1476.542006]   Error type: UE [Load/Store]
> > > > > [ 1476.542006]     Effective address: 00007fffd2755940
> > > > > [ 1476.542007]     Physical address:  000020181a080000
> > > > > [ 1476.542010] Severe Machine check interrupt [Recovered]
> > > > > [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
> > > > > [ 1476.542013]   Initiator: CPU
> > > > > [ 1476.542014]   Error type: UE [Load/Store]
> > > > > [ 1476.542015]     Effective address: 00007fffd2755940
> > > > > [ 1476.542016]     Physical address:  000020181a080000
> > > > > [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
> > > > > [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> > > > > [...]
> > > > > [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
> > > > > 
> > > > > After this patch we see:
> > > > > 
> > > > > [  325.384336] Severe Machine check interrupt [Not recovered]  
> > > > 
> > > > How did you test for this?  
> > > 
> > > By injecting cache SUE using L2 FIR register (0x1001080c).
> > >  
> > > > If the error was recovered, shouldn't the
> > > > process have gotten
> > > > a SIGBUS and we should have prevented further access as a part of the handling
> > > > (memory_failure()). Do we just need a MF_MUST_KILL in the flags?  
> > > 
> > > We hook it up to memory_failure() through a work queue and by the time
> > > work queue kicks in, the application continues to restart and hit same
> > > NIP again and again. Every MCE again hooks the same address to memory
> > > failure work queue and throws multiple recovered MCE messages for same
> > > address. Once the memory_failure() hwpoisons the page, application gets
> > > SIGBUS and then we are fine.
> > >  
> > 
> > That seems quite broken and not recovered is very confusing. So effectively
> > we can never recover from a MCE UE. I think we need a notion of delayed
> > recovery then? Where we do recover, but mark it as recovered with delays?
> > We might want to revisit our recovery process and see if the recovery requires
> > to turn the MMU on, but that is for later, I suppose.
> 
> The notion of being handled in the machine check return value is not
> whether the failing resource is later de-allocated or fixed, but if
> *this* particular exception was able to be corrected / processing
> resume as normal without further action.
> 
> The MCE UE is not recovered just by finding its address here, so I think
> Mahesh's patch is right.
> 

OK, it would be nice to see a "recovered" in the output, as opposed to the
process being killed but the MCE reported as "not recovered". The kernel
bits do sound sane.

Balbir Singh.

* Re: [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23  4:59 [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE Mahesh J Salgaonkar
  2018-04-23  6:51 ` Balbir Singh
@ 2018-04-23 23:41 ` Balbir Singh
  2018-04-25  2:55 ` Michael Ellerman
  2 siblings, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2018-04-23 23:41 UTC (permalink / raw)
  To: Mahesh J Salgaonkar, linuxppc-dev

On Mon, 2018-04-23 at 10:29 +0530, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> The current code extracts the physical address for UE errors and then
> hooks it up into memory failure infrastructure. On successful extraction
> of physical address it wrongly sets "handled = 1" which means this UE error
> has been recovered. Since MCE handler gets return value as handled = 1, it
> assumes that error has been recovered and goes back to same NIP. This causes
> MCE interrupt again and again in a loop leading to hard lockup.
> 
> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
> undesired page to hwpoison.
> 
> Without this patch we see:
> [ 1476.541984] Severe Machine check interrupt [Recovered]
> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.541986]   Initiator: CPU
> [ 1476.541987]   Error type: UE [Load/Store]
> [ 1476.541988]     Effective address: 00007fffd2755940
> [ 1476.541989]     Physical address:  000020181a080000
> [...]
> [ 1476.542003] Severe Machine check interrupt [Recovered]
> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.542005]   Initiator: CPU
> [ 1476.542006]   Error type: UE [Load/Store]
> [ 1476.542006]     Effective address: 00007fffd2755940
> [ 1476.542007]     Physical address:  000020181a080000
> [ 1476.542010] Severe Machine check interrupt [Recovered]
> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.542013]   Initiator: CPU
> [ 1476.542014]   Error type: UE [Load/Store]
> [ 1476.542015]     Effective address: 00007fffd2755940
> [ 1476.542016]     Physical address:  000020181a080000
> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> [...]
> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
> 
> After this patch we see:
> 
> [  325.384336] Severe Machine check interrupt [Not recovered]
> [  325.384338]   NIP: [00007fffaae585f4] PID: 7168 Comm: find
> [  325.384339]   Initiator: CPU
> [  325.384341]   Error type: UE [Load/Store]
> [  325.384343]     Effective address: 00007fffaafe28ac
> [  325.384345]     Physical address:  00002017c0bd0000
> [  325.384350] find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
> [  325.388574] Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
> 
> Fixes: 01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/mce_power.c |    8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index fe6fc63251fe..63b58ae5d601 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -441,7 +441,6 @@ static int mce_handle_ierror(struct pt_regs *regs,
>  					if (pfn != ULONG_MAX) {
>  						*phys_addr =
>  							(pfn << PAGE_SHIFT);
> -						handled = 1;
>  					}
>  				}
>  			}
> @@ -532,9 +531,8 @@ static int mce_handle_derror(struct pt_regs *regs,
>  			 * kernel/exception-64s.h
>  			 */
>  			if (get_paca()->in_mce < MAX_MCE_DEPTH)
> -				if (!mce_find_instr_ea_and_pfn(regs, addr,
> -								phys_addr))
> -					handled = 1;
> +				mce_find_instr_ea_and_pfn(regs, addr,
> +								phys_addr);
>  		}
>  		found = 1;
>  	}
> @@ -572,7 +570,7 @@ static long mce_handle_error(struct pt_regs *regs,
>  		const struct mce_ierror_table itable[])
>  {
>  	struct mce_error_info mce_err = { 0 };
> -	uint64_t addr, phys_addr;
> +	uint64_t addr, phys_addr = ULONG_MAX;
>  	uint64_t srr1 = regs->msr;
>  	long handled;
>  
> 

Reviewed-by: Balbir Singh <bsingharora@gmail.com>

* Re: powerpc/mce: Fix a bug where mce loops on memory UE.
  2018-04-23  4:59 [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE Mahesh J Salgaonkar
  2018-04-23  6:51 ` Balbir Singh
  2018-04-23 23:41 ` Balbir Singh
@ 2018-04-25  2:55 ` Michael Ellerman
  2 siblings, 0 replies; 10+ messages in thread
From: Michael Ellerman @ 2018-04-25  2:55 UTC (permalink / raw)
  To: Mahesh J Salgaonkar, linuxppc-dev

On Mon, 2018-04-23 at 04:59:27 UTC, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> 
> The current code extracts the physical address for UE errors and then
> hooks it up into memory failure infrastructure. On successful extraction
> of physical address it wrongly sets "handled = 1" which means this UE error
> has been recovered. Since MCE handler gets return value as handled = 1, it
> assumes that error has been recovered and goes back to same NIP. This causes
> MCE interrupt again and again in a loop leading to hard lockup.
> 
> Also, initialize phys_addr to ULONG_MAX so that we don't end up queuing
> undesired page to hwpoison.
> 
> Without this patch we see:
> [ 1476.541984] Severe Machine check interrupt [Recovered]
> [ 1476.541985]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.541986]   Initiator: CPU
> [ 1476.541987]   Error type: UE [Load/Store]
> [ 1476.541988]     Effective address: 00007fffd2755940
> [ 1476.541989]     Physical address:  000020181a080000
> [...]
> [ 1476.542003] Severe Machine check interrupt [Recovered]
> [ 1476.542004]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.542005]   Initiator: CPU
> [ 1476.542006]   Error type: UE [Load/Store]
> [ 1476.542006]     Effective address: 00007fffd2755940
> [ 1476.542007]     Physical address:  000020181a080000
> [ 1476.542010] Severe Machine check interrupt [Recovered]
> [ 1476.542012]   NIP: [000000001002588c] PID: 7109 Comm: find
> [ 1476.542013]   Initiator: CPU
> [ 1476.542014]   Error type: UE [Load/Store]
> [ 1476.542015]     Effective address: 00007fffd2755940
> [ 1476.542016]     Physical address:  000020181a080000
> [ 1476.542448] Memory failure: 0x20181a08: recovery action for dirty LRU page: Recovered
> [ 1476.542452] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542453] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542454] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542455] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542456] Memory failure: 0x20181a08: already hardware poisoned
> [ 1476.542457] Memory failure: 0x20181a08: already hardware poisoned
> [...]
> [ 1490.972174] Watchdog CPU:38 Hard LOCKUP
> 
> After this patch we see:
> 
> [  325.384336] Severe Machine check interrupt [Not recovered]
> [  325.384338]   NIP: [00007fffaae585f4] PID: 7168 Comm: find
> [  325.384339]   Initiator: CPU
> [  325.384341]   Error type: UE [Load/Store]
> [  325.384343]     Effective address: 00007fffaafe28ac
> [  325.384345]     Physical address:  00002017c0bd0000
> [  325.384350] find[7168]: unhandled signal 7 at 00007fffaae585f4 nip 00007fffaae585f4 lr 00007fffaae585e0 code 4
> [  325.388574] Memory failure: 0x2017c0bd: recovery action for dirty LRU page: Recovered
> 
> Fixes: 01eaac2b0591 ("powerpc/mce: Hookup ierror (instruction) UE errors")
> Fixes: ba41e1e1ccb9 ("powerpc/mce: Hookup derror (load/store) UE errors")
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> Reviewed-by: Balbir Singh <bsingharora@gmail.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/75ecfb49516c53da00c57b9efe48fa

cheers

Thread overview: 10+ messages
2018-04-23  4:59 [PATCH] powerpc/mce: Fix a bug where mce loops on memory UE Mahesh J Salgaonkar
2018-04-23  6:51 ` Balbir Singh
2018-04-23  9:23   ` Balbir Singh
2018-04-23 10:33   ` Mahesh Jagannath Salgaonkar
2018-04-23 11:14     ` Balbir Singh
2018-04-23 13:01       ` Nicholas Piggin
2018-04-23 23:00         ` Balbir Singh
2018-04-23 13:01       ` Mahesh Jagannath Salgaonkar
2018-04-23 23:41 ` Balbir Singh
2018-04-25  2:55 ` Michael Ellerman
