linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm/hwpoison: Fix error page recovered but reported "not recovered"
@ 2022-01-07 19:44 Tony Luck
  2022-01-12 12:11 ` Naoya Horiguchi
  0 siblings, 1 reply; 4+ messages in thread
From: Tony Luck @ 2022-01-07 19:44 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, linux-mm, linux-kernel, Youquan Song, Tony Luck

From: Youquan Song <youquan.song@intel.com>

When an uncorrected memory error is consumed there is a race between
the CMCI from the memory controller reporting an uncorrected error
with a UCNA signature, and the core reporting and SRAR signature
machine check when the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure() and the machine
check processing code finds the page already poisoned. It calls
kill_accessing_process() to make sure a SIGBUS is sent. But
returns the wrong error code.

Console log looks like this:

[34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
[34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
[34775.690310] Memory failure: 0x3710b3: already hardware poisoned
[34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
[34775.706072] mce: Memory error not recovered

Fix kill_accessing_process() to return -EHWPOISON to avoid the noise
message "Memory error not recovered" and skip duplicate SIGBUS.

[Tony: Reworded some parts of commit message]

Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

This code is very subtle ... the fix makes the "not recovered" message
go away ... but I'm not more than 75% confident that this is the right
fix. Please check carefully :-)

 mm/memory-failure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3a274468f193..a67f558b08ea 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -707,7 +707,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
 	mmap_read_unlock(p->mm);
-	return ret ? -EFAULT : -EHWPOISON;
+
+	return (ret < 0) ? -EFAULT : -EHWPOISON;
 }
 
 static const char *action_name[] = {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-01-13  6:52 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-07 19:44 [PATCH] mm/hwpoison: Fix error page recovered but reported "not recovered" Tony Luck
2022-01-12 12:11 ` Naoya Horiguchi
2022-01-13  2:00   ` Luck, Tony
2022-01-13  6:52     ` HORIGUCHI NAOYA(堀口 直也)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).