mm-commits.vger.kernel.org archive mirror
* + mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch added to -mm tree
@ 2022-01-27  2:27 akpm
From: akpm @ 2022-01-27  2:27 UTC
  To: mm-commits, naoya.horiguchi, tony.luck, youquan.song


The patch titled
     Subject: mm/hwpoison: fix error page recovered but reported "not recovered"
has been added to the -mm tree.  Its filename is
     mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch

This patch should soon appear at
    https://ozlabs.org/~akpm/mmots/broken-out/mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch
and later at
    https://ozlabs.org/~akpm/mmotm/broken-out/mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: mm/hwpoison: fix error page recovered but reported "not recovered"

When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a UCNA
signature, and the core reporting an SRAR signature machine check when
the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure(), and the machine check
processing code then finds the page already poisoned.  It calls
kill_accessing_process() to make sure a SIGBUS is sent, but that function
returns the wrong error code.

Console log looks like this:

[34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
[34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
[34775.690310] Memory failure: 0x3710b3: already hardware poisoned
[34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
[34775.706072] mce: Memory error not recovered

kill_accessing_process() is supposed to return -EHWPOISON to notify the
caller that SIGBUS has already been sent to the process, so that
kill_me_maybe() doesn't have to send it again.  The current code fails to
do this, so fix it to work as intended.  This change avoids the spurious
"Memory error not recovered" message and skips the duplicate SIGBUS.

[tony.luck@intel.com: reword some parts of commit message]
Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Youquan Song <youquan.song@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/memory-failure.c~mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered
+++ a/mm/memory-failure.c
@@ -707,8 +707,10 @@ static int kill_accessing_process(struct
 			      (void *)&priv);
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
+	else
+		ret = 0;
 	mmap_read_unlock(p->mm);
-	return ret ? -EFAULT : -EHWPOISON;
+	return ret > 0 ? -EHWPOISON : -EFAULT;
 }
 
 static const char *action_name[] = {
_
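
The return-value change in the hunk above can be modelled in userspace.
This is only a sketch, not kernel code: the function names are
hypothetical, and it assumes the page walk yields a negative value on
error, 0 when the poisoned page was not found, and 1 when it was found.

```c
#include <assert.h>
#include <errno.h>

/* EHWPOISON is Linux-specific; this fallback only keeps the sketch portable. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/*
 * Hypothetical userspace model of the return mapping.
 * walk_ret: assumed page-walk result (<0 error, 0 not found, 1 found).
 * addr:     user virtual address recorded for the poisoned page, 0 if none.
 */

/* Before the patch: any non-zero walk result, including "found", is -EFAULT. */
int model_before(int walk_ret, unsigned long addr)
{
	(void)addr;	/* the old return expression did not look at the address */
	return walk_ret ? -EFAULT : -EHWPOISON;
}

/* After the patch: -EHWPOISON only when SIGBUS was actually sent. */
int model_after(int walk_ret, unsigned long addr)
{
	int sigbus_sent = (walk_ret == 1 && addr != 0);

	return sigbus_sent ? -EHWPOISON : -EFAULT;
}

/* Sanity checks for the cases described in the changelog. */
void model_check(void)
{
	/* SIGBUS sent, but the old code reported a failure. */
	assert(model_before(1, 0x1000UL) == -EFAULT);
	/* SIGBUS sent, the new code reports "already handled". */
	assert(model_after(1, 0x1000UL) == -EHWPOISON);
	/* Nothing found or walk error: no SIGBUS was sent. */
	assert(model_after(0, 0) == -EFAULT);
	assert(model_after(-EINVAL, 0) == -EFAULT);
}
```

With the old expression, the "found and signalled" case came back as
-EFAULT, which matches the "Memory error not recovered" line in the
console log above.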

Patches currently in -mm which might be from naoya.horiguchi@nec.com are

mm-hwpoison-remove-obsolete-comment.patch
mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch



* + mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch added to -mm tree
@ 2022-01-07 23:30 akpm
From: akpm @ 2022-01-07 23:30 UTC
  To: mm-commits, naoya.horiguchi, tony.luck, youquan.song


The patch titled
     Subject: mm/hwpoison: fix error page recovered but reported "not recovered"
has been added to the -mm tree.  Its filename is
     mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch

------------------------------------------------------
From: Youquan Song <youquan.song@intel.com>
Subject: mm/hwpoison: fix error page recovered but reported "not recovered"

When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a UCNA
signature, and the core reporting an SRAR signature machine check when
the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure(), and the machine check
processing code then finds the page already poisoned.  It calls
kill_accessing_process() to make sure a SIGBUS is sent, but that function
returns the wrong error code.

Console log looks like this:

[34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
[34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
[34775.690310] Memory failure: 0x3710b3: already hardware poisoned
[34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
[34775.706072] mce: Memory error not recovered

Fix kill_accessing_process() to return -EHWPOISON to avoid the noise
message "Memory error not recovered" and skip duplicate SIGBUS.

[Tony: Reworded some parts of commit message]

Link: https://lkml.kernel.org/r/20220107194450.1687264-1-tony.luck@intel.com
Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/memory-failure.c~mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered
+++ a/mm/memory-failure.c
@@ -708,7 +708,8 @@ static int kill_accessing_process(struct
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
 	mmap_read_unlock(p->mm);
-	return ret ? -EFAULT : -EHWPOISON;
+
+	return (ret < 0) ? -EFAULT : -EHWPOISON;
 }
 
 static const char *action_name[] = {
_
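
For comparison, the return expression in this 2022-01-07 revision differs
from the one in the 2022-01-27 revision earlier on this page in how it
treats a walk that finds nothing.  The sketch below is a hypothetical
userspace model (function names are invented, and the meaning of the walk
result, <0 error, 0 not found, 1 found, is an assumption):

```c
#include <assert.h>
#include <errno.h>

/* EHWPOISON is Linux-specific; this fallback only keeps the sketch portable. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Return expression from the 2022-01-07 revision. */
int v1_result(int walk_ret)
{
	return (walk_ret < 0) ? -EFAULT : -EHWPOISON;
}

/* Return logic from the 2022-01-27 revision, for comparison. */
int v2_result(int walk_ret, unsigned long addr)
{
	if (!(walk_ret == 1 && addr != 0))
		walk_ret = 0;	/* no SIGBUS was sent */
	return walk_ret > 0 ? -EHWPOISON : -EFAULT;
}

/* Checks showing where the two revisions agree and where they differ. */
void revisions_check(void)
{
	/* Both agree when the page was found and SIGBUS was sent. */
	assert(v1_result(1) == -EHWPOISON);
	assert(v2_result(1, 0x1000UL) == -EHWPOISON);
	/* They differ when the walk completes without finding the page. */
	assert(v1_result(0) == -EHWPOISON);
	assert(v2_result(0, 0) == -EFAULT);
}
```

In this model, the earlier revision reports -EHWPOISON even when no SIGBUS
was delivered, whereas the later one reserves -EHWPOISON for the case in
which kill_proc() was actually called.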

Patches currently in -mm which might be from youquan.song@intel.com are

mm-hwpoison-fix-error-page-recovered-but-reported-not-recovered.patch


