* [PATCH rasdaemon 0/2] ras-page-isolation bugfix
@ 2020-10-31  9:57 lvying6
  2020-10-31  9:57 ` [PATCH rasdaemon 1/2] ras-page-isolation: fix do_page_offline always considers page offline is successful lvying6
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: lvying6 @ 2020-10-31  9:57 UTC (permalink / raw)
  To: mchehab+huawei, linux-edac; +Cc: fanwentao

This patchset fixes two problems in ras-page-isolation.c:
1. do_page_offline always treats the kernel page offline as successful,
   even when it fails
2. a page in the PAGE_OFFLINE_FAILED state can never be offlined again

lvying (1):
  ras-page-isolation: fix do_page_offline always considers page offline
    is successful

lvying6 (1):
  ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined
    again

 ras-page-isolation.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

-- 
1.8.3.1



* [PATCH rasdaemon 1/2] ras-page-isolation: fix do_page_offline always considers page offline is successful
  2020-10-31  9:57 [PATCH rasdaemon 0/2] ras-page-isolation bugfix lvying6
@ 2020-10-31  9:57 ` lvying6
  2020-10-31  9:57 ` [PATCH rasdaemon 2/2] ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again lvying6
  2020-12-23  9:47 ` [PATCH rasdaemon 0/2] ras-page-isolation bugfix Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: lvying6 @ 2020-10-31  9:57 UTC (permalink / raw)
  To: mchehab+huawei, linux-edac; +Cc: fanwentao

From: lvying <lvying6@huawei.com>

do_page_offline always considers the page offline successful, even if the
kernel failed to soft/hard offline the page.

I set PAGE_CE_THRESHOLD="1" in /etc/sysconfig/rasdaemon, i.e. as soon as a
Corrected Error occurs at a page's address, rasdaemon triggers a soft
offline of that page. I also applied a livepatch to the kernel's
store_soft_offline_page to observe this function's return value.

When I inject a CE at address 0x3f7ec30000, the kernel log shows:
soft_offline: 0x3f7ec30: unknown non LRU page type ffffe0000000000 ()
[store_soft_offline_page]return from soft_offline_page: -5

rasdaemon log:
rasdaemon[73711]: cpu 00:rasdaemon: Corrected Errors at 0x3f7ec30000 exceed threshold
rasdaemon[73711]: rasdaemon: Result of offlining page at 0x3f7ec30000: offlined

At the same time, I used strace to record rasdaemon's system calls:
strace -p 73711
openat(AT_FDCWD, "/sys/devices/system/memory/soft_offline_page",
	O_WRONLY|O_CREAT|O_TRUNC, 0666) = 28
fstat(28, {st_mode=S_IFREG|0200, st_size=4096, ...}) = 0
write(28, "0x3f7ec30000", 12)           = -1 EIO (Input/output error)
close(28)                               = 0

So the kernel actually failed to soft offline pfn 0x3f7ec30, and
store_soft_offline_page returned -EIO. However, rasdaemon always considers
the page offline successful.
According to the strace output, ferror is unaware of the failure of the
write syscall. So I changed the fopen-fprintf-ferror-fclose sequence to an
open-write-close sequence, which can detect the failure of the write
syscall.
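
A likely explanation for the strace observation, sketched here only as an
illustration: stdio buffers the fprintf output in user space, so ferror is
sampled before any write(2) has been issued, and the failing write happens
only when fclose flushes the buffer. The helper names buffered_attempt and
direct_attempt are hypothetical and not part of rasdaemon; /dev/full (a
Linux device whose writes fail with ENOSPC) is an assumed stand-in for the
sysfs file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Buffered variant, mirroring the old fopen-fprintf-ferror-fclose flow.
 * Returns the ferror() verdict sampled before fclose(); *close_rc receives
 * the fclose() result, which is where the buffered data is flushed.
 */
static int buffered_attempt(const char *path, unsigned long long addr,
			    int *close_rc)
{
	FILE *f = fopen(path, "w");
	int err;

	if (!f)
		return -1;

	fprintf(f, "%#llx", addr);	/* data only reaches the stdio buffer */
	err = ferror(f) ? -1 : 0;	/* no write(2) has been issued yet */
	*close_rc = fclose(f);		/* flush happens here; EOF on failure */
	return err;
}

/*
 * Unbuffered variant, mirroring the new open-write-close flow.
 * Returns 0 on success and -1 if open() or write() fails.
 */
static int direct_attempt(const char *path, unsigned long long addr)
{
	char buf[32];
	ssize_t rc;
	int fd, len;

	fd = open(path, O_WRONLY);
	if (fd == -1)
		return -1;

	len = snprintf(buf, sizeof(buf), "%#llx", addr);
	assert(len > 0 && (size_t)len < sizeof(buf));

	rc = write(fd, buf, (size_t)len);	/* failure is visible right away */
	close(fd);
	return rc < 0 ? -1 : 0;
}
```

Called with "/dev/full", buffered_attempt should report success from
ferror while fclose returns EOF, whereas direct_attempt reports the failure
directly. Checking the return value of fflush or fclose would be another
way to catch the error while keeping stdio.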

Signed-off-by: lvying <lvying6@huawei.com>
---
 ras-page-isolation.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/ras-page-isolation.c b/ras-page-isolation.c
index 50e4406..dc07545 100644
--- a/ras-page-isolation.c
+++ b/ras-page-isolation.c
@@ -17,6 +17,9 @@
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <errno.h>
 #include "ras-logger.h"
 #include "ras-page-isolation.h"
 
@@ -210,18 +213,22 @@ void ras_page_account_init(void)
 
 static int do_page_offline(unsigned long long addr, enum otype type)
 {
-	FILE *offline_file;
-	int err;
+	int fd, rc;
+	char buf[20];
 
-	offline_file = fopen(kernel_offline[type], "w");
-	if (!offline_file)
+	fd = open(kernel_offline[type], O_WRONLY);
+	if (fd == -1) {
+		log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, kernel_offline[type]);
 		return -1;
+	}
 
-	fprintf(offline_file, "%#llx", addr);
-	err = ferror(offline_file) ? -1 : 0;
-	fclose(offline_file);
-
-	return err;
+	sprintf(buf, "%#llx", addr);
+	rc = write(fd, buf, strlen(buf));
+	if (rc < 0) {
+		log(TERM, LOG_ERR, "page offline addr(%s) by %s failed, errno:%d\n", buf, kernel_offline[type], errno);
+	}
+	close(fd);
+	return rc;
 }
 
 static void page_offline(struct page_record *pr)
-- 
1.8.3.1



* [PATCH rasdaemon 2/2] ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again
  2020-10-31  9:57 [PATCH rasdaemon 0/2] ras-page-isolation bugfix lvying6
  2020-10-31  9:57 ` [PATCH rasdaemon 1/2] ras-page-isolation: fix do_page_offline always considers page offline is successful lvying6
@ 2020-10-31  9:57 ` lvying6
  2020-12-23  9:47 ` [PATCH rasdaemon 0/2] ras-page-isolation bugfix Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: lvying6 @ 2020-10-31  9:57 UTC (permalink / raw)
  To: mchehab+huawei, linux-edac; +Cc: fanwentao

The OS may fail to offline a page on a previous attempt. After some time,
the page's state may change so that the OS can offline it.
When Correctable Errors on this page then reach the threshold again,
rasdaemon should try to offline the page again.
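
The change in the skip condition can be sketched as follows. This is a
minimal illustration, not the rasdaemon code: the enum page_state and the
helpers skip_before and skip_after are hypothetical names, and only the
three state constants are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page state tracked in struct page_record. */
enum page_state { PAGE_ONLINE, PAGE_OFFLINE, PAGE_OFFLINE_FAILED };

/* Old check: anything that is not PAGE_ONLINE is skipped, so a page whose
 * offline attempt failed is never retried. */
static bool skip_before(enum page_state s)
{
	return s != PAGE_ONLINE;
}

/* New check: only pages that are really offlined are skipped, so a
 * PAGE_OFFLINE_FAILED page gets another attempt. */
static bool skip_after(enum page_state s)
{
	return s == PAGE_OFFLINE;
}

/* Small self-check of the difference between the two conditions. */
static void retry_policy_selfcheck(void)
{
	assert(skip_before(PAGE_OFFLINE_FAILED));
	assert(!skip_after(PAGE_OFFLINE_FAILED));
	assert(skip_before(PAGE_OFFLINE) && skip_after(PAGE_OFFLINE));
	assert(!skip_before(PAGE_ONLINE) && !skip_after(PAGE_ONLINE));
}
```

The two conditions differ only for PAGE_OFFLINE_FAILED, which is the case
this patch targets.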

Signed-off-by: lvying6 <lvying6@huawei.com>
---
 ras-page-isolation.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ras-page-isolation.c b/ras-page-isolation.c
index dc07545..fd7bd70 100644
--- a/ras-page-isolation.c
+++ b/ras-page-isolation.c
@@ -237,12 +237,17 @@ static void page_offline(struct page_record *pr)
 	int ret;
 
 	/* Offlining page is not required */
-	if (offline <= OFFLINE_ACCOUNT)
+	if (offline <= OFFLINE_ACCOUNT) {
+		log(TERM, LOG_INFO, "PAGE_CE_ACTION=%s, ignore to offline page at %#llx\n",
+				offline_choice[offline].name, addr);
 		return;
+	}
 
 	/* Ignore offlined pages */
-	if (pr->offlined != PAGE_ONLINE)
+	if (pr->offlined == PAGE_OFFLINE) {
+		log(TERM, LOG_INFO, "page at %#llx is already offlined, ignore\n", addr);
 		return;
+	}
 
 	/* Time to silence this noisy page */
 	if (offline == OFFLINE_SOFT_THEN_HARD) {
-- 
1.8.3.1



* Re: [PATCH rasdaemon 0/2] ras-page-isolation bugfix
  2020-10-31  9:57 [PATCH rasdaemon 0/2] ras-page-isolation bugfix lvying6
  2020-10-31  9:57 ` [PATCH rasdaemon 1/2] ras-page-isolation: fix do_page_offline always considers page offline is successful lvying6
  2020-10-31  9:57 ` [PATCH rasdaemon 2/2] ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again lvying6
@ 2020-12-23  9:47 ` Mauro Carvalho Chehab
  2 siblings, 0 replies; 4+ messages in thread
From: Mauro Carvalho Chehab @ 2020-12-23  9:47 UTC (permalink / raw)
  To: lvying6; +Cc: linux-edac, fanwentao

Em Sat, 31 Oct 2020 17:57:13 +0800
lvying6 <lvying6@huawei.com> escreveu:

> This patchset fix two problems in ras-page-isolation.c:
> 1. fix do_page_offline always considers kernel page offline is
> successful
> 2. fix page which is PAGE_OFFLINE_FAILED can not be offlined again
> 
> lvying (1):
>   ras-page-isolation: fix do_page_offline always considers page offline
>     is successful
> 
> lvying6 (1):
>   ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined
>     again
> 
>  ras-page-isolation.c | 34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
> 

Patches applied, thanks!


Thanks,
Mauro
