All of lore.kernel.org
 help / color / mirror / Atom feed
* Some RAS bug fix patches
@ 2014-07-16  2:34 ` Chen, Gong
  0 siblings, 0 replies; 18+ messages in thread
From: Chen, Gong @ 2014-07-16  2:34 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: linux-acpi, linux-mm, x86

[PATCH 1/3] APEI, GHES: Cleanup unnecessary function for lock-less
[RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
[PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status

The patch 1/3 & 3/3 are minor fixes which are irrelevant with patch
2/3. I send them together just to avoid fragments.

The patch 2/3 is a RFC patch depending on the following thread:
https://lkml.org/lkml/2014/6/27/26

Comments and suggestions are welcome.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Some RAS bug fix patches
@ 2014-07-16  2:34 ` Chen, Gong
  0 siblings, 0 replies; 18+ messages in thread
From: Chen, Gong @ 2014-07-16  2:34 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: linux-acpi, linux-mm, x86

[PATCH 1/3] APEI, GHES: Cleanup unnecessary function for lock-less
[RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
[PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status

The patch 1/3 & 3/3 are minor fixes which are irrelevant with patch
2/3. I send them together just to avoid fragments.

The patch 2/3 is a RFC patch depending on the following thread:
https://lkml.org/lkml/2014/6/27/26

Comments and suggestions are welcome.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] APEI, GHES: Cleanup unnecessary function for lock-less list
  2014-07-16  2:34 ` Chen, Gong
  (?)
@ 2014-07-16  2:34 ` Chen, Gong
  2014-07-20  8:01   ` Borislav Petkov
  -1 siblings, 1 reply; 18+ messages in thread
From: Chen, Gong @ 2014-07-16  2:34 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: linux-acpi, linux-mm, x86, Chen, Gong

We have provided a reverse function for lock-less list so delete
uncessary codes.

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
---
 drivers/acpi/apei/ghes.c | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index dab7cb7..1f9fba9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -734,20 +734,6 @@ static int ghes_notify_sci(struct notifier_block *this,
 	return ret;
 }
 
-static struct llist_node *llist_nodes_reverse(struct llist_node *llnode)
-{
-	struct llist_node *next, *tail = NULL;
-
-	while (llnode) {
-		next = llnode->next;
-		llnode->next = tail;
-		tail = llnode;
-		llnode = next;
-	}
-
-	return tail;
-}
-
 static void ghes_proc_in_irq(struct irq_work *irq_work)
 {
 	struct llist_node *llnode, *next;
@@ -761,7 +747,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
 	 * Because the time order of estatus in list is reversed,
 	 * revert it back to proper order.
 	 */
-	llnode = llist_nodes_reverse(llnode);
+	llnode = llist_reverse_order(llnode);
 	while (llnode) {
 		next = llnode->next;
 		estatus_node = llist_entry(llnode, struct ghes_estatus_node,
@@ -794,7 +780,7 @@ static void ghes_print_queued_estatus(void)
 	 * Because the time order of estatus in list is reversed,
 	 * revert it back to proper order.
 	 */
-	llnode = llist_nodes_reverse(llnode);
+	llnode = llist_reverse_order(llnode);
 	while (llnode) {
 		estatus_node = llist_entry(llnode, struct ghes_estatus_node,
 					   llnode);
-- 
2.0.0.rc2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-16  2:34 ` Chen, Gong
  (?)
  (?)
@ 2014-07-16  2:34 ` Chen, Gong
  2014-07-21  8:47   ` Borislav Petkov
  -1 siblings, 1 reply; 18+ messages in thread
From: Chen, Gong @ 2014-07-16  2:34 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: linux-acpi, linux-mm, x86, Chen, Gong

When Uncorrected error happens, an MCE will be raised. Some
notification callbacks will be called in MCE context. If
some notification call printk it will cause potential
deadlock because MCE can preempt normal interrupts like NMI does.

Since printk is not safe in MCE context. So a lock-less memory
allocator (genpool) is used to save information which are
organized via a lock-less list. Print will be delayed into IRQ
context via irq_work. This idea is inspired by APEI/GHES driver.

Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
---
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/asm/mce.h                |   2 +-
 arch/x86/kernel/cpu/mcheck/Makefile       |   2 +-
 arch/x86/kernel/cpu/mcheck/mce-apei.c     |   2 +-
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 109 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   9 +++
 arch/x86/kernel/cpu/mcheck/mce.c          |  21 ++++--
 arch/x86/kernel/cpu/mcheck/mce_amd.c      |   2 +-
 8 files changed, 137 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mcheck/mce-genpool.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a8f749e..2632732 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -861,6 +861,7 @@ config X86_REROUTE_FOR_BROKEN_BOOT_IRQS
 
 config X86_MCE
 	bool "Machine Check / overheating reporting"
+	select GENERIC_ALLOCATOR
 	default y
 	---help---
 	  Machine Check support allows the processor to notify the
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 958b90f..e90d8ac 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -141,7 +141,7 @@ static inline void enable_p5_mce(void) {}
 #endif
 
 void mce_setup(struct mce *m);
-void mce_log(struct mce *m);
+void mce_log(struct mce *m, int);
 DECLARE_PER_CPU(struct device *, mce_device);
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
index bb34b03..a3311c8 100644
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ b/arch/x86/kernel/cpu/mcheck/Makefile
@@ -1,4 +1,4 @@
-obj-y				=  mce.o mce-severity.o
+obj-y				=  mce.o mce-severity.o mce-genpool.o
 
 obj-$(CONFIG_X86_ANCIENT_MCE)	+= winchip.o p5.o
 obj-$(CONFIG_X86_MCE_INTEL)	+= mce_intel.o
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index a1aef95..7b92089 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -56,7 +56,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 		m.status |= MCI_STATUS_PCC;
 
 	m.addr = mem_err->physical_addr;
-	mce_log(&m);
+	mce_log(&m, 0);
 	mce_notify_irq();
 }
 EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
new file mode 100644
index 0000000..1c08c37
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -0,0 +1,109 @@
+/*
+ * Print memory pool management in MCE context
+ *
+ * Copyright (C) 2014 Intel Corp.
+ * Author: Chen, Gong <gong.chen@linux.intel.com>
+ *
+ * This file is licensed under GPLv2.
+ */
+#include <linux/mm.h>
+#include <linux/genalloc.h>
+#include <linux/llist.h>
+#include <linux/irq_work.h>
+#include "mce-internal.h"
+
+/*
+ * printk is not safe in MCE context. So a lock-less memory allocator
+ * (genpool) is used to save information which are organized via a lock-less
+ * list. Print will be delayed into IRQ context via irq_work.
+ */
+static struct gen_pool *mce_evt_pool;
+static struct llist_head mce_event_llist;
+static struct irq_work mce_irqwork;
+static int mce_evt_len;
+static bool pool_inactive;
+
+/*
+ * This memory pool is only to be used to save MCE records in MCE context.
+ * The MCE event is rare so a fixed size memory pool should be enough.
+ */
+static int mce_genpool_create(void)
+{
+	int i, ret, pages;
+	unsigned long addr;
+
+	mce_evt_len = sizeof(struct mce_evt_llist);
+	mce_evt_pool = gen_pool_create(ilog2(mce_evt_len), -1);
+	if (!mce_evt_pool)
+		return -ENOMEM;
+
+	/* two pages should be enough */
+	pages = 2;
+	for (i = 0; i < pages; i++) {
+		addr = __get_free_page(GFP_KERNEL);
+		if (!addr)
+			return -ENOMEM;
+		ret = gen_pool_add(mce_evt_pool, addr, PAGE_SIZE, -1);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static void free_chunk_page(struct gen_pool *pool,
+		struct gen_pool_chunk *chunk, void *data)
+{
+	free_page(chunk->start_addr);
+}
+
+static void mce_evt_pool_destroy(void)
+{
+	gen_pool_for_each_chunk(mce_evt_pool, free_chunk_page, NULL);
+	gen_pool_destroy(mce_evt_pool);
+}
+
+static void do_mce_irqwork(struct irq_work *irq_work)
+{
+	struct llist_node *llnode, *next;
+	struct mce_evt_llist *node;
+	struct mce *mce;
+
+	llnode = llist_del_all(&mce_event_llist);
+	/* print MCE event record from the oldest */
+	llnode = llist_reverse_order(llnode);
+	while (llnode) {
+		next = llnode->next;
+		node = llist_entry(llnode, struct mce_evt_llist, llnode);
+		mce = &node->mce;
+		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+		gen_pool_free(mce_evt_pool, (unsigned long)node, mce_evt_len);
+		llnode = next;
+	}
+}
+
+void mce_evt_save_ll(struct mce *mce)
+{
+	struct mce_evt_llist *node;
+
+	if (unlikely(pool_inactive))
+		return;
+
+	node = (void *)gen_pool_alloc(mce_evt_pool, mce_evt_len);
+	if (node) {
+		memcpy(&node->mce, mce, mce_evt_len);
+		llist_add(&node->llnode, &mce_event_llist);
+		irq_work_queue(&mce_irqwork);
+	}
+}
+
+void __init mce_genpool_init(void)
+{
+	if (mce_genpool_create() != 0)
+		pool_inactive = true;
+
+	if (pool_inactive)
+		mce_evt_pool_destroy();
+	else
+		init_irq_work(&mce_irqwork, do_mce_irqwork);
+}
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 09edd0b..005a688 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -11,6 +11,8 @@ enum severity_level {
 	MCE_PANIC_SEVERITY,
 };
 
+extern struct atomic_notifier_head x86_mce_decoder_chain;
+
 #define ATTR_LEN		16
 
 /* One object for each MCE bank, shared by all CPUs */
@@ -21,6 +23,13 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
+struct mce_evt_llist {
+	struct llist_node llnode;
+	struct mce mce;
+};
+
+void mce_evt_save_ll(struct mce *mce);
+void __init mce_genpool_init(void);
 int mce_severity(struct mce *a, int tolerant, char **msg);
 struct dentry *mce_get_debugfs_dir(void);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bb92f38..dcca519 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -147,7 +147,7 @@ static struct mce_log mcelog = {
 	.recordlen	= sizeof(struct mce),
 };
 
-void mce_log(struct mce *mce)
+void mce_log(struct mce *mce, int in_mce)
 {
 	unsigned next, entry;
 	int ret = 0;
@@ -155,9 +155,14 @@ void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
-	if (ret == NOTIFY_STOP)
-		return;
+	if (in_mce)
+		mce_evt_save_ll(mce);
+	else {
+		ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0,
+						 mce);
+		if (ret == NOTIFY_STOP)
+			return;
+	}
 
 	mce->finished = 0;
 	wmb();
@@ -635,7 +640,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		 * have anything to do with the actual error location.
 		 */
 		if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
-			mce_log(&m);
+			mce_log(&m, 0);
 
 		/*
 		 * Clear state for this bank.
@@ -1124,7 +1129,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
 			mce_ring_add(m.addr >> PAGE_SHIFT);
 
-		mce_log(&m);
+		mce_log(&m, 1);
 
 		if (severity > worst) {
 			*final = m;
@@ -1254,7 +1259,7 @@ void mce_log_therm_throt_event(__u64 status)
 	mce_setup(&m);
 	m.bank = MCE_THERMAL_BANK;
 	m.status = status;
-	mce_log(&m);
+	mce_log(&m, 0);
 }
 #endif /* CONFIG_X86_MCE_INTEL */
 
@@ -2466,6 +2471,8 @@ static __init int mcheck_init_device(void)
 	if (err)
 		goto err_register;
 
+	mce_genpool_init();
+
 	return 0;
 
 err_register:
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 603df4f..ca64641 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -319,7 +319,7 @@ static void amd_threshold_interrupt(void)
 				m.bank = K8_MCE_THRESHOLD_BASE
 				       + bank * NR_BLOCKS
 				       + block;
-				mce_log(&m);
+				mce_log(&m, 0);
 				return;
 			}
 		}
-- 
2.0.0.rc2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status
  2014-07-16  2:34 ` Chen, Gong
                   ` (2 preceding siblings ...)
  (?)
@ 2014-07-16  2:34 ` Chen, Gong
  2014-07-16 19:57     ` Naoya Horiguchi
  -1 siblings, 1 reply; 18+ messages in thread
From: Chen, Gong @ 2014-07-16  2:34 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: linux-acpi, linux-mm, x86, Chen, Gong

When Uncorrected error happens, if the poisoned page is referenced
by more than one user after error recovery, the recovery is not
successful. But currently the display result is wrong.
Before this patch:

MCE 0x44e336: dirty mlocked LRU page recovery: Recovered
MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
mce: Memory error not recovered

After this patch:

MCE 0x44e336: dirty mlocked LRU page recovery: Failed
MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
mce: Memory error not recovered

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c6399e3..2985861 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -860,7 +860,6 @@ static int page_action(struct page_state *ps, struct page *p,
 	int count;
 
 	result = ps->action(p, pfn);
-	action_result(pfn, ps->msg, result);
 
 	count = page_count(p) - 1;
 	if (ps->action == me_swapcache_dirty && result == DELAYED)
@@ -871,6 +870,7 @@ static int page_action(struct page_state *ps, struct page *p,
 		       pfn, ps->msg, count);
 		result = FAILED;
 	}
+	action_result(pfn, ps->msg, result);
 
 	/* Could do more checks here if page looks ok */
 	/*
-- 
2.0.0.rc2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status
  2014-07-16  2:34 ` [PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status Chen, Gong
@ 2014-07-16 19:57     ` Naoya Horiguchi
  0 siblings, 0 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2014-07-16 19:57 UTC (permalink / raw)
  To: Chen, Gong; +Cc: tony.luck, bp, linux-acpi, linux-mm, x86

On Tue, Jul 15, 2014 at 10:34:42PM -0400, Chen, Gong wrote:
> When Uncorrected error happens, if the poisoned page is referenced
> by more than one user after error recovery, the recovery is not
> successful. But currently the display result is wrong.
> Before this patch:
> 
> MCE 0x44e336: dirty mlocked LRU page recovery: Recovered
> MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
> mce: Memory error not recovered
> 
> After this patch:
> 
> MCE 0x44e336: dirty mlocked LRU page recovery: Failed
> MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
> mce: Memory error not recovered
> 
> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> ---
>  mm/memory-failure.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index c6399e3..2985861 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -860,7 +860,6 @@ static int page_action(struct page_state *ps, struct page *p,
>  	int count;
>  
>  	result = ps->action(p, pfn);
> -	action_result(pfn, ps->msg, result);
>  
>  	count = page_count(p) - 1;
>  	if (ps->action == me_swapcache_dirty && result == DELAYED)
> @@ -871,6 +870,7 @@ static int page_action(struct page_state *ps, struct page *p,
>  		       pfn, ps->msg, count);
>  		result = FAILED;
>  	}
> +	action_result(pfn, ps->msg, result);
>  
>  	/* Could do more checks here if page looks ok */
>  	/*
> -- 
> 2.0.0.rc2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status
@ 2014-07-16 19:57     ` Naoya Horiguchi
  0 siblings, 0 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2014-07-16 19:57 UTC (permalink / raw)
  To: Chen, Gong; +Cc: tony.luck, bp, linux-acpi, linux-mm, x86

On Tue, Jul 15, 2014 at 10:34:42PM -0400, Chen, Gong wrote:
> When Uncorrected error happens, if the poisoned page is referenced
> by more than one user after error recovery, the recovery is not
> successful. But currently the display result is wrong.
> Before this patch:
> 
> MCE 0x44e336: dirty mlocked LRU page recovery: Recovered
> MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
> mce: Memory error not recovered
> 
> After this patch:
> 
> MCE 0x44e336: dirty mlocked LRU page recovery: Failed
> MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
> mce: Memory error not recovered
> 
> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> ---
>  mm/memory-failure.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index c6399e3..2985861 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -860,7 +860,6 @@ static int page_action(struct page_state *ps, struct page *p,
>  	int count;
>  
>  	result = ps->action(p, pfn);
> -	action_result(pfn, ps->msg, result);
>  
>  	count = page_count(p) - 1;
>  	if (ps->action == me_swapcache_dirty && result == DELAYED)
> @@ -871,6 +870,7 @@ static int page_action(struct page_state *ps, struct page *p,
>  		       pfn, ps->msg, count);
>  		result = FAILED;
>  	}
> +	action_result(pfn, ps->msg, result);
>  
>  	/* Could do more checks here if page looks ok */
>  	/*
> -- 
> 2.0.0.rc2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Some RAS bug fix patches
  2014-07-16  2:34 ` Chen, Gong
                   ` (3 preceding siblings ...)
  (?)
@ 2014-07-19  8:05 ` Chen, Gong
  -1 siblings, 0 replies; 18+ messages in thread
From: Chen, Gong @ 2014-07-19  8:05 UTC (permalink / raw)
  To: tony.luck, bp; +Cc: linux-acpi, linux-mm, x86

[-- Attachment #1: Type: text/plain, Size: 858 bytes --]

On Tue, Jul 15, 2014 at 10:34:39PM -0400, Chen, Gong wrote:
> Date: Tue, 15 Jul 2014 22:34:39 -0400
> From: "Chen, Gong" <gong.chen@linux.intel.com>
> To: tony.luck@intel.com, bp@alien8.de
> Cc: linux-acpi@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
> Subject: Some RAS bug fix patches
> X-Mailer: git-send-email 2.0.0.rc2
> 
> [PATCH 1/3] APEI, GHES: Cleanup unnecessary function for lock-less
> [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
> [PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status
> 
> The patch 1/3 & 3/3 are minor fixes which are irrelevant with patch
> 2/3. I send them together just to avoid fragments.
> 
> The patch 2/3 is a RFC patch depending on the following thread:
> https://lkml.org/lkml/2014/6/27/26
> 
> Comments and suggestions are welcome.
> 
Hi, Boris

Any comments?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/3] APEI, GHES: Cleanup unnecessary function for lock-less list
  2014-07-16  2:34 ` [PATCH 1/3] APEI, GHES: Cleanup unnecessary function for lock-less list Chen, Gong
@ 2014-07-20  8:01   ` Borislav Petkov
  0 siblings, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2014-07-20  8:01 UTC (permalink / raw)
  To: Chen, Gong; +Cc: tony.luck, linux-acpi, linux-mm, x86

On Tue, Jul 15, 2014 at 10:34:40PM -0400, Chen, Gong wrote:
> We have provided a reverse function for lock-less list so delete
> uncessary codes.
> 
> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-16  2:34 ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
@ 2014-07-21  8:47   ` Borislav Petkov
  2014-07-21 17:14     ` Luck, Tony
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2014-07-21  8:47 UTC (permalink / raw)
  To: Chen, Gong; +Cc: tony.luck, linux-acpi, linux-mm, x86

On Tue, Jul 15, 2014 at 10:34:41PM -0400, Chen, Gong wrote:
> When Uncorrected error happens, an MCE will be raised. Some
> notification callbacks will be called in MCE context. If
> some notification call printk it will cause potential
> deadlock because MCE can preempt normal interrupts like NMI does.
> 
> Since printk is not safe in MCE context. So a lock-less memory
> allocator (genpool) is used to save information which are
> organized via a lock-less list. Print will be delayed into IRQ
> context via irq_work. This idea is inspired by APEI/GHES driver.

This patch is overengineered even though we already have both process
context work and irq work facilities in place.

We also already have mce_ring where we add MCE signatures in #MC
context. Well, only for AO errors with usable addresses for now, at
least.

And we empty that ring in mce_process_work().

I think it would be a *lot* simpler if you modify the logic to put all
errors into the ring and remove the call chain call from mce_log(). I'm
looking at mce_report_event() which even does the irq_work stuff if
we need to raise an IPI *during* the MCE so that stuff gets processed
*before* we return to userspace.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-21  8:47   ` Borislav Petkov
@ 2014-07-21 17:14     ` Luck, Tony
  2014-07-21 21:41       ` Borislav Petkov
  2014-07-23  7:48       ` Chen, Gong
  0 siblings, 2 replies; 18+ messages in thread
From: Luck, Tony @ 2014-07-21 17:14 UTC (permalink / raw)
  To: Borislav Petkov, Chen, Gong; +Cc: linux-acpi, linux-mm, x86

> This patch is overengineered even though we already have both process
> context work and irq work facilities in place.
>
> We also already have mce_ring where we add MCE signatures in #MC
> context. Well, only for AO errors with usable addresses for now, at
> least.

We've evolved a bunch of mechanisms:

1) mce_ring: to pass pfn for AO errors from MCE context to a work thread
2) mce_info: to pass pfn for AR errors from MCE context to same process running in process context
3) mce_log: to pass entire "mce" structures from any context (MCE, CMCI, or init-time) to /dev/mcelog

something simpler might be nice - but a generic thing that is overkill for each of the
specialized uses might not necessarily be an improvement.

E.g. #3 above has a fixed capacity (MCE_LOG_LEN) and just drops any extras if it should fill
up (deliberately, because we almost always prefer to see the first bunch of errors rather
than the newest).

> I think it would be a *lot* simpler if you modify the logic to put all
> errors into the ring and remove the call chain call from mce_log().

I was actually wondering about going in the other direction. Make the
/dev/mcelog code register a notifier on x86_mce_decoder_chain (and
perhaps move all the /dev/mcelog functions out of mce.c into an actual
driver file).  Then use Chen Gong's NMI safe code to just unconditionally
make safe copies of anything that gets passed to mce_log() and run all
the notifiers from his do_mce_irqwork().

-Tony

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-21 17:14     ` Luck, Tony
@ 2014-07-21 21:41       ` Borislav Petkov
  2014-07-21 22:03         ` Luck, Tony
  2014-07-23  7:48       ` Chen, Gong
  1 sibling, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2014-07-21 21:41 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Chen, Gong, linux-acpi, linux-mm, x86

On Mon, Jul 21, 2014 at 05:14:06PM +0000, Luck, Tony wrote:
> We've evolved a bunch of mechanisms:
> 
> 1) mce_ring: to pass pfn for AO errors from MCE context to a work thread
> 2) mce_info: to pass pfn for AR errors from MCE context to same process running in process context
> 3) mce_log: to pass entire "mce" structures from any context (MCE, CMCI, or init-time) to /dev/mcelog
> 
> something simpler might be nice - but a generic thing that is overkill for each of the
> specialized uses might not necessarily be an improvement.
> 
> E.g. #3 above has a fixed capacity (MCE_LOG_LEN) and just drops any extras if it should fill

Gong's too. Famous last words:

	/* two pages should be enough */
	pages = 2;

> up (deliberately, because we almost always prefer to see the first bunch of errors rather
> than the newest).
> 
> > I think it would be a *lot* simpler if you modify the logic to put all
> > errors into the ring and remove the call chain call from mce_log().
> 
> I was actually wondering about going in the other direction. Make the
> /dev/mcelog code register a notifier on x86_mce_decoder_chain (and
> perhaps move all the /dev/mcelog functions out of mce.c into an actual
> driver file).

For easier deletion later. :-P

> Then use Chen Gong's NMI safe code to just unconditionally make safe
> copies of anything that gets passed to mce_log() and run all the
> notifiers from his do_mce_irqwork().

And drop all the homegrown other stuff like mce_ring and all? If this
gets designed right and it is well thought out - not hastily coded out -
it will probably be better, yes.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-21 21:41       ` Borislav Petkov
@ 2014-07-21 22:03         ` Luck, Tony
  2014-07-21 22:44           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE Tony Luck
  2014-07-22 17:26           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context Borislav Petkov
  0 siblings, 2 replies; 18+ messages in thread
From: Luck, Tony @ 2014-07-21 22:03 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Chen, Gong, linux-acpi, linux-mm, x86

> And drop all the homegrown other stuff like mce_ring and all?

mce_ring should be easy ... the "mce" structure has the address
from which we can easily get the pfn to pass into the action-optional
recovery path.  Only thing missing is a direct indication that this mce
does contain an AO error that needs to be processed. We could
re-invoke mce_severity() to figure it out again - or just add a flag
somewhere.

Not so sure about mce_info. This one passes from the
MCE context to the same task when we catch it in process
context (set_thread_flag(MCE_NOTIFY)).  Back when I was pushing
this code it, I really wanted to just add a field to the thread_info
structure to hold the address ... because this really is some
information that belongs to the thread. But I was unable to
convince people back then.  We must be able to find the
page frame when we arrive in mce_notify_process(). So we
can't stash it in some limited size pool of "mce" structures that
might decide to just drop this one.

-Tony


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE
  2014-07-21 22:03         ` Luck, Tony
@ 2014-07-21 22:44           ` Tony Luck
  2014-07-22 17:20             ` Borislav Petkov
  2014-07-22 17:26           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context Borislav Petkov
  1 sibling, 1 reply; 18+ messages in thread
From: Tony Luck @ 2014-07-21 22:44 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Chen, Gong, linux-acpi, linux-mm, x86


This is how much cleaner things could be with a couple of task_struct
fields instead of the mce_info silliness ... untested.

---
 arch/x86/kernel/cpu/mcheck/mce.c | 58 ++++------------------------------------
 include/linux/sched.h            |  4 +++
 2 files changed, 9 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bb92f38153b2..b08398e69b5c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -956,51 +956,6 @@ static void mce_clear_state(unsigned long *toclear)
 }
 
 /*
- * Need to save faulting physical address associated with a process
- * in the machine check handler some place where we can grab it back
- * later in mce_notify_process()
- */
-#define	MCE_INFO_MAX	16
-
-struct mce_info {
-	atomic_t		inuse;
-	struct task_struct	*t;
-	__u64			paddr;
-	int			restartable;
-} mce_info[MCE_INFO_MAX];
-
-static void mce_save_info(__u64 addr, int c)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++) {
-		if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
-			mi->t = current;
-			mi->paddr = addr;
-			mi->restartable = c;
-			return;
-		}
-	}
-
-	mce_panic("Too many concurrent recoverable errors", NULL, NULL);
-}
-
-static struct mce_info *mce_find_info(void)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++)
-		if (atomic_read(&mi->inuse) && mi->t == current)
-			return mi;
-	return NULL;
-}
-
-static void mce_clear_info(struct mce_info *mi)
-{
-	atomic_set(&mi->inuse, 0);
-}
-
-/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
@@ -1156,7 +1111,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 			mce_panic("Fatal machine check on current CPU", &m, msg);
 		if (worst == MCE_AR_SEVERITY) {
 			/* schedule action before return to userland */
-			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
+			current->paddr = m.addr;
+			current->restartable = m.mcgstatus & MCG_STATUS_RIPV;
 			set_thread_flag(TIF_MCE_NOTIFY);
 		} else if (kill_it) {
 			force_sig(SIGBUS, current);
@@ -1195,29 +1151,25 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 void mce_notify_process(void)
 {
 	unsigned long pfn;
-	struct mce_info *mi = mce_find_info();
 	int flags = MF_ACTION_REQUIRED;
 
-	if (!mi)
-		mce_panic("Lost physical address for unconsumed uncorrectable error", NULL, NULL);
-	pfn = mi->paddr >> PAGE_SHIFT;
+	pfn = current->paddr >> PAGE_SHIFT;
 
 	clear_thread_flag(TIF_MCE_NOTIFY);
 
 	pr_err("Uncorrected hardware memory error in user-access at %llx",
-		 mi->paddr);
+		 current->paddr);
 	/*
 	 * We must call memory_failure() here even if the current process is
 	 * doomed. We still need to mark the page as poisoned and alert any
 	 * other users of the page.
 	 */
-	if (!mi->restartable)
+	if (!current->restartable)
 		flags |= MF_MUST_KILL;
 	if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
 		pr_err("Memory error not recovered");
 		force_sig(SIGBUS, current);
 	}
-	mce_clear_info(mi);
 }
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0376b054a0d0..91db69a4acd7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1655,6 +1655,10 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	__u64	paddr;
+	int	restartable;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
-- 
1.8.4.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE
  2014-07-21 22:44           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE Tony Luck
@ 2014-07-22 17:20             ` Borislav Petkov
  0 siblings, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2014-07-22 17:20 UTC (permalink / raw)
  To: Tony Luck; +Cc: Chen, Gong, linux-acpi, linux-mm, x86

On Mon, Jul 21, 2014 at 03:44:06PM -0700, Tony Luck wrote:
> 
> This is how much cleaner things could be with a couple of task_struct
> fields instead of the mce_info silliness ... untested.

...

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 0376b054a0d0..91db69a4acd7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1655,6 +1655,10 @@ struct task_struct {
>  	unsigned int	sequential_io;
>  	unsigned int	sequential_io_avg;
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	__u64	paddr;
> +	int	restartable;
> +#endif

Right, I don't see anything wrong with this approach especially as
task_struct is full of CONFIG_* ifdeffery for members used with
different features.

Adding 12 more bytes for CONFIG_MEMORY_FAILURE shouldn't hurt anyone. If
we really want to save space, we can use the highest significant byte of
paddr for a bit to say "restartable" or not.

So I think we should make it into a patch and push it upstream.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-21 22:03         ` Luck, Tony
  2014-07-21 22:44           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE Tony Luck
@ 2014-07-22 17:26           ` Borislav Petkov
  2014-07-22 21:24             ` Tony Luck
  1 sibling, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2014-07-22 17:26 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Chen, Gong, linux-acpi, linux-mm, x86

On Mon, Jul 21, 2014 at 10:03:04PM +0000, Luck, Tony wrote:
> > And drop all the homegrown other stuff like mce_ring and all?
> 
> mce_ring should be easy ... the "mce" structure has the address
> from which we can easily get the pfn to pass into the action-optional
> recovery path.  Only thing missing is a direct indication that this mce
> does contain an AO error that needs to be processed. We could
> re-invoke mce_severity() to figure it out again - or just add a flag
> somewhere.

Right, so if we're going to clean up this mess, I think we should strive
for having a single ring buffer which contains all the MCEs and which we
can iterate over at leisure, either in IRQ context if some of them have
been reported through real exceptions or in process context if they're
simple CEs.

Once they've been eaten by something, we simply remove them from that
buffer and that's it. But sure, one lockless buffer which works in all
contexts would be much better than growing stuff here and there with
different semantics and context usage.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-22 17:26           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context Borislav Petkov
@ 2014-07-22 21:24             ` Tony Luck
  0 siblings, 0 replies; 18+ messages in thread
From: Tony Luck @ 2014-07-22 21:24 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Chen, Gong, linux-acpi, linux-mm, x86

On Tue, Jul 22, 2014 at 10:26 AM, Borislav Petkov <bp@alien8.de> wrote:
> Once they've been eaten by something, we simply remove them from that
> buffer and that's it.

swap that for

Once everyone who registered a notifier as had a chance to see each
logged entry, we simply remove ...

I'm not a fan of the current NOTIFY_STOP behavior where one registrant
can say they are so important that nobody else should be allowed to see
what was logged.

-Tony

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context
  2014-07-21 17:14     ` Luck, Tony
  2014-07-21 21:41       ` Borislav Petkov
@ 2014-07-23  7:48       ` Chen, Gong
  1 sibling, 0 replies; 18+ messages in thread
From: Chen, Gong @ 2014-07-23  7:48 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, linux-acpi, linux-mm, x86

[-- Attachment #1: Type: text/plain, Size: 1173 bytes --]

On Mon, Jul 21, 2014 at 05:14:06PM +0000, Luck, Tony wrote:
> We've evolved a bunch of mechanisms:
> 
> 1) mce_ring: to pass pfn for AO errors from MCE context to a work thread
> 2) mce_info: to pass pfn for AR errors from MCE context to same process running in process context
> 3) mce_log: to pass entire "mce" structures from any context (MCE, CMCI, or init-time) to /dev/mcelog
> 
> I was actually wondering about going in the other direction. Make the
> /dev/mcelog code register a notifier on x86_mce_decoder_chain (and
> perhaps move all the /dev/mcelog functions out of mce.c into an actual
> driver file).  Then use Chen Gong's NMI safe code to just unconditionally
> make safe copies of anything that gets passed to mce_log() and run all
> the notifiers from his do_mce_irqwork().
> 
> -Tony

OK, I can cook some patches based on Tony's suggestion:
patch 1: add a generic lock-less memory pool to save error records
patch 2: remove mce_info (Tony has done a draft)
patch 3: remove mce_ring
patch 4: remove mce log buffer
patch 5: move all mce log related logic into a separate file lke mcelog.c
         under the same directory with mce.c

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-07-23  8:19 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-16  2:34 Some RAS bug fix patches Chen, Gong
2014-07-16  2:34 ` Chen, Gong
2014-07-16  2:34 ` [PATCH 1/3] APEI, GHES: Cleanup unnecessary function for lock-less list Chen, Gong
2014-07-20  8:01   ` Borislav Petkov
2014-07-16  2:34 ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
2014-07-21  8:47   ` Borislav Petkov
2014-07-21 17:14     ` Luck, Tony
2014-07-21 21:41       ` Borislav Petkov
2014-07-21 22:03         ` Luck, Tony
2014-07-21 22:44           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE Tony Luck
2014-07-22 17:20             ` Borislav Petkov
2014-07-22 17:26           ` [RFC PATCH 2/3] x86, MCE: Avoid potential deadlock in MCE context Borislav Petkov
2014-07-22 21:24             ` Tony Luck
2014-07-23  7:48       ` Chen, Gong
2014-07-16  2:34 ` [PATCH 3/3] RAS, HWPOISON: Fix wrong error recovery status Chen, Gong
2014-07-16 19:57   ` Naoya Horiguchi
2014-07-16 19:57     ` Naoya Horiguchi
2014-07-19  8:05 ` Some RAS bug fix patches Chen, Gong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.