Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-05-20 19:35 ` [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
@ 2015-05-20  9:28   ` Borislav Petkov
  2015-05-22 21:12     ` Chen, Gong
  0 siblings, 1 reply; 15+ messages in thread
From: Borislav Petkov @ 2015-05-20  9:28 UTC (permalink / raw)
  To: Chen, Gong, tony.luck; +Cc: linux-kernel

On Wed, May 20, 2015 at 03:35:38PM -0400, Chen, Gong wrote:
> Printing in MCE context is a no-no, currently, as printk is not
> NMI-safe. If some of the notifiers on the MCE chain call *printk*, we
> may deadlock. In order to avoid that, delay printk into process context
> to fix it.
> 
> Background info at: https://lkml.org/lkml/2014/6/27/26
> 
> Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
> Link: http://lkml.kernel.org/r/1406797523-28710-6-git-send-email-gong.chen@linux.intel.com
> [ Boris: rewrite a bit. ]
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/include/asm/mce.h               | 1 +
>  arch/x86/kernel/cpu/mcheck/mce-apei.c    | 2 +-
>  arch/x86/kernel/cpu/mcheck/mce.c         | 8 ++++++--
>  arch/x86/kernel/cpu/mcheck/mce_intel.c   | 1 -
>  arch/x86/kernel/cpu/mcheck/therm_throt.c | 1 +
>  arch/x86/kernel/cpu/mcheck/threshold.c   | 1 +
>  6 files changed, 10 insertions(+), 4 deletions(-)

....

> diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> index 1af51b1586d7..2733f275237d 100644
> --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -427,6 +427,7 @@ static inline void __smp_thermal_interrupt(void)
>  {
>  	inc_irq_stat(irq_thermal_count);
>  	smp_thermal_vector();
> +	mce_queue_irq_work();

Hmm, at a second glance, this looks wrong. I think we should do that
call in intel_thermal_interrupt().

>  asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
> diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
> index 7245980186ee..d695faa234eb 100644
> --- a/arch/x86/kernel/cpu/mcheck/threshold.c
> +++ b/arch/x86/kernel/cpu/mcheck/threshold.c
> @@ -22,6 +22,7 @@ static inline void __smp_threshold_interrupt(void)
>  {
>  	inc_irq_stat(irq_threshold_count);
>  	mce_threshold_vector();
> +	mce_queue_irq_work();

Same here.

mce_queue_irq_work() call should be issued in both AMD and Intel
threshold handlers but not in the generic one which is unlikely to queue
any MCE...

Right?


-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record
  2015-05-20 19:35 ` [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
@ 2015-05-20 10:36   ` Borislav Petkov
  2015-05-22 21:06     ` Chen, Gong
  0 siblings, 1 reply; 15+ messages in thread
From: Borislav Petkov @ 2015-05-20 10:36 UTC (permalink / raw)
  To: Chen, Gong; +Cc: linux-kernel, tony.luck

On Wed, May 20, 2015 at 03:35:35PM -0400, Chen, Gong wrote:
> printk is not safe to use in MCE context. Add a lockless memory
> allocator pool to save error records in MCE context. Issual of those
> records will be delayed to a context safe to do printk. This idea is
> inspired by APEI/GHES driver.
> 
> We're very conservative and allocate only two pages for it but since
> we're going to use those pages throughout the system's lifetime, we
> allocate them statically to avoid early boot time allocation woes.
> 
> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
> Link: http://lkml.kernel.org/r/1407830375-11087-1-git-send-email-gong.chen@linux.intel.com
> [Boris: rewrite. ]
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/Kconfig                          |   1 +
>  arch/x86/include/uapi/asm/mce.h           |   3 +-
>  arch/x86/kernel/cpu/mcheck/Makefile       |   2 +-
>  arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 102 ++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |  12 ++++
>  arch/x86/kernel/cpu/mcheck/mce.c          |   8 ++-
>  6 files changed, 125 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/mcheck/mce-genpool.c

Applied, thanks...

> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index e535533d5ab8..ba91777a7ad8 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -115,7 +115,7 @@ static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
>   * CPU/chipset specific EDAC code can register a notifier call here to print
>   * MCE errors in a human-readable form.
>   */
> -static ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
> +ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
>  
>  /* Do initial initialization of a struct mce */
>  void mce_setup(struct mce *m)
> @@ -1688,6 +1688,12 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
>  	if (mca_cfg.disabled)
>  		return;
>  
> +	if (mce_genpool_init()) {
> +		mca_cfg.disabled = true;
> +		pr_emerg("Couldn't allocate MCE records pool!\n");
> +		return;
> +	}
> +
>  	if (__mcheck_cpu_ancient_init(c))
>  		return;

... and moved this pool initialization right before we assign
machine_check_vector so that we don't do it unnecessarily if we return
earlier due to missing MCA features/MCA not enabled.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* MCE ring buffer management (Rebase)
@ 2015-05-20 19:35 Chen, Gong
  2015-05-20 19:35 ` [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Chen, Gong @ 2015-05-20 19:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: tony.luck

[PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error
[PATCH 2/4 Rebase] x86, MCE: Don't use percpu for MCE workqueue/irq_work
[PATCH 3/4 Rebase] x86, MCE: Remove mce_ring for SRAO error
[PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context

We have too many rings for different H/W error events management.
All of them can be merged into one kind of unified mechanism.
Furthermore, this management mechanism should be reliable enough
even in MCE context to avoid deadlock like calling printk in MCE
context. This patch series is used for this purpose.

P.S. One of patch has been removed since last commit because it
has been merged already.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record
  2015-05-20 19:35 MCE ring buffer management (Rebase) Chen, Gong
@ 2015-05-20 19:35 ` Chen, Gong
  2015-05-20 10:36   ` Borislav Petkov
  2015-05-20 19:35 ` [PATCH 2/4 Rebase] x86, MCE: Don't use percpu for MCE workqueue/irq_work Chen, Gong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Chen, Gong @ 2015-05-20 19:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: tony.luck, Chen, Gong, Borislav Petkov

printk is not safe to use in MCE context. Add a lockless memory
allocator pool to save error records in MCE context. Issual of those
records will be delayed to a context safe to do printk. This idea is
inspired by APEI/GHES driver.

We're very conservative and allocate only two pages for it but since
we're going to use those pages throughout the system's lifetime, we
allocate them statically to avoid early boot time allocation woes.

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1407830375-11087-1-git-send-email-gong.chen@linux.intel.com
[Boris: rewrite. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/uapi/asm/mce.h           |   3 +-
 arch/x86/kernel/cpu/mcheck/Makefile       |   2 +-
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 102 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  12 ++++
 arch/x86/kernel/cpu/mcheck/mce.c          |   8 ++-
 6 files changed, 125 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mcheck/mce-genpool.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d43e7e1c784b..00c3395ed1c6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -945,6 +945,7 @@ config X86_REROUTE_FOR_BROKEN_BOOT_IRQS
 
 config X86_MCE
 	bool "Machine Check / overheating reporting"
+	select GENERIC_ALLOCATOR
 	default y
 	---help---
 	  Machine Check support allows the processor to notify the
diff --git a/arch/x86/include/uapi/asm/mce.h b/arch/x86/include/uapi/asm/mce.h
index a0eab85ce7b8..577acacaf515 100644
--- a/arch/x86/include/uapi/asm/mce.h
+++ b/arch/x86/include/uapi/asm/mce.h
@@ -15,7 +15,8 @@ struct mce {
 	__u64 time;	/* wall time_t when error was detected */
 	__u8  cpuvendor;	/* cpu vendor as encoded in system.h */
 	__u8  inject_flags;	/* software inject flags */
-	__u16  pad;
+	__u8  severity;
+	__u8  usable_addr;	/* addr is usable */
 	__u32 cpuid;	/* CPUID 1 EAX */
 	__u8  cs;		/* code segment */
 	__u8  bank;	/* machine check bank */
diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
index bb34b03af252..a3311c886194 100644
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ b/arch/x86/kernel/cpu/mcheck/Makefile
@@ -1,4 +1,4 @@
-obj-y				=  mce.o mce-severity.o
+obj-y				=  mce.o mce-severity.o mce-genpool.o
 
 obj-$(CONFIG_X86_ANCIENT_MCE)	+= winchip.o p5.o
 obj-$(CONFIG_X86_MCE_INTEL)	+= mce_intel.o
diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
new file mode 100644
index 000000000000..742e187ace30
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -0,0 +1,102 @@
+/*
+ * MCE event pool management in MCE context
+ *
+ * Copyright (C) 2015 Intel Corp.
+ * Author: Chen, Gong <gong.chen@linux.intel.com>
+ *
+ * This file is licensed under GPLv2.
+ */
+#include <linux/smp.h>
+#include <linux/mm.h>
+#include <linux/genalloc.h>
+#include <linux/llist.h>
+#include "mce-internal.h"
+
+/*
+ * printk is not safe in MCE context thus a lock-less memory allocator
+ * (genpool) is used to save error information organized in a lock-less
+ * list.
+ */
+
+#define GENPOOL_BUFSZ	(2 * PAGE_SIZE)
+
+static struct gen_pool *mce_evt_pool;
+static LLIST_HEAD(mce_event_llist);
+static char genpool_buf[GENPOOL_BUFSZ];
+
+/*
+ * This memory pool is only to be used to save MCE records in MCE context.
+ * Since MCE events are rare so a fixed size memory pool should be enough.
+ * Keep 2 pages to save MCE events for now (~80 MCE records at most).
+ */
+static int mce_genpool_create(void)
+{
+	struct gen_pool *tmpp;
+	int ret = -ENOMEM;
+
+	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
+	if (!tmpp)
+		goto out;
+
+	ret = gen_pool_add(tmpp, (unsigned long)genpool_buf, GENPOOL_BUFSZ, -1);
+	if (ret) {
+		gen_pool_destroy(tmpp);
+		goto out;
+	}
+
+	mce_evt_pool = tmpp;
+
+out:
+	return ret;
+}
+
+void mce_genpool_process(void)
+{
+	struct llist_node *head;
+	struct mce_evt_llist *node;
+	struct mce *mce;
+
+	head = llist_del_all(&mce_event_llist);
+	if (!head)
+		return;
+
+	head = llist_reverse_order(head);
+	llist_for_each_entry(node, head, llnode) {
+		mce = &node->mce;
+		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+		gen_pool_free(mce_evt_pool, (unsigned long)node, sizeof(*node));
+	}
+}
+
+bool mce_genpool_empty(void)
+{
+	return llist_empty(&mce_event_llist);
+}
+
+bool mce_genpool_add(struct mce *mce)
+{
+	struct mce_evt_llist *node;
+
+	if (!mce_evt_pool)
+		return false;
+
+	node = (void *)gen_pool_alloc(mce_evt_pool, sizeof(*node));
+	if (!node) {
+		pr_warn_ratelimited("MCE records pool is overflowed!\n");
+		return false;
+	}
+
+	memcpy(&node->mce, mce, sizeof(*mce));
+	llist_add(&node->llnode, &mce_event_llist);
+
+	return true;
+}
+
+int mce_genpool_init(void)
+{
+	/* Just init mce_genpool once. */
+	if (mce_evt_pool)
+		return 0;
+
+	return mce_genpool_create();
+}
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index fe32074b865b..70d43f541bed 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -13,6 +13,8 @@ enum severity_level {
 	MCE_PANIC_SEVERITY,
 };
 
+extern struct atomic_notifier_head x86_mce_decoder_chain;
+
 #define ATTR_LEN		16
 #define INITIAL_CHECK_INTERVAL	5 * 60 /* 5 minutes */
 
@@ -24,6 +26,16 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
+struct mce_evt_llist {
+	struct llist_node llnode;
+	struct mce mce;
+};
+
+void mce_genpool_process(void);
+bool mce_genpool_empty(void);
+bool mce_genpool_add(struct mce *mce);
+int mce_genpool_init(void);
+
 extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
 struct dentry *mce_get_debugfs_dir(void);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e535533d5ab8..ba91777a7ad8 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -115,7 +115,7 @@ static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
  */
-static ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
+ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 
 /* Do initial initialization of a struct mce */
 void mce_setup(struct mce *m)
@@ -1688,6 +1688,12 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 	if (mca_cfg.disabled)
 		return;
 
+	if (mce_genpool_init()) {
+		mca_cfg.disabled = true;
+		pr_emerg("Couldn't allocate MCE records pool!\n");
+		return;
+	}
+
 	if (__mcheck_cpu_ancient_init(c))
 		return;
 
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/4 Rebase] x86, MCE: Don't use percpu for MCE workqueue/irq_work
  2015-05-20 19:35 MCE ring buffer management (Rebase) Chen, Gong
  2015-05-20 19:35 ` [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
@ 2015-05-20 19:35 ` Chen, Gong
  2015-05-20 19:35 ` [PATCH 3/4 Rebase] x86, MCE: Remove mce_ring for SRAO error Chen, Gong
  2015-05-20 19:35 ` [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
  3 siblings, 0 replies; 15+ messages in thread
From: Chen, Gong @ 2015-05-20 19:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: tony.luck, Chen, Gong, Borislav Petkov

An MCE is considered a rare event. Therefore, there's no need to have
per-CPU instances of both normal and IRQ workqueues. Make them both
global.

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406797523-28710-3-git-send-email-gong.chen@linux.intel.com
[Boris: massage commit message]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ba91777a7ad8..6c064245f802 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -107,7 +107,8 @@ DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
  */
 mce_banks_t mce_banks_ce_disabled;
 
-static DEFINE_PER_CPU(struct work_struct, mce_work);
+static struct work_struct mce_work;
+static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
@@ -523,11 +524,9 @@ int mce_available(struct cpuinfo_x86 *c)
 static void mce_schedule_work(void)
 {
 	if (!mce_ring_empty())
-		schedule_work(this_cpu_ptr(&mce_work));
+		schedule_work(&mce_work);
 }
 
-static DEFINE_PER_CPU(struct irq_work, mce_irq_work);
-
 static void mce_irq_work_cb(struct irq_work *entry)
 {
 	mce_notify_irq();
@@ -548,7 +547,7 @@ static void mce_report_event(struct pt_regs *regs)
 		return;
 	}
 
-	irq_work_queue(this_cpu_ptr(&mce_irq_work));
+	irq_work_queue(&mce_irq_work);
 }
 
 /*
@@ -1710,8 +1709,8 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 	__mcheck_cpu_init_generic();
 	__mcheck_cpu_init_vendor(c);
 	__mcheck_cpu_init_timer();
-	INIT_WORK(this_cpu_ptr(&mce_work), mce_process_work);
-	init_irq_work(this_cpu_ptr(&mce_irq_work), &mce_irq_work_cb);
+	INIT_WORK(&mce_work, mce_process_work);
+	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 }
 
 /*
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/4 Rebase] x86, MCE: Remove mce_ring for SRAO error
  2015-05-20 19:35 MCE ring buffer management (Rebase) Chen, Gong
  2015-05-20 19:35 ` [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
  2015-05-20 19:35 ` [PATCH 2/4 Rebase] x86, MCE: Don't use percpu for MCE workqueue/irq_work Chen, Gong
@ 2015-05-20 19:35 ` Chen, Gong
  2015-05-20 19:35 ` [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
  3 siblings, 0 replies; 15+ messages in thread
From: Chen, Gong @ 2015-05-20 19:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: tony.luck, Chen, Gong, Borislav Petkov

Use unified genpool to save SRAO error events and put AO error
handling in the same notification chain with mce error decoding.

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406797523-28710-4-git-send-email-gong.chen@linux.intel.com
[ Boris: correct a lot. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h       |   2 +-
 arch/x86/kernel/cpu/mcheck/mce.c | 115 ++++++++++++++-------------------------
 drivers/acpi/acpi_extlog.c       |   2 +-
 drivers/edac/i7core_edac.c       |   2 +-
 drivers/edac/mce_amd.c           |   2 +-
 drivers/edac/sb_edac.c           |   2 +-
 6 files changed, 47 insertions(+), 78 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d518db..d16f983f46f5 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -123,7 +123,7 @@ struct mce_vendor_flags {
 extern struct mce_vendor_flags mce_flags;
 
 extern struct mca_config mca_cfg;
-extern void mce_register_decode_chain(struct notifier_block *nb);
+extern void mce_register_decode_chain(struct notifier_block *nb, bool drain);
 extern void mce_unregister_decode_chain(struct notifier_block *nb);
 
 #include <linux/percpu.h>
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6c064245f802..b369b5fcda1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -111,6 +111,7 @@ static struct work_struct mce_work;
 static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
+static int mce_usable_address(struct mce *m);
 
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
@@ -231,11 +232,18 @@ static void drain_mcelog_buffer(void)
 	} while (next != prev);
 }
 
+static struct notifier_block mce_srao_nb;
 
-void mce_register_decode_chain(struct notifier_block *nb)
+void mce_register_decode_chain(struct notifier_block *nb, bool drain)
 {
+	/* Ensure SRAO notifier has the highest priority in the decode chain. */
+	if (nb != &mce_srao_nb && nb->priority == INT_MAX)
+		nb->priority -= 1;
+
 	atomic_notifier_chain_register(&x86_mce_decoder_chain, nb);
-	drain_mcelog_buffer();
+
+	if (drain)
+		drain_mcelog_buffer();
 }
 EXPORT_SYMBOL_GPL(mce_register_decode_chain);
 
@@ -459,61 +467,6 @@ static inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
 	}
 }
 
-/*
- * Simple lockless ring to communicate PFNs from the exception handler with the
- * process context work function. This is vastly simplified because there's
- * only a single reader and a single writer.
- */
-#define MCE_RING_SIZE 16	/* we use one entry less */
-
-struct mce_ring {
-	unsigned short start;
-	unsigned short end;
-	unsigned long ring[MCE_RING_SIZE];
-};
-static DEFINE_PER_CPU(struct mce_ring, mce_ring);
-
-/* Runs with CPU affinity in workqueue */
-static int mce_ring_empty(void)
-{
-	struct mce_ring *r = this_cpu_ptr(&mce_ring);
-
-	return r->start == r->end;
-}
-
-static int mce_ring_get(unsigned long *pfn)
-{
-	struct mce_ring *r;
-	int ret = 0;
-
-	*pfn = 0;
-	get_cpu();
-	r = this_cpu_ptr(&mce_ring);
-	if (r->start == r->end)
-		goto out;
-	*pfn = r->ring[r->start];
-	r->start = (r->start + 1) % MCE_RING_SIZE;
-	ret = 1;
-out:
-	put_cpu();
-	return ret;
-}
-
-/* Always runs in MCE context with preempt off */
-static int mce_ring_add(unsigned long pfn)
-{
-	struct mce_ring *r = this_cpu_ptr(&mce_ring);
-	unsigned next;
-
-	next = (r->end + 1) % MCE_RING_SIZE;
-	if (next == r->start)
-		return -1;
-	r->ring[r->end] = pfn;
-	wmb();
-	r->end = next;
-	return 0;
-}
-
 int mce_available(struct cpuinfo_x86 *c)
 {
 	if (mca_cfg.disabled)
@@ -523,7 +476,7 @@ int mce_available(struct cpuinfo_x86 *c)
 
 static void mce_schedule_work(void)
 {
-	if (!mce_ring_empty())
+	if (!mce_genpool_empty())
 		schedule_work(&mce_work);
 }
 
@@ -550,6 +503,27 @@ static void mce_report_event(struct pt_regs *regs)
 	irq_work_queue(&mce_irq_work);
 }
 
+static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
+				void *data)
+{
+	struct mce *mce = (struct mce *)data;
+	unsigned long pfn;
+
+	if (!mce)
+		return NOTIFY_DONE;
+
+	if (mce->usable_addr && (mce->severity == MCE_AO_SEVERITY)) {
+		pfn = mce->addr >> PAGE_SHIFT;
+		memory_failure(pfn, MCE_VECTOR, 0);
+	}
+
+	return NOTIFY_OK;
+}
+static struct notifier_block mce_srao_nb = {
+	.notifier_call	= srao_decode_notifier,
+	.priority = INT_MAX,
+};
+
 /*
  * Read ADDR and MISC registers.
  */
@@ -668,7 +642,9 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		 */
 		if (severity == MCE_DEFERRED_SEVERITY && memory_error(&m)) {
 			if (m.status & MCI_STATUS_ADDRV) {
-				mce_ring_add(m.addr >> PAGE_SHIFT);
+				m.severity = severity;
+				m.usable_addr = mce_usable_address(&m);
+				mce_genpool_add(&m);
 				mce_schedule_work();
 			}
 		}
@@ -1126,15 +1102,10 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 		mce_read_aux(&m, i);
 
-		/*
-		 * Action optional error. Queue address for later processing.
-		 * When the ring overflows we just ignore the AO error.
-		 * RED-PEN add some logging mechanism when
-		 * usable_address or mce_add_ring fails.
-		 * RED-PEN don't ignore overflow for mca_cfg.tolerant == 0
-		 */
-		if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
-			mce_ring_add(m.addr >> PAGE_SHIFT);
+		/* assuming valid severity level != 0 */
+		m.severity = severity;
+		m.usable_addr = mce_usable_address(&m);
+		mce_genpool_add(&m);
 
 		mce_log(&m);
 
@@ -1220,14 +1191,11 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 /*
  * Action optional processing happens here (picking up
  * from the list of faulting pages that do_machine_check()
- * placed into the "ring").
+ * placed into the genpool).
  */
 static void mce_process_work(struct work_struct *dummy)
 {
-	unsigned long pfn;
-
-	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR, 0);
+	mce_genpool_process();
 }
 
 #ifdef CONFIG_X86_MCE_INTEL
@@ -2029,6 +1997,7 @@ __setup("mce", mcheck_enable);
 int __init mcheck_init(void)
 {
 	mcheck_intel_therm_init();
+	mce_register_decode_chain(&mce_srao_nb, false);
 	mcheck_vendor_init_severity();
 
 	return 0;
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index b3842ffc19ba..07e012e74c1b 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -286,7 +286,7 @@ static int __init extlog_init(void)
 	 */
 	old_edac_report_status = get_edac_report_status();
 	set_edac_report_status(EDAC_REPORTING_DISABLED);
-	mce_register_decode_chain(&extlog_mce_dec);
+	mce_register_decode_chain(&extlog_mce_dec, true);
 	/* enable OS to be involved to take over management from BIOS */
 	((struct extlog_l1_head *)extlog_l1_addr)->flags |= FLAG_OS_OPTIN;
 
diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 01087a38da22..13d77f4a892c 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -2424,7 +2424,7 @@ static int __init i7core_init(void)
 	pci_rc = pci_register_driver(&i7core_driver);
 
 	if (pci_rc >= 0) {
-		mce_register_decode_chain(&i7_mce_dec);
+		mce_register_decode_chain(&i7_mce_dec, true);
 		return 0;
 	}
 
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 58586d59bf8e..aca31a237073 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -895,7 +895,7 @@ static int __init mce_amd_init(void)
 
 	pr_info("MCE: In-kernel MCE decoding enabled.\n");
 
-	mce_register_decode_chain(&amd_mce_dec_nb);
+	mce_register_decode_chain(&amd_mce_dec_nb, true);
 
 	return 0;
 }
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 1acf57ba4c86..494a23e4f7e6 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -2556,7 +2556,7 @@ static int __init sbridge_init(void)
 
 	pci_rc = pci_register_driver(&sbridge_driver);
 	if (pci_rc >= 0) {
-		mce_register_decode_chain(&sbridge_mce_dec);
+		mce_register_decode_chain(&sbridge_mce_dec, true);
 		if (get_edac_report_status() == EDAC_REPORTING_DISABLED)
 			sbridge_printk(KERN_WARNING, "Loading driver, error reporting disabled.\n");
 		return 0;
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-05-20 19:35 MCE ring buffer management (Rebase) Chen, Gong
                   ` (2 preceding siblings ...)
  2015-05-20 19:35 ` [PATCH 3/4 Rebase] x86, MCE: Remove mce_ring for SRAO error Chen, Gong
@ 2015-05-20 19:35 ` Chen, Gong
  2015-05-20  9:28   ` Borislav Petkov
  3 siblings, 1 reply; 15+ messages in thread
From: Chen, Gong @ 2015-05-20 19:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: tony.luck, Chen, Gong, Borislav Petkov

Printing in MCE context is a no-no, currently, as printk is not
NMI-safe. If some of the notifiers on the MCE chain call *printk*, we
may deadlock. In order to avoid that, delay printk into process context
to fix it.

Background info at: https://lkml.org/lkml/2014/6/27/26

Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406797523-28710-6-git-send-email-gong.chen@linux.intel.com
[ Boris: rewrite a bit. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h               | 1 +
 arch/x86/kernel/cpu/mcheck/mce-apei.c    | 2 +-
 arch/x86/kernel/cpu/mcheck/mce.c         | 8 ++++++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c   | 1 -
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 1 +
 arch/x86/kernel/cpu/mcheck/threshold.c   | 1 +
 6 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index d16f983f46f5..781432dd8123 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -198,6 +198,7 @@ enum mcp_flags {
 bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
+void mce_queue_irq_work(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index a1aef9533154..380e3ac8fb62 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -57,7 +57,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
-	mce_notify_irq();
+	mce_queue_irq_work();
 }
 EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index b369b5fcda1d..a3be97961e22 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -156,7 +156,7 @@ void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+	mce_genpool_add(mce);
 
 	mce->finished = 0;
 	wmb();
@@ -486,6 +486,11 @@ static void mce_irq_work_cb(struct irq_work *entry)
 	mce_schedule_work();
 }
 
+void mce_queue_irq_work(void)
+{
+	irq_work_queue(&mce_irq_work);
+}
+
 static void mce_report_event(struct pt_regs *regs)
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
@@ -1105,7 +1110,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		/* assuming valid severity level != 0 */
 		m.severity = severity;
 		m.usable_addr = mce_usable_address(&m);
-		mce_genpool_add(&m);
 
 		mce_log(&m);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b4a41cf030ed..caf6b7e25768 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -216,7 +216,6 @@ static void intel_threshold_interrupt(void)
 		return;
 
 	machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
-	mce_notify_irq();
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 1af51b1586d7..2733f275237d 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -427,6 +427,7 @@ static inline void __smp_thermal_interrupt(void)
 {
 	inc_irq_stat(irq_thermal_count);
 	smp_thermal_vector();
+	mce_queue_irq_work();
 }
 
 asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index 7245980186ee..d695faa234eb 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -22,6 +22,7 @@ static inline void __smp_threshold_interrupt(void)
 {
 	inc_irq_stat(irq_threshold_count);
 	mce_threshold_vector();
+	mce_queue_irq_work();
 }
 
 asmlinkage __visible void smp_threshold_interrupt(void)
-- 
2.3.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record
  2015-05-22 21:06     ` Chen, Gong
@ 2015-05-22  8:17       ` Borislav Petkov
  0 siblings, 0 replies; 15+ messages in thread
From: Borislav Petkov @ 2015-05-22  8:17 UTC (permalink / raw)
  To: Chen, Gong; +Cc: linux-kernel, tony.luck

On Fri, May 22, 2015 at 05:06:55PM -0400, Chen, Gong wrote:
> > ... and moved this pool initialization right before we assign
> > machine_check_vector so that we don't do it unnecessarily if we return
> > earlier due to missing MCA features/MCA not enabled.
> > 
> IIRC, I don't need to post a new patch again, right?

Nah, I changed it when applying.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-05-22 21:12     ` Chen, Gong
@ 2015-05-22  9:09       ` Borislav Petkov
  2015-06-08 13:41         ` Borislav Petkov
  0 siblings, 1 reply; 15+ messages in thread
From: Borislav Petkov @ 2015-05-22  9:09 UTC (permalink / raw)
  To: Chen, Gong; +Cc: tony.luck, linux-kernel

On Fri, May 22, 2015 at 05:12:47PM -0400, Chen, Gong wrote:
> Since AMD doesn't queue any MCE,

Of course it does - amd_threshold_interrupt() and the newly added
amd_deferred_error_interrupt() both call __log_error() which does
mce_log().

See latest tip/master.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record
  2015-05-20 10:36   ` Borislav Petkov
@ 2015-05-22 21:06     ` Chen, Gong
  2015-05-22  8:17       ` Borislav Petkov
  0 siblings, 1 reply; 15+ messages in thread
From: Chen, Gong @ 2015-05-22 21:06 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: linux-kernel, tony.luck

[-- Attachment #1: Type: text/plain, Size: 2968 bytes --]

On Wed, May 20, 2015 at 12:36:10PM +0200, Borislav Petkov wrote:
> Date: Wed, 20 May 2015 12:36:10 +0200
> From: Borislav Petkov <bp@suse.de>
> To: "Chen, Gong" <gong.chen@linux.intel.com>
> Cc: linux-kernel@vger.kernel.org, tony.luck@intel.com
> Subject: Re: [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool
>  to save error record
> User-Agent: Mutt/1.5.23 (2014-03-12)
> 
> On Wed, May 20, 2015 at 03:35:35PM -0400, Chen, Gong wrote:
> > printk is not safe to use in MCE context. Add a lockless memory
> > allocator pool to save error records in MCE context. Issual of those
> > records will be delayed to a context safe to do printk. This idea is
> > inspired by APEI/GHES driver.
> > 
> > We're very conservative and allocate only two pages for it but since
> > we're going to use those pages throughout the system's lifetime, we
> > allocate them statically to avoid early boot time allocation woes.
> > 
> > Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
> > Link: http://lkml.kernel.org/r/1407830375-11087-1-git-send-email-gong.chen@linux.intel.com
> > [Boris: rewrite. ]
> > Signed-off-by: Borislav Petkov <bp@suse.de>
> > ---
> >  arch/x86/Kconfig                          |   1 +
> >  arch/x86/include/uapi/asm/mce.h           |   3 +-
> >  arch/x86/kernel/cpu/mcheck/Makefile       |   2 +-
> >  arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 102 ++++++++++++++++++++++++++++++
> >  arch/x86/kernel/cpu/mcheck/mce-internal.h |  12 ++++
> >  arch/x86/kernel/cpu/mcheck/mce.c          |   8 ++-
> >  6 files changed, 125 insertions(+), 3 deletions(-)
> >  create mode 100644 arch/x86/kernel/cpu/mcheck/mce-genpool.c
> 
> Applied, thanks...
> 
> > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> > index e535533d5ab8..ba91777a7ad8 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -115,7 +115,7 @@ static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
> >   * CPU/chipset specific EDAC code can register a notifier call here to print
> >   * MCE errors in a human-readable form.
> >   */
> > -static ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
> > +ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
> >  
> >  /* Do initial initialization of a struct mce */
> >  void mce_setup(struct mce *m)
> > @@ -1688,6 +1688,12 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
> >  	if (mca_cfg.disabled)
> >  		return;
> >  
> > +	if (mce_genpool_init()) {
> > +		mca_cfg.disabled = true;
> > +		pr_emerg("Couldn't allocate MCE records pool!\n");
> > +		return;
> > +	}
> > +
> >  	if (__mcheck_cpu_ancient_init(c))
> >  		return;
> 
> ... and moved this pool initialization right before we assign
> machine_check_vector so that we don't do it unnecessarily if we return
> earlier due to missing MCA features/MCA not enabled.
> 
IIRC, I don't need to post a new patch again, right?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-05-20  9:28   ` Borislav Petkov
@ 2015-05-22 21:12     ` Chen, Gong
  2015-05-22  9:09       ` Borislav Petkov
  0 siblings, 1 reply; 15+ messages in thread
From: Chen, Gong @ 2015-05-22 21:12 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: tony.luck, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6649 bytes --]

On Wed, May 20, 2015 at 11:28:00AM +0200, Borislav Petkov wrote:
> Date: Wed, 20 May 2015 11:28:00 +0200
> From: Borislav Petkov <bp@suse.de>
> To: "Chen, Gong" <gong.chen@linux.intel.com>, tony.luck@intel.com
> Cc: linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE
>  context
> User-Agent: Mutt/1.5.23 (2014-03-12)
> 
> On Wed, May 20, 2015 at 03:35:38PM -0400, Chen, Gong wrote:
> > Printing in MCE context is a no-no, currently, as printk is not
> > NMI-safe. If some of the notifiers on the MCE chain call *printk*, we
> > may deadlock. In order to avoid that, delay printk into process context
> > to fix it.
> > 
> > Background info at: https://lkml.org/lkml/2014/6/27/26
> > 
> > Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
> > Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
> > Link: http://lkml.kernel.org/r/1406797523-28710-6-git-send-email-gong.chen@linux.intel.com
> > [ Boris: rewrite a bit. ]
> > Signed-off-by: Borislav Petkov <bp@suse.de>
> > ---
> >  arch/x86/include/asm/mce.h               | 1 +
> >  arch/x86/kernel/cpu/mcheck/mce-apei.c    | 2 +-
> >  arch/x86/kernel/cpu/mcheck/mce.c         | 8 ++++++--
> >  arch/x86/kernel/cpu/mcheck/mce_intel.c   | 1 -
> >  arch/x86/kernel/cpu/mcheck/therm_throt.c | 1 +
> >  arch/x86/kernel/cpu/mcheck/threshold.c   | 1 +
> >  6 files changed, 10 insertions(+), 4 deletions(-)
> 
> ....
> 
> > diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> > index 1af51b1586d7..2733f275237d 100644
> > --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> > +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> > @@ -427,6 +427,7 @@ static inline void __smp_thermal_interrupt(void)
> >  {
> >  	inc_irq_stat(irq_thermal_count);
> >  	smp_thermal_vector();
> > +	mce_queue_irq_work();
> 
> Hmm, at a second glance, this looks wrong. I think we should do that
> call in intel_thermal_interrupt().
> 
> >  asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
> > diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
> > index 7245980186ee..d695faa234eb 100644
> > --- a/arch/x86/kernel/cpu/mcheck/threshold.c
> > +++ b/arch/x86/kernel/cpu/mcheck/threshold.c
> > @@ -22,6 +22,7 @@ static inline void __smp_threshold_interrupt(void)
> >  {
> >  	inc_irq_stat(irq_threshold_count);
> >  	mce_threshold_vector();
> > +	mce_queue_irq_work();
> 
> Same here.
> 
> mce_queue_irq_work() call should be issued in both AMD and Intel
> threshold handlers but not in the generic one which is unlikely to queue
> any MCE...
> 
> Right?
> 
Since AMD doesn't queue any MCE, it should definitely only cover Intel.
Please check out following patch:

----8<----
From: "Chen, Gong" <gong.chen@linux.intel.com>
Date: Fri, 22 May 2015 17:02:50 -0400
Subject: [PATCH 4/4 rebase v2] x86, MCE: Avoid potential deadlock in MCE context

Printing in MCE context is a no-no, currently, as printk is not
NMI-safe. If some of the notifiers on the MCE chain call *printk*, we
may deadlock. In order to avoid that, delay printk into process context
to fix it.

Background info at: https://lkml.org/lkml/2014/6/27/26

v2 -> v1: move mce_queue_irq_work call into Intel specific logic

Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406797523-28710-6-git-send-email-gong.chen@linux.intel.com
[ Boris: rewrite a bit. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h               | 1 +
 arch/x86/kernel/cpu/mcheck/mce-apei.c    | 2 +-
 arch/x86/kernel/cpu/mcheck/mce.c         | 8 ++++++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c   | 2 +-
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 2 ++
 5 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index d16f983f46f5..781432dd8123 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -198,6 +198,7 @@ enum mcp_flags {
 bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
+void mce_queue_irq_work(void);
 
 DECLARE_PER_CPU(struct mce, injectm);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index a1aef9533154..380e3ac8fb62 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -57,7 +57,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
-	mce_notify_irq();
+	mce_queue_irq_work();
 }
 EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index b369b5fcda1d..a3be97961e22 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -156,7 +156,7 @@ void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+	mce_genpool_add(mce);
 
 	mce->finished = 0;
 	wmb();
@@ -486,6 +486,11 @@ static void mce_irq_work_cb(struct irq_work *entry)
 	mce_schedule_work();
 }
 
+void mce_queue_irq_work(void)
+{
+	irq_work_queue(&mce_irq_work);
+}
+
 static void mce_report_event(struct pt_regs *regs)
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
@@ -1105,7 +1110,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		/* assuming valid severity level != 0 */
 		m.severity = severity;
 		m.usable_addr = mce_usable_address(&m);
-		mce_genpool_add(&m);
 
 		mce_log(&m);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b4a41cf030ed..3392002e2911 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -216,7 +216,7 @@ static void intel_threshold_interrupt(void)
 		return;
 
 	machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
-	mce_notify_irq();
+	mce_queue_irq_work();
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 1af51b1586d7..809a1e8f7fbd 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -413,6 +413,8 @@ static void intel_thermal_interrupt(void)
 					POWER_LIMIT_EVENT,
 					PACKAGE_LEVEL);
 	}
+
+	mce_queue_irq_work();
 }
 
 static void unexpected_thermal_interrupt(void)
-- 
2.3.2


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-05-22  9:09       ` Borislav Petkov
@ 2015-06-08 13:41         ` Borislav Petkov
  2015-06-08 20:03           ` Luck, Tony
  0 siblings, 1 reply; 15+ messages in thread
From: Borislav Petkov @ 2015-06-08 13:41 UTC (permalink / raw)
  To: Tony Luck; +Cc: Chen, Gong, linux-kernel

On Fri, May 22, 2015 at 11:09:41AM +0200, Borislav Petkov wrote:
> On Fri, May 22, 2015 at 05:12:47PM -0400, Chen, Gong wrote:
> > Since AMD doesn't queue any MCE,
> 
> Of course it does - amd_threshold_interrupt() and the newly added
> amd_deferred_error_interrupt() both call __log_error() which does
> mce_log().

So AFAINM, we want to do MCE work only after we've logged something to
the genpool. So we can do the much simplified thing below and kick the
workqueue from within mce_log() as everything that logs, calls that
function.

Tony, any concerns?

I haven't tested it yet but will do so once we've sorted out the MCE
injection stuff on AMD.

Thanks.

---
From: "Chen, Gong" <gong.chen@linux.intel.com>
Date: Wed, 20 May 2015 15:35:38 -0400
Subject: [PATCH] x86/mce: Avoid potential deadlock due to printk() in MCE
 context

Printing in MCE context is a no-no, currently, as printk is not
NMI-safe. If some of the notifiers on the MCE chain call do so, we may
deadlock. In order to avoid that, delay printk() to process context
where it is safe to do so.

Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1432150538-3120-5-git-send-email-gong.chen@linux.intel.com
[ Boris: kick irq_work in mce_log() directly. ]
Signed-off-by: 
---
 arch/x86/kernel/cpu/mcheck/mce-apei.c  | 1 -
 arch/x86/kernel/cpu/mcheck/mce.c       | 4 ++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c | 1 -
 3 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index a1aef9533154..34c89a3e8260 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -57,7 +57,6 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
-	mce_notify_irq();
 }
 EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 321c7f6e17a0..6aef4970206f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -156,7 +156,8 @@ void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+	mce_genpool_add(mce);
+	irq_work_queue(&mce_irq_work);
 
 	mce->finished = 0;
 	wmb();
@@ -1115,7 +1116,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		/* assuming valid severity level != 0 */
 		m.severity = severity;
 		m.usable_addr = mce_usable_address(&m);
-		mce_genpool_add(&m);
 
 		mce_log(&m);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 844f56c5616d..70f567f774ed 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -246,7 +246,6 @@ static void intel_threshold_interrupt(void)
 		return;
 
 	machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
-	mce_notify_irq();
 }
 
 /*
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* RE: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-06-08 13:41         ` Borislav Petkov
@ 2015-06-08 20:03           ` Luck, Tony
  2015-06-08 20:26             ` Borislav Petkov
  0 siblings, 1 reply; 15+ messages in thread
From: Luck, Tony @ 2015-06-08 20:03 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Chen, Gong, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 750 bytes --]

> So AFAINM, we want to do MCE work only after we've logged something to
> the genpool. So we can do the much simplified thing below and kick the
> workqueue from within mce_log() as everything that logs, calls that
> function.
>
> Tony, any concerns?

@@ -156,7 +156,8 @@ void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+	mce_genpool_add(mce);
+	irq_work_queue(&mce_irq_work);

Is it safe to call irq_work_queue() from MCE context?  If that is OK, then
I don't have any concerns.

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-06-08 20:03           ` Luck, Tony
@ 2015-06-08 20:26             ` Borislav Petkov
  2015-06-11  8:27               ` Borislav Petkov
  0 siblings, 1 reply; 15+ messages in thread
From: Borislav Petkov @ 2015-06-08 20:26 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Chen, Gong, linux-kernel

On Mon, Jun 08, 2015 at 08:03:08PM +0000, Luck, Tony wrote:
> @@ -156,7 +156,8 @@ void mce_log(struct mce *mce)
>  	/* Emit the trace record: */
>  	trace_mce_record(mce);
>  
> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
> +	mce_genpool_add(mce);
> +	irq_work_queue(&mce_irq_work);
> 
> Is it safe to call irq_work_queue() from MCE context?

Yeah, we're using it in contexts like:

do_nmi
|-> default_do_nmi
    |-> nmi_handle
        |->irq_work_queue

for example. I think that is good enough for #MC. :)

> If that is OK, then I don't have any concerns.

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context
  2015-06-08 20:26             ` Borislav Petkov
@ 2015-06-11  8:27               ` Borislav Petkov
  0 siblings, 0 replies; 15+ messages in thread
From: Borislav Petkov @ 2015-06-11  8:27 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Chen, Gong, linux-kernel

On Mon, Jun 08, 2015 at 10:26:58PM +0200, Borislav Petkov wrote:
> On Mon, Jun 08, 2015 at 08:03:08PM +0000, Luck, Tony wrote:
> > @@ -156,7 +156,8 @@ void mce_log(struct mce *mce)
> >  	/* Emit the trace record: */
> >  	trace_mce_record(mce);
> >  
> > -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
> > +	mce_genpool_add(mce);
> > +	irq_work_queue(&mce_irq_work);
> > 
> > Is it safe to call irq_work_queue() from MCE context?
> 
> Yeah, we're using it in contexts like:

Yeah, so that is safe but there's another issue which I triggered
yesterday. I hope the commit message explains it thoroughly. If you see
holes, please shout. I haven't run this on the EINJ box yet, that's
still outstanding.

In any case, I'm thinking of not rushing those changes to tip yet and
have them simmer in linux-next for a whole round and have them ready for
4.3. Just to be on the safe side and have some more time hammering on
them.

---
From: Borislav Petkov <bp@suse.de>
Date: Wed, 10 Jun 2015 18:49:18 +0200
Subject: [PATCH] x86/mce: Fix logging of early boot machine checks

With moving the irq_work queuing to mce_log() which simplified other
code, we're faced with the problem of logging MCEs during early boot
which might be there from before. In particular, facilities which we
rely on, like workqueues, for example, are not initialized yet. See
splat below.

As a first step, move the workqueue init before the first invocation of
mce_log() in

__mcheck_cpu_init_generic
|-> machine_check_poll
    |-> mce_log

But that's not enough because the normal system workqueue on which stuff
gets queued with schedule_work() is not initialized yet either.

Therefore, we still queue the work in the genpool but delay the
scheduling of that work to a late initcall where everything should be
initialized.

While at it, make sure we queue irq_work only if the addition to the
genpool has been successful.

  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<          (null)>]           (null)
  PGD 0
  Oops: 0010 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.0-rc7+ #12
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
  task: ffffffff8197d580 ti: ffffffff8196c000 task.ti: ffffffff8196c000
  RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)

  <registers snipped>

  Call Trace:
   <IRQ>
    ? irq_work_run_list
    irq_work_run
    smp_irq_work_interrupt
    irq_work_interrupt
   <EOI>
    ? apic_send_IPI_self
    arch_irq_work_raise
    irq_work_queue
    mce_log
    machine_check_poll
    __mcheck_cpu_init_generic
    mcheck_cpu_init
    identify_cpu
    identify_boot_cpu
    check_bugs
    start_kernel
    x86_64_start_reservations
    x86_64_start_kernel
  Code:  Bad RIP value.
  RIP  [<          (null)>]           (null)
   RSP <ffff88007b603f40>
  CR2: 0000000000000000
  ---[ end trace 6d6dd507e2636bd4 ]---
  Kernel panic - not syncing: Fatal exception in interrupt
  ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index a93a150c3583..17cef061c24a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -159,8 +159,8 @@ void mce_log(struct mce *mce)
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	mce_genpool_add(mce);
-	irq_work_queue(&mce_irq_work);
+	if (mce_genpool_add(mce))
+		irq_work_queue(&mce_irq_work);
 
 	mce->finished = 0;
 	wmb();
@@ -480,7 +480,7 @@ int mce_available(struct cpuinfo_x86 *c)
 
 static void mce_schedule_work(void)
 {
-	if (!mce_genpool_empty())
+	if (!mce_genpool_empty() && keventd_up())
 		schedule_work(&mce_work);
 }
 
@@ -648,8 +648,9 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 			if (m.status & MCI_STATUS_ADDRV) {
 				m.severity = severity;
 				m.usable_addr = mce_usable_address(&m);
-				mce_genpool_add(&m);
-				mce_schedule_work();
+
+				if (mce_genpool_add(&m))
+					mce_schedule_work();
 			}
 		}
 
@@ -1704,13 +1705,14 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 		return;
 	}
 
+	INIT_WORK(&mce_work, mce_process_work);
+	init_irq_work(&mce_irq_work, mce_irq_work_cb);
+
 	machine_check_vector = do_machine_check;
 
 	__mcheck_cpu_init_generic();
 	__mcheck_cpu_init_vendor(c);
 	__mcheck_cpu_init_timer();
-	INIT_WORK(&mce_work, mce_process_work);
-	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 }
 
 /*
@@ -2565,5 +2567,20 @@ static int __init mcheck_debugfs_init(void)
 
 	return 0;
 }
-late_initcall(mcheck_debugfs_init);
+#else
+static int __init mcheck_debugfs_init(void) {}
 #endif
+
+static int __init mcheck_late_init(void)
+{
+	mcheck_debugfs_init();
+
+	/*
+	 * Flush out everything that has been logged during early boot, now that
+	 * everything has been initialized (workqueues, decoders, ...).
+	 */
+	mce_schedule_work();
+
+	return 0;
+}
+late_initcall(mcheck_late_init);
-- 
2.3.5

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-06-11  8:28 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-20 19:35 MCE ring buffer management (Rebase) Chen, Gong
2015-05-20 19:35 ` [PATCH 1/4 Rebase] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
2015-05-20 10:36   ` Borislav Petkov
2015-05-22 21:06     ` Chen, Gong
2015-05-22  8:17       ` Borislav Petkov
2015-05-20 19:35 ` [PATCH 2/4 Rebase] x86, MCE: Don't use percpu for MCE workqueue/irq_work Chen, Gong
2015-05-20 19:35 ` [PATCH 3/4 Rebase] x86, MCE: Remove mce_ring for SRAO error Chen, Gong
2015-05-20 19:35 ` [PATCH 4/4 Rebase] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
2015-05-20  9:28   ` Borislav Petkov
2015-05-22 21:12     ` Chen, Gong
2015-05-22  9:09       ` Borislav Petkov
2015-06-08 13:41         ` Borislav Petkov
2015-06-08 20:03           ` Luck, Tony
2015-06-08 20:26             ` Borislav Petkov
2015-06-11  8:27               ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).