* x86, MCE: MCE event ring management
@ 2014-08-14  6:49 Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 1/5] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Chen, Gong @ 2014-08-14  6:49 UTC (permalink / raw)
  To: bp; +Cc: tony.luck, x86, linux-kernel

We currently have too many separate rings for managing different H/W
error events. All of them can be merged into one unified mechanism.
Furthermore, that mechanism must be reliable even in MCE context, so
that we avoid deadlocks such as the ones caused by calling printk
there. This patch series implements that mechanism.

[PATCH v2 1/5] x86, MCE: Provide a lock-less memory pool to save
[PATCH v2 2/5] x86, MCE: Don't use percpu for MCE workqueue/irq_work
[PATCH v2 3/5] x86, MCE: Remove mce_ring for SRAO error
[PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable
[PATCH v2 5/5] x86, MCE: Avoid potential deadlock in MCE context


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/5] x86, MCE: Provide a lock-less memory pool to save error record
  2014-08-14  6:49 x86, MCE: MCE event ring management Chen, Gong
@ 2014-08-14  6:49 ` Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 2/5] x86, MCE: Don't use percpu for MCE workqueue/irq_work Chen, Gong
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Chen, Gong @ 2014-08-14  6:49 UTC (permalink / raw)
  To: bp; +Cc: tony.luck, x86, linux-kernel, Chen, Gong, Borislav Petkov

printk is not safe to use in MCE context. Add a lock-less memory
allocator pool to save error records in MCE context. Issuing those
records is then delayed until a context where printk is safe. This
idea is inspired by the APEI/GHES driver.

We're very conservative and allocate only two pages for it but since
we're going to use those pages throughout the system's lifetime, we
allocate them statically to avoid early boot time allocation woes.

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1407830375-11087-1-git-send-email-gong.chen@linux.intel.com
[Boris: rewrite. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/Kconfig                          |   1 +
 arch/x86/include/uapi/asm/mce.h           |   3 +-
 arch/x86/kernel/cpu/mcheck/Makefile       |   2 +-
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  | 102 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  11 ++++
 arch/x86/kernel/cpu/mcheck/mce.c          |   6 ++
 6 files changed, 123 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/mcheck/mce-genpool.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4aafd322e21e..706dc3c8f6d6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -866,6 +866,7 @@ config X86_REROUTE_FOR_BROKEN_BOOT_IRQS
 
 config X86_MCE
 	bool "Machine Check / overheating reporting"
+	select GENERIC_ALLOCATOR
 	default y
 	---help---
 	  Machine Check support allows the processor to notify the
diff --git a/arch/x86/include/uapi/asm/mce.h b/arch/x86/include/uapi/asm/mce.h
index a0eab85ce7b8..577acacaf515 100644
--- a/arch/x86/include/uapi/asm/mce.h
+++ b/arch/x86/include/uapi/asm/mce.h
@@ -15,7 +15,8 @@ struct mce {
 	__u64 time;	/* wall time_t when error was detected */
 	__u8  cpuvendor;	/* cpu vendor as encoded in system.h */
 	__u8  inject_flags;	/* software inject flags */
-	__u16  pad;
+	__u8  severity;
+	__u8  usable_addr;	/* addr is usable */
 	__u32 cpuid;	/* CPUID 1 EAX */
 	__u8  cs;		/* code segment */
 	__u8  bank;	/* machine check bank */
diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
index bb34b03af252..a3311c886194 100644
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ b/arch/x86/kernel/cpu/mcheck/Makefile
@@ -1,4 +1,4 @@
-obj-y				=  mce.o mce-severity.o
+obj-y				=  mce.o mce-severity.o mce-genpool.o
 
 obj-$(CONFIG_X86_ANCIENT_MCE)	+= winchip.o p5.o
 obj-$(CONFIG_X86_MCE_INTEL)	+= mce_intel.o
diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
new file mode 100644
index 000000000000..a0a0ad4b8778
--- /dev/null
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -0,0 +1,102 @@
+/*
+ * MCE event pool management in MCE context
+ *
+ * Copyright (C) 2014 Intel Corp.
+ * Author: Chen, Gong <gong.chen@linux.intel.com>
+ *
+ * This file is licensed under GPLv2.
+ */
+#include <linux/smp.h>
+#include <linux/mm.h>
+#include <linux/genalloc.h>
+#include <linux/llist.h>
+#include "mce-internal.h"
+
+/*
+ * printk is not safe in MCE context thus a lock-less memory allocator
+ * (genpool) is used to save error information organized in a lock-less
+ * list.
+ */
+
+#define GENPOOL_BUFSZ	(2 * PAGE_SIZE)
+
+static struct gen_pool *mce_evt_pool;
+static LLIST_HEAD(mce_event_llist);
+static char genpool_buf[GENPOOL_BUFSZ];
+
+/*
+ * This memory pool is only to be used to save MCE records in MCE context.
+ * Since MCE events are rare, a fixed size memory pool should be enough.
+ * Keep 2 pages to save MCE events for now (~80 MCE records at most).
+ */
+static int mce_genpool_create(void)
+{
+	struct gen_pool *tmpp;
+	int ret = -ENOMEM;
+
+	tmpp = gen_pool_create(ilog2(sizeof(struct mce_evt_llist)), -1);
+	if (!tmpp)
+		goto out;
+
+	ret = gen_pool_add(tmpp, (unsigned long)genpool_buf, GENPOOL_BUFSZ, -1);
+	if (ret) {
+		gen_pool_destroy(tmpp);
+		goto out;
+	}
+
+	mce_evt_pool = tmpp;
+
+out:
+	return ret;
+}
+
+void mce_genpool_process(void)
+{
+	struct llist_node *head;
+	struct mce_evt_llist *node;
+	struct mce *mce;
+
+	head = llist_del_all(&mce_event_llist);
+	if (!head)
+		return;
+
+	head = llist_reverse_order(head);
+	llist_for_each_entry(node, head, llnode) {
+		mce = &node->mce;
+		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+		gen_pool_free(mce_evt_pool, (unsigned long)node, sizeof(*node));
+	}
+}
+
+bool mce_genpool_empty(void)
+{
+	return llist_empty(&mce_event_llist);
+}
+
+bool mce_genpool_add(struct mce *mce)
+{
+	struct mce_evt_llist *node;
+
+	if (!mce_evt_pool)
+		return false;
+
+	node = (void *)gen_pool_alloc(mce_evt_pool, sizeof(*node));
+	if (!node) {
+		pr_warn_ratelimited("MCE records pool overflow!\n");
+		return false;
+	}
+
+	memcpy(&node->mce, mce, sizeof(*mce));
+	llist_add(&node->llnode, &mce_event_llist);
+
+	return true;
+}
+
+int mce_genpool_init(void)
+{
+	/* Just init mce_genpool once. */
+	if (mce_evt_pool)
+		return 0;
+
+	return mce_genpool_create();
+}
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 09edd0b65fef..6f0ea3db02c8 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -11,6 +11,8 @@ enum severity_level {
 	MCE_PANIC_SEVERITY,
 };
 
+extern struct atomic_notifier_head x86_mce_decoder_chain;
+
 #define ATTR_LEN		16
 
 /* One object for each MCE bank, shared by all CPUs */
@@ -21,6 +23,15 @@ struct mce_bank {
 	char			attrname[ATTR_LEN];	/* attribute name */
 };
 
+struct mce_evt_llist {
+	struct llist_node llnode;
+	struct mce mce;
+};
+
+void mce_genpool_process(void);
+bool mce_genpool_empty(void);
+bool mce_genpool_add(struct mce *mce);
+int mce_genpool_init(void);
 int mce_severity(struct mce *a, int tolerant, char **msg);
 struct dentry *mce_get_debugfs_dir(void);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bd9ccda8087f..b2c8bb83672a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1686,6 +1686,12 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 	if (mca_cfg.disabled)
 		return;
 
+	if (mce_genpool_init()) {
+		mca_cfg.disabled = true;
+		pr_emerg("Couldn't allocate MCE records pool!\n");
+		return;
+	}
+
 	if (__mcheck_cpu_ancient_init(c))
 		return;
 
-- 
2.0.0.rc2



* [PATCH v2 2/5] x86, MCE: Don't use percpu for MCE workqueue/irq_work
  2014-08-14  6:49 x86, MCE: MCE event ring management Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 1/5] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
@ 2014-08-14  6:49 ` Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 3/5] x86, MCE: Remove mce_ring for SRAO error Chen, Gong
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Chen, Gong @ 2014-08-14  6:49 UTC (permalink / raw)
  To: bp; +Cc: tony.luck, x86, linux-kernel, Chen, Gong, Borislav Petkov

An MCE is considered a rare event. Therefore, there's no need to have
per-CPU instances of both normal and IRQ workqueues. Make them both
global.

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406797523-28710-3-git-send-email-gong.chen@linux.intel.com
[Boris: massage commit message]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index b2c8bb83672a..7534cfce43e9 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -107,7 +107,8 @@ DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
  */
 mce_banks_t mce_banks_ce_disabled;
 
-static DEFINE_PER_CPU(struct work_struct, mce_work);
+static struct work_struct mce_work;
+static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
@@ -526,11 +527,9 @@ int mce_available(struct cpuinfo_x86 *c)
 static void mce_schedule_work(void)
 {
 	if (!mce_ring_empty())
-		schedule_work(&__get_cpu_var(mce_work));
+		schedule_work(&mce_work);
 }
 
-DEFINE_PER_CPU(struct irq_work, mce_irq_work);
-
 static void mce_irq_work_cb(struct irq_work *entry)
 {
 	mce_notify_irq();
@@ -551,7 +550,7 @@ static void mce_report_event(struct pt_regs *regs)
 		return;
 	}
 
-	irq_work_queue(&__get_cpu_var(mce_irq_work));
+	irq_work_queue(&mce_irq_work);
 }
 
 /*
@@ -1708,8 +1707,8 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 	__mcheck_cpu_init_generic();
 	__mcheck_cpu_init_vendor(c);
 	__mcheck_cpu_init_timer();
-	INIT_WORK(&__get_cpu_var(mce_work), mce_process_work);
-	init_irq_work(&__get_cpu_var(mce_irq_work), &mce_irq_work_cb);
+	INIT_WORK(&mce_work, mce_process_work);
+	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 }
 
 /*
-- 
2.0.0.rc2



* [PATCH v2 3/5] x86, MCE: Remove mce_ring for SRAO error
  2014-08-14  6:49 x86, MCE: MCE event ring management Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 1/5] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 2/5] x86, MCE: Don't use percpu for MCE workqueue/irq_work Chen, Gong
@ 2014-08-14  6:49 ` Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors Chen, Gong
  2014-08-14  6:49 ` [PATCH v2 5/5] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
  4 siblings, 0 replies; 19+ messages in thread
From: Chen, Gong @ 2014-08-14  6:49 UTC (permalink / raw)
  To: bp; +Cc: tony.luck, x86, linux-kernel, Chen, Gong, Borislav Petkov

Use the unified genpool to save SRAO error events and put AO error
handling in the same notification chain as MCE error decoding.

Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406797523-28710-4-git-send-email-gong.chen@linux.intel.com
[ Boris: correct a lot. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h       |   2 +-
 arch/x86/kernel/cpu/mcheck/mce.c | 110 ++++++++++++++-------------------------
 drivers/acpi/acpi_extlog.c       |   2 +-
 drivers/edac/i7core_edac.c       |   2 +-
 drivers/edac/mce_amd.c           |   2 +-
 drivers/edac/sb_edac.c           |   2 +-
 6 files changed, 43 insertions(+), 77 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 958b90f761e5..a713dabd000b 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -114,7 +114,7 @@ struct mca_config {
 };
 
 extern struct mca_config mca_cfg;
-extern void mce_register_decode_chain(struct notifier_block *nb);
+extern void mce_register_decode_chain(struct notifier_block *nb, bool drain);
 extern void mce_unregister_decode_chain(struct notifier_block *nb);
 
 #include <linux/percpu.h>
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 7534cfce43e9..0e64a3c69b2b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -234,11 +234,18 @@ static void drain_mcelog_buffer(void)
 	} while (next != prev);
 }
 
+static struct notifier_block mce_srao_nb;
 
-void mce_register_decode_chain(struct notifier_block *nb)
+void mce_register_decode_chain(struct notifier_block *nb, bool drain)
 {
+	/* Ensure SRAO notifier has the highest priority in the decode chain. */
+	if (nb != &mce_srao_nb && nb->priority == INT_MAX)
+		nb->priority -= 1;
+
 	atomic_notifier_chain_register(&x86_mce_decoder_chain, nb);
-	drain_mcelog_buffer();
+
+	if (drain)
+		drain_mcelog_buffer();
 }
 EXPORT_SYMBOL_GPL(mce_register_decode_chain);
 
@@ -462,61 +469,6 @@ static inline void mce_gather_info(struct mce *m, struct pt_regs *regs)
 	}
 }
 
-/*
- * Simple lockless ring to communicate PFNs from the exception handler with the
- * process context work function. This is vastly simplified because there's
- * only a single reader and a single writer.
- */
-#define MCE_RING_SIZE 16	/* we use one entry less */
-
-struct mce_ring {
-	unsigned short start;
-	unsigned short end;
-	unsigned long ring[MCE_RING_SIZE];
-};
-static DEFINE_PER_CPU(struct mce_ring, mce_ring);
-
-/* Runs with CPU affinity in workqueue */
-static int mce_ring_empty(void)
-{
-	struct mce_ring *r = &__get_cpu_var(mce_ring);
-
-	return r->start == r->end;
-}
-
-static int mce_ring_get(unsigned long *pfn)
-{
-	struct mce_ring *r;
-	int ret = 0;
-
-	*pfn = 0;
-	get_cpu();
-	r = &__get_cpu_var(mce_ring);
-	if (r->start == r->end)
-		goto out;
-	*pfn = r->ring[r->start];
-	r->start = (r->start + 1) % MCE_RING_SIZE;
-	ret = 1;
-out:
-	put_cpu();
-	return ret;
-}
-
-/* Always runs in MCE context with preempt off */
-static int mce_ring_add(unsigned long pfn)
-{
-	struct mce_ring *r = &__get_cpu_var(mce_ring);
-	unsigned next;
-
-	next = (r->end + 1) % MCE_RING_SIZE;
-	if (next == r->start)
-		return -1;
-	r->ring[r->end] = pfn;
-	wmb();
-	r->end = next;
-	return 0;
-}
-
 int mce_available(struct cpuinfo_x86 *c)
 {
 	if (mca_cfg.disabled)
@@ -526,7 +478,7 @@ int mce_available(struct cpuinfo_x86 *c)
 
 static void mce_schedule_work(void)
 {
-	if (!mce_ring_empty())
+	if (!mce_genpool_empty())
 		schedule_work(&mce_work);
 }
 
@@ -553,6 +505,27 @@ static void mce_report_event(struct pt_regs *regs)
 	irq_work_queue(&mce_irq_work);
 }
 
+static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
+				void *data)
+{
+	struct mce *mce = (struct mce *)data;
+	unsigned long pfn;
+
+	if (!mce)
+		return NOTIFY_DONE;
+
+	if (mce->usable_addr && (mce->severity == MCE_AO_SEVERITY)) {
+		pfn = mce->addr >> PAGE_SHIFT;
+		memory_failure(pfn, MCE_VECTOR, 0);
+	}
+
+	return NOTIFY_OK;
+}
+static struct notifier_block mce_srao_nb = {
+	.notifier_call	= srao_decode_notifier,
+	.priority = INT_MAX,
+};
+
 /*
  * Read ADDR and MISC registers.
  */
@@ -1113,15 +1086,10 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 		mce_read_aux(&m, i);
 
-		/*
-		 * Action optional error. Queue address for later processing.
-		 * When the ring overflows we just ignore the AO error.
-		 * RED-PEN add some logging mechanism when
-		 * usable_address or mce_add_ring fails.
-		 * RED-PEN don't ignore overflow for mca_cfg.tolerant == 0
-		 */
-		if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
-			mce_ring_add(m.addr >> PAGE_SHIFT);
+		/* assuming valid severity level != 0 */
+		m.severity = severity;
+		m.usable_addr = mce_usable_address(&m);
+		mce_genpool_add(&m);
 
 		mce_log(&m);
 
@@ -1222,14 +1190,11 @@ void mce_notify_process(void)
 /*
  * Action optional processing happens here (picking up
  * from the list of faulting pages that do_machine_check()
- * placed into the "ring").
+ * placed into the genpool).
  */
 static void mce_process_work(struct work_struct *dummy)
 {
-	unsigned long pfn;
-
-	while (mce_ring_get(&pfn))
-		memory_failure(pfn, MCE_VECTOR, 0);
+	mce_genpool_process();
 }
 
 #ifdef CONFIG_X86_MCE_INTEL
@@ -2027,6 +1992,7 @@ __setup("mce", mcheck_enable);
 int __init mcheck_init(void)
 {
 	mcheck_intel_therm_init();
+	mce_register_decode_chain(&mce_srao_nb, false);
 
 	return 0;
 }
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index b3842ffc19ba..07e012e74c1b 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -286,7 +286,7 @@ static int __init extlog_init(void)
 	 */
 	old_edac_report_status = get_edac_report_status();
 	set_edac_report_status(EDAC_REPORTING_DISABLED);
-	mce_register_decode_chain(&extlog_mce_dec);
+	mce_register_decode_chain(&extlog_mce_dec, true);
 	/* enable OS to be involved to take over management from BIOS */
 	((struct extlog_l1_head *)extlog_l1_addr)->flags |= FLAG_OS_OPTIN;
 
diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 9cd0b301f81b..7e347c4ce641 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -2432,7 +2432,7 @@ static int __init i7core_init(void)
 	pci_rc = pci_register_driver(&i7core_driver);
 
 	if (pci_rc >= 0) {
-		mce_register_decode_chain(&i7_mce_dec);
+		mce_register_decode_chain(&i7_mce_dec, true);
 		return 0;
 	}
 
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index f78c1c54dbd5..718c41c41f07 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -914,7 +914,7 @@ static int __init mce_amd_init(void)
 
 	pr_info("MCE: In-kernel MCE decoding enabled.\n");
 
-	mce_register_decode_chain(&amd_mce_dec_nb);
+	mce_register_decode_chain(&amd_mce_dec_nb, true);
 
 	return 0;
 }
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index deea0dc9999b..d2dc2929568d 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -2153,7 +2153,7 @@ static int __init sbridge_init(void)
 
 	pci_rc = pci_register_driver(&sbridge_driver);
 	if (pci_rc >= 0) {
-		mce_register_decode_chain(&sbridge_mce_dec);
+		mce_register_decode_chain(&sbridge_mce_dec, true);
 		if (get_edac_report_status() == EDAC_REPORTING_DISABLED)
 			sbridge_printk(KERN_WARNING, "Loading driver, error reporting disabled.\n");
 		return 0;
-- 
2.0.0.rc2



* [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-08-14  6:49 x86, MCE: MCE event ring management Chen, Gong
                   ` (2 preceding siblings ...)
  2014-08-14  6:49 ` [PATCH v2 3/5] x86, MCE: Remove mce_ring for SRAO error Chen, Gong
@ 2014-08-14  6:49 ` Chen, Gong
  2014-11-11 11:42   ` Borislav Petkov
  2014-08-14  6:49 ` [PATCH v2 5/5] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
  4 siblings, 1 reply; 19+ messages in thread
From: Chen, Gong @ 2014-08-14  6:49 UTC (permalink / raw)
  To: bp; +Cc: tony.luck, x86, linux-kernel, Chen, Gong

From: Tony Luck <tony.luck@intel.com>

Handling of uncorrected, but s/w recoverable, errors is a two-phase process:

1) In the machine check handler we identify the signature of a recoverable
   error and use TIF_MCE_NOTIFY to mark the process so that we will not
   return to ring3 and re-execute the instruction that accessed the failed
   memory location.
2) In process context we see TIF_MCE_NOTIFY set and call mce_notify_process()
   to do all the recovery work that we could not do in machine check context.

We need to pass the memory address and a flag to say whether the faulting
instruction can be continued from stage "1" to stage "2".  The existing
code to do this is baroque.  We hunt through a fixed sized array of
"mce_info" structures and use atomic_cmpxchg to claim a free one (we
panic if all are busy). Then in phase 2 we scan back through the mce_info
array looking for the one that matches the current process.

We can do much better by admitting that the failing address found in
phase 1 is a property of the current task. We can only recover from
errors detected while trying to retire an instruction in ring3 (s/w
limitation). So we know that MCE_AR_SEVERITY means that we are running
on the cpu that was executing user code and hit an error that triggered
this machine check.  This is the reason why TIF_MCE_NOTIFY has been set,
and the memory error must be dealt with before we can either allow this
process to continue (if we can replace the page) or not (signalling the
process if the memory has been lost).

So we add two fields to the task structure to be saved in phase 1 and
retrieved in phase 2.  The mce_info[] array and routines to add/find/clear
entries from it can all be deleted.

Gong: while at it, add the missing CONFIG_MEMORY_FAILURE guard and a
'\n' at the end of the printk message.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h       |  2 ++
 arch/x86/kernel/cpu/mcheck/mce.c | 67 +++++++---------------------------------
 arch/x86/kernel/signal.c         |  2 +-
 include/linux/sched.h            |  4 +++
 4 files changed, 19 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index a713dabd000b..76e706767655 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -187,7 +187,9 @@ enum mcp_flags {
 void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
+#ifdef CONFIG_MEMORY_FAILURE
 void mce_notify_process(void);
+#endif
 
 DECLARE_PER_CPU(struct mce, injectm);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 0e64a3c69b2b..b0a25d5f7185 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -928,51 +928,6 @@ static void mce_clear_state(unsigned long *toclear)
 }
 
 /*
- * Need to save faulting physical address associated with a process
- * in the machine check handler some place where we can grab it back
- * later in mce_notify_process()
- */
-#define	MCE_INFO_MAX	16
-
-struct mce_info {
-	atomic_t		inuse;
-	struct task_struct	*t;
-	__u64			paddr;
-	int			restartable;
-} mce_info[MCE_INFO_MAX];
-
-static void mce_save_info(__u64 addr, int c)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++) {
-		if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
-			mi->t = current;
-			mi->paddr = addr;
-			mi->restartable = c;
-			return;
-		}
-	}
-
-	mce_panic("Too many concurrent recoverable errors", NULL, NULL);
-}
-
-static struct mce_info *mce_find_info(void)
-{
-	struct mce_info *mi;
-
-	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++)
-		if (atomic_read(&mi->inuse) && mi->t == current)
-			return mi;
-	return NULL;
-}
-
-static void mce_clear_info(struct mce_info *mi)
-{
-	atomic_set(&mi->inuse, 0);
-}
-
-/*
  * The actual machine check handler. This only handles real
  * exceptions when something got corrupted coming in through int 18.
  *
@@ -1121,11 +1076,15 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	if (cfg->tolerant < 3) {
 		if (no_way_out)
 			mce_panic("Fatal machine check on current CPU", &m, msg);
+#ifdef CONFIG_MEMORY_FAILURE
 		if (worst == MCE_AR_SEVERITY) {
 			/* schedule action before return to userland */
-			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
+			current->paddr = m.addr;
+			current->restartable = !!(m.mcgstatus & MCG_STATUS_RIPV);
 			set_thread_flag(TIF_MCE_NOTIFY);
-		} else if (kill_it) {
+		} else
+#endif
+		if (kill_it) {
 			force_sig(SIGBUS, current);
 		}
 	}
@@ -1159,33 +1118,31 @@ int memory_failure(unsigned long pfn, int vector, int flags)
  * process any corrupted pages, and kill/signal current process if required.
  * Action required errors are handled here.
  */
+#ifdef CONFIG_MEMORY_FAILURE
 void mce_notify_process(void)
 {
 	unsigned long pfn;
-	struct mce_info *mi = mce_find_info();
 	int flags = MF_ACTION_REQUIRED;
 
-	if (!mi)
-		mce_panic("Lost physical address for unconsumed uncorrectable error", NULL, NULL);
-	pfn = mi->paddr >> PAGE_SHIFT;
+	pfn = current->paddr >> PAGE_SHIFT;
 
 	clear_thread_flag(TIF_MCE_NOTIFY);
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx",
-		 mi->paddr);
+	pr_err("Uncorrected hardware memory error in user-access at %llx\n",
+		 current->paddr);
 	/*
 	 * We must call memory_failure() here even if the current process is
 	 * doomed. We still need to mark the page as poisoned and alert any
 	 * other users of the page.
 	 */
-	if (!mi->restartable)
+	if (!current->restartable)
 		flags |= MF_MUST_KILL;
 	if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
 		pr_err("Memory error not recovered");
 		force_sig(SIGBUS, current);
 	}
-	mce_clear_info(mi);
 }
+#endif
 
 /*
  * Action optional processing happens here (picking up
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 2851d63c1202..28fd70560b96 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -735,7 +735,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 {
 	user_exit();
 
-#ifdef CONFIG_X86_MCE
+#ifdef CONFIG_MEMORY_FAILURE
 	/* notify userspace of pending MCEs */
 	if (thread_info_flags & _TIF_MCE_NOTIFY)
 		mce_notify_process();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40426ba..b41e839b9e1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1645,6 +1645,10 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	__u64	paddr;
+	int	restartable;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
-- 
2.0.0.rc2



* [PATCH v2 5/5] x86, MCE: Avoid potential deadlock in MCE context
  2014-08-14  6:49 x86, MCE: MCE event ring management Chen, Gong
                   ` (3 preceding siblings ...)
  2014-08-14  6:49 ` [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors Chen, Gong
@ 2014-08-14  6:49 ` Chen, Gong
  4 siblings, 0 replies; 19+ messages in thread
From: Chen, Gong @ 2014-08-14  6:49 UTC (permalink / raw)
  To: bp; +Cc: tony.luck, x86, linux-kernel, Chen, Gong, Borislav Petkov

Printing in MCE context is a no-no because printk is not NMI-safe. If
any of the notifiers on the MCE chain call printk, we may deadlock. To
avoid that, delay the printk calls until process context.

Background info at: https://lkml.org/lkml/2014/6/27/26

Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1406797523-28710-6-git-send-email-gong.chen@linux.intel.com
[ Boris: rewrite a bit. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h               |  1 +
 arch/x86/kernel/cpu/mcheck/mce-apei.c    |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c         | 11 ++++++-----
 arch/x86/kernel/cpu/mcheck/mce_intel.c   |  2 +-
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  1 +
 arch/x86/kernel/cpu/mcheck/threshold.c   |  1 +
 6 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 76e706767655..be2ec2902830 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -187,6 +187,7 @@ enum mcp_flags {
 void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
+void mce_queue_irq_work(void);
 #ifdef CONFIG_MEMORY_FAILURE
 void mce_notify_process(void);
 #endif
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index a1aef9533154..380e3ac8fb62 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -57,7 +57,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
-	mce_notify_irq();
+	mce_queue_irq_work();
 }
 EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index b0a25d5f7185..c7ac50d8c4b5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -151,14 +151,11 @@ static struct mce_log mcelog = {
 void mce_log(struct mce *mce)
 {
 	unsigned next, entry;
-	int ret = 0;
 
 	/* Emit the trace record: */
 	trace_mce_record(mce);
 
-	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
-	if (ret == NOTIFY_STOP)
-		return;
+	mce_genpool_add(mce);
 
 	mce->finished = 0;
 	wmb();
@@ -488,6 +485,11 @@ static void mce_irq_work_cb(struct irq_work *entry)
 	mce_schedule_work();
 }
 
+void mce_queue_irq_work(void)
+{
+	irq_work_queue(&mce_irq_work);
+}
+
 static void mce_report_event(struct pt_regs *regs)
 {
 	if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
@@ -1044,7 +1046,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		/* assuming valid severity level != 0 */
 		m.severity = severity;
 		m.usable_addr = mce_usable_address(&m);
-		mce_genpool_add(&m);
 
 		mce_log(&m);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 3bdb95ae8c43..ec2cf614152e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -195,8 +195,8 @@ static void intel_threshold_interrupt(void)
 {
 	if (cmci_storm_detect())
 		return;
+
 	machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
-	mce_notify_irq();
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 36a1bb6d1ee0..d100e2bb8ed7 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -427,6 +427,7 @@ static inline void __smp_thermal_interrupt(void)
 {
 	inc_irq_stat(irq_thermal_count);
 	smp_thermal_vector();
+	mce_queue_irq_work();
 }
 
 asmlinkage __visible void smp_thermal_interrupt(struct pt_regs *regs)
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
index 7245980186ee..d695faa234eb 100644
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ b/arch/x86/kernel/cpu/mcheck/threshold.c
@@ -22,6 +22,7 @@ static inline void __smp_threshold_interrupt(void)
 {
 	inc_irq_stat(irq_threshold_count);
 	mce_threshold_vector();
+	mce_queue_irq_work();
 }
 
 asmlinkage __visible void smp_threshold_interrupt(void)
-- 
2.0.0.rc2



* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-08-14  6:49 ` [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors Chen, Gong
@ 2014-11-11 11:42   ` Borislav Petkov
  2014-11-11 15:42     ` Andy Lutomirski
  2014-11-12 17:20     ` Oleg Nesterov
  0 siblings, 2 replies; 19+ messages in thread
From: Borislav Petkov @ 2014-11-11 11:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Chen, Gong, tony.luck, x86, Peter Zijlstra, Oleg Nesterov,
	Andy Lutomirski

Let me CC some more people:

So guys, we would like to mark a task as to-be-killed if it has
encountered an uncorrectable error during processing.

The current code is nasty (see Tony's patch and commit message below)
and we would like to be able to say:

	current->restartable = true/false;

in the machine check handler and then, when we get back to userspace but
*before* we sched in the task, to check whether it is restartable and if
not, to kill it and poison the memory it has been utilizing so that we
can contain the error instead of killing the whole system.

Currently, we're piggybacking on
do_notify_resume()->mce_notify_process() but I hear this is not that
clean and nice so I'd be open to suggestions on how to do this in a
nicer/cleaner way.

Roughly speaking, we want to be able to mark a task with the sign of
death and to kill it, if needed. The important part is *before* it
gets to run again. And the approach with adding two more fields to
task_struct seems cleaner than what we have currently.

Any suggestion is appreciated,
thanks!

Oh, and btw, sorry for the top-posting - leaving in the whole mail for
reference.
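[Editor's note] The two-phase scheme described above can be mocked up in plain C. Everything here is a hypothetical stand-in (a toy `struct task`, `mce_pending` in place of TIF_MCE_NOTIFY), not kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_RIPV    (1ULL << 0)	/* restart IP valid, SDM bit 0 */
#define MF_ACTION_REQUIRED 0x1
#define MF_MUST_KILL       0x2

struct task {
	uint64_t paddr;		/* the two proposed task_struct fields */
	bool     restartable;
	bool     mce_pending;	/* stands in for TIF_MCE_NOTIFY */
};

/* Phase 1, #MC context: record state only, nothing that can sleep. */
static void mc_handler(struct task *cur, uint64_t addr, uint64_t mcgstatus)
{
	cur->paddr = addr;
	cur->restartable = !!(mcgstatus & MCG_STATUS_RIPV);
	cur->mce_pending = true;
}

/* Phase 2, process context, before the task runs again: returns the
 * flags the real code would pass to memory_failure(). */
static int resume_check(struct task *cur)
{
	int flags = MF_ACTION_REQUIRED;

	if (!cur->mce_pending)
		return 0;
	cur->mce_pending = false;
	if (!cur->restartable)
		flags |= MF_MUST_KILL;	/* cannot re-execute: must kill */
	return flags;
}
```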

On Thu, Aug 14, 2014 at 02:49:45AM -0400, Chen, Gong wrote:
> From: Tony Luck <tony.luck@intel.com>
> 
> Handling of uncorrected, but s/w recoverable, errors is a two-phase process:
> 
> 1) In the machine check handler we identify the signature of a recoverable
>    error and use TIF_MCE_NOTIFY to mark the process so that we will not
>    return to ring3 and re-execute the instruction that accessed the failed
>    memory location.
> 2) In process context we see TIF_MCE_NOTIFY set and call mce_notify_process()
>    to do all the recovery work that we could not do in machine check context.
> 
> We need to pass the memory address and a flag to say whether the faulting
> instruction can be continued from stage "1" to stage "2".  The existing
> code to do this is baroque.  We hunt through a fixed-size array of
> "mce_info" structures and use atomic_cmpxchg to claim a free one (we
> panic if all are busy). Then in phase 2 we scan back through the mce_info
> array looking for the one that matches the current process.
> 
> We can do much better by admitting that the failing address found in
> phase 1 is a property of the current task. We can only recover from
> errors detected while trying to retire an instruction in ring3 (s/w
> limitation). So we know that MCE_AR_SEVERITY means that we are running
> on the cpu that was executing user code and hit an error that triggered
> this machine check.  This is the reason why TIF_MCE_NOTIFY has been set,
> and the memory error must be dealt with before we can either allow this
> process to continue (if we can replace the page) or not (signalling the
> process if the memory has been lost).
> 
> So we add two fields to the task structure to be saved in phase 1 and
> retrieved in phase 2.  The mce_info[] array and routines to add/find/clear
> entries from it can all be deleted.
> 
> Gong: Add missed CONFIG_MEMORY_FAILURE & '\n' to the end of printk
> message, while at it.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
> Acked-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/include/asm/mce.h       |  2 ++
>  arch/x86/kernel/cpu/mcheck/mce.c | 67 +++++++---------------------------------
>  arch/x86/kernel/signal.c         |  2 +-
>  include/linux/sched.h            |  4 +++
>  4 files changed, 19 insertions(+), 56 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index a713dabd000b..76e706767655 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -187,7 +187,9 @@ enum mcp_flags {
>  void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
>  
>  int mce_notify_irq(void);
> +#ifdef CONFIG_MEMORY_FAILURE
>  void mce_notify_process(void);
> +#endif
>  
>  DECLARE_PER_CPU(struct mce, injectm);
>  
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 0e64a3c69b2b..b0a25d5f7185 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -928,51 +928,6 @@ static void mce_clear_state(unsigned long *toclear)
>  }
>  
>  /*
> - * Need to save faulting physical address associated with a process
> - * in the machine check handler some place where we can grab it back
> - * later in mce_notify_process()
> - */
> -#define	MCE_INFO_MAX	16
> -
> -struct mce_info {
> -	atomic_t		inuse;
> -	struct task_struct	*t;
> -	__u64			paddr;
> -	int			restartable;
> -} mce_info[MCE_INFO_MAX];
> -
> -static void mce_save_info(__u64 addr, int c)
> -{
> -	struct mce_info *mi;
> -
> -	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++) {
> -		if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
> -			mi->t = current;
> -			mi->paddr = addr;
> -			mi->restartable = c;
> -			return;
> -		}
> -	}
> -
> -	mce_panic("Too many concurrent recoverable errors", NULL, NULL);
> -}
> -
> -static struct mce_info *mce_find_info(void)
> -{
> -	struct mce_info *mi;
> -
> -	for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++)
> -		if (atomic_read(&mi->inuse) && mi->t == current)
> -			return mi;
> -	return NULL;
> -}
> -
> -static void mce_clear_info(struct mce_info *mi)
> -{
> -	atomic_set(&mi->inuse, 0);
> -}
> -
> -/*
>   * The actual machine check handler. This only handles real
>   * exceptions when something got corrupted coming in through int 18.
>   *
> @@ -1121,11 +1076,15 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	if (cfg->tolerant < 3) {
>  		if (no_way_out)
>  			mce_panic("Fatal machine check on current CPU", &m, msg);
> +#ifdef CONFIG_MEMORY_FAILURE
>  		if (worst == MCE_AR_SEVERITY) {
>  			/* schedule action before return to userland */
> -			mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
> +			current->paddr = m.addr;
> +			current->restartable = !!(m.mcgstatus & MCG_STATUS_RIPV);
>  			set_thread_flag(TIF_MCE_NOTIFY);
> -		} else if (kill_it) {
> +		} else
> +#endif
> +		if (kill_it) {
>  			force_sig(SIGBUS, current);
>  		}
>  	}
> @@ -1159,33 +1118,31 @@ int memory_failure(unsigned long pfn, int vector, int flags)
>   * process any corrupted pages, and kill/signal current process if required.
>   * Action required errors are handled here.
>   */
> +#ifdef CONFIG_MEMORY_FAILURE
>  void mce_notify_process(void)
>  {
>  	unsigned long pfn;
> -	struct mce_info *mi = mce_find_info();
>  	int flags = MF_ACTION_REQUIRED;
>  
> -	if (!mi)
> -		mce_panic("Lost physical address for unconsumed uncorrectable error", NULL, NULL);
> -	pfn = mi->paddr >> PAGE_SHIFT;
> +	pfn = current->paddr >> PAGE_SHIFT;
>  
>  	clear_thread_flag(TIF_MCE_NOTIFY);
>  
> -	pr_err("Uncorrected hardware memory error in user-access at %llx",
> -		 mi->paddr);
> +	pr_err("Uncorrected hardware memory error in user-access at %llx\n",
> +		 current->paddr);
>  	/*
>  	 * We must call memory_failure() here even if the current process is
>  	 * doomed. We still need to mark the page as poisoned and alert any
>  	 * other users of the page.
>  	 */
> -	if (!mi->restartable)
> +	if (!current->restartable)
>  		flags |= MF_MUST_KILL;
>  	if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
>  		pr_err("Memory error not recovered");
>  		force_sig(SIGBUS, current);
>  	}
> -	mce_clear_info(mi);
>  }
> +#endif
>  
>  /*
>   * Action optional processing happens here (picking up
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index 2851d63c1202..28fd70560b96 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -735,7 +735,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
>  {
>  	user_exit();
>  
> -#ifdef CONFIG_X86_MCE
> +#ifdef CONFIG_MEMORY_FAILURE
>  	/* notify userspace of pending MCEs */
>  	if (thread_info_flags & _TIF_MCE_NOTIFY)
>  		mce_notify_process();
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 857ba40426ba..b41e839b9e1e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1645,6 +1645,10 @@ struct task_struct {
>  	unsigned int	sequential_io;
>  	unsigned int	sequential_io_avg;
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	__u64	paddr;
> +	int	restartable;
> +#endif
>  };
>  
>  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> -- 
> 2.0.0.rc2
> 
> 

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 11:42   ` Borislav Petkov
@ 2014-11-11 15:42     ` Andy Lutomirski
  2014-11-11 16:13       ` Borislav Petkov
  2014-11-12 17:20     ` Oleg Nesterov
  1 sibling, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2014-11-11 15:42 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Nov 11, 2014 3:42 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> Let me CC some more people:
>
> So guys, we would like to mark a task as to-be-killed if it has
> encountered an uncorrectable error during processing.
>
> The current code is nasty (see Tony's patch and commit message below)
> and we would like to be able to say:
>
>         current->restartable = true/false;
>
> in the machine check handler and then, when we get back to userspace but
> *before* we sched in the task, to check whether it is restartable and if
> not, to kill it and poison the memory it has been utilizing so that we
> can contain the error instead of killing the whole system.
>
> Currently, we're piggybacking on
> do_notify_resume()->mce_notify_process() but I hear this is not that
> clean and nice so I'd be open to suggestions on how to do this in a
> nicer/cleaner way.
>
> Roughly speaking, we want to be able to mark a task with the sign of
> death and to kill it, if needed. The important part is *before* it
> gets to run again. And the approach with adding two more fields to
> task_struct seems cleaner than what we have currently.

The last time I looked at the MCE code, I got a bit lost in the
control flow.  Is there ever a userspace-killing MCE that's delivered
from kernel mode?  By that, I mean that I think that all
userspace-killing MCEs have user_mode_vm(regs) and go through
paranoid_exit.

If so, why do you need to jump through hoops at all?  You can't call
do_exit, but it should be completely safe to force a fatal signal and
let the scheduler and signal code take care of killing the process,
right?  For that matter, you should also be able to poke at vm
structures, etc.

Or is there a meaningful case where mce_notify_process needs to help
with recovery but the original MCE happened with !user_mode_vm(regs)?

--Andy

>
> Any suggestion is appreciated,
> thanks!
>
> Oh, and btw, sorry for the top-posting - leaving in the whole mail for
> reference.
>
> On Thu, Aug 14, 2014 at 02:49:45AM -0400, Chen, Gong wrote:
> > From: Tony Luck <tony.luck@intel.com>
> >
> > Handling of uncorrected, but s/w recoverable, errors is a two-phase process:
> >
> > 1) In the machine check handler we identify the signature of a recoverable
> >    error and use TIF_MCE_NOTIFY to mark the process so that we will not
> >    return to ring3 and re-execute the instruction that accessed the failed
> >    memory location.
> > 2) In process context we see TIF_MCE_NOTIFY set and call mce_notify_process()
> >    to do all the recovery work that we could not do in machine check context.
> >
> > We need to pass the memory address and a flag to say whether the faulting
> > instruction can be continued from stage "1" to stage "2".  The existing
> > code to do this is baroque.  We hunt through a fixed-size array of
> > "mce_info" structures and use atomic_cmpxchg to claim a free one (we
> > panic if all are busy). Then in phase 2 we scan back through the mce_info
> > array looking for the one that matches the current process.
> >
> > We can do much better by admitting that the failing address found in
> > phase 1 is a property of the current task. We can only recover from
> > errors detected while trying to retire an instruction in ring3 (s/w
> > limitation). So we know that MCE_AR_SEVERITY means that we are running
> > on the cpu that was executing user code and hit an error that triggered
> > this machine check.  This is the reason why TIF_MCE_NOTIFY has been set,
> > and the memory error must be dealt with before we can either allow this
> > process to continue (if we can replace the page) or not (signalling the
> > process if the memory has been lost).
> >
> > So we add two fields to the task structure to be saved in phase 1 and
> > retrieved in phase 2.  The mce_info[] array and routines to add/find/clear
> > entries from it can all be deleted.
> >
> > Gong: Add missed CONFIG_MEMORY_FAILURE & '\n' to the end of printk
> > message, while at it.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> > Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
> > Acked-by: Borislav Petkov <bp@suse.de>
> > ---
> >  arch/x86/include/asm/mce.h       |  2 ++
> >  arch/x86/kernel/cpu/mcheck/mce.c | 67 +++++++---------------------------------
> >  arch/x86/kernel/signal.c         |  2 +-
> >  include/linux/sched.h            |  4 +++
> >  4 files changed, 19 insertions(+), 56 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> > index a713dabd000b..76e706767655 100644
> > --- a/arch/x86/include/asm/mce.h
> > +++ b/arch/x86/include/asm/mce.h
> > @@ -187,7 +187,9 @@ enum mcp_flags {
> >  void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
> >
> >  int mce_notify_irq(void);
> > +#ifdef CONFIG_MEMORY_FAILURE
> >  void mce_notify_process(void);
> > +#endif
> >
> >  DECLARE_PER_CPU(struct mce, injectm);
> >
> > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> > index 0e64a3c69b2b..b0a25d5f7185 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -928,51 +928,6 @@ static void mce_clear_state(unsigned long *toclear)
> >  }
> >
> >  /*
> > - * Need to save faulting physical address associated with a process
> > - * in the machine check handler some place where we can grab it back
> > - * later in mce_notify_process()
> > - */
> > -#define      MCE_INFO_MAX    16
> > -
> > -struct mce_info {
> > -     atomic_t                inuse;
> > -     struct task_struct      *t;
> > -     __u64                   paddr;
> > -     int                     restartable;
> > -} mce_info[MCE_INFO_MAX];
> > -
> > -static void mce_save_info(__u64 addr, int c)
> > -{
> > -     struct mce_info *mi;
> > -
> > -     for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++) {
> > -             if (atomic_cmpxchg(&mi->inuse, 0, 1) == 0) {
> > -                     mi->t = current;
> > -                     mi->paddr = addr;
> > -                     mi->restartable = c;
> > -                     return;
> > -             }
> > -     }
> > -
> > -     mce_panic("Too many concurrent recoverable errors", NULL, NULL);
> > -}
> > -
> > -static struct mce_info *mce_find_info(void)
> > -{
> > -     struct mce_info *mi;
> > -
> > -     for (mi = mce_info; mi < &mce_info[MCE_INFO_MAX]; mi++)
> > -             if (atomic_read(&mi->inuse) && mi->t == current)
> > -                     return mi;
> > -     return NULL;
> > -}
> > -
> > -static void mce_clear_info(struct mce_info *mi)
> > -{
> > -     atomic_set(&mi->inuse, 0);
> > -}
> > -
> > -/*
> >   * The actual machine check handler. This only handles real
> >   * exceptions when something got corrupted coming in through int 18.
> >   *
> > @@ -1121,11 +1076,15 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> >       if (cfg->tolerant < 3) {
> >               if (no_way_out)
> >                       mce_panic("Fatal machine check on current CPU", &m, msg);
> > +#ifdef CONFIG_MEMORY_FAILURE
> >               if (worst == MCE_AR_SEVERITY) {
> >                       /* schedule action before return to userland */
> > -                     mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
> > +                     current->paddr = m.addr;
> > +                     current->restartable = !!(m.mcgstatus & MCG_STATUS_RIPV);
> >                       set_thread_flag(TIF_MCE_NOTIFY);
> > -             } else if (kill_it) {
> > +             } else
> > +#endif
> > +             if (kill_it) {
> >                       force_sig(SIGBUS, current);
> >               }
> >       }
> > @@ -1159,33 +1118,31 @@ int memory_failure(unsigned long pfn, int vector, int flags)
> >   * process any corrupted pages, and kill/signal current process if required.
> >   * Action required errors are handled here.
> >   */
> > +#ifdef CONFIG_MEMORY_FAILURE
> >  void mce_notify_process(void)
> >  {
> >       unsigned long pfn;
> > -     struct mce_info *mi = mce_find_info();
> >       int flags = MF_ACTION_REQUIRED;
> >
> > -     if (!mi)
> > -             mce_panic("Lost physical address for unconsumed uncorrectable error", NULL, NULL);
> > -     pfn = mi->paddr >> PAGE_SHIFT;
> > +     pfn = current->paddr >> PAGE_SHIFT;
> >
> >       clear_thread_flag(TIF_MCE_NOTIFY);
> >
> > -     pr_err("Uncorrected hardware memory error in user-access at %llx",
> > -              mi->paddr);
> > +     pr_err("Uncorrected hardware memory error in user-access at %llx\n",
> > +              current->paddr);
> >       /*
> >        * We must call memory_failure() here even if the current process is
> >        * doomed. We still need to mark the page as poisoned and alert any
> >        * other users of the page.
> >        */
> > -     if (!mi->restartable)
> > +     if (!current->restartable)
> >               flags |= MF_MUST_KILL;
> >       if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
> >               pr_err("Memory error not recovered");
> >               force_sig(SIGBUS, current);
> >       }
> > -     mce_clear_info(mi);
> >  }
> > +#endif
> >
> >  /*
> >   * Action optional processing happens here (picking up
> > diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> > index 2851d63c1202..28fd70560b96 100644
> > --- a/arch/x86/kernel/signal.c
> > +++ b/arch/x86/kernel/signal.c
> > @@ -735,7 +735,7 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
> >  {
> >       user_exit();
> >
> > -#ifdef CONFIG_X86_MCE
> > +#ifdef CONFIG_MEMORY_FAILURE
> >       /* notify userspace of pending MCEs */
> >       if (thread_info_flags & _TIF_MCE_NOTIFY)
> >               mce_notify_process();
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 857ba40426ba..b41e839b9e1e 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1645,6 +1645,10 @@ struct task_struct {
> >       unsigned int    sequential_io;
> >       unsigned int    sequential_io_avg;
> >  #endif
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +     __u64   paddr;
> > +     int     restartable;
> > +#endif
> >  };
> >
> >  /* Future-safe accessor for struct task_struct's cpus_allowed. */
> > --
> > 2.0.0.rc2
> >
> >
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 15:42     ` Andy Lutomirski
@ 2014-11-11 16:13       ` Borislav Petkov
  2014-11-11 16:22         ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2014-11-11 16:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 07:42:48AM -0800, Andy Lutomirski wrote:
> The last time I looked at the MCE code, I got a bit lost in the
> control flow.  Is there ever a userspace-killing MCE that's delivered
> from kernel mode?

Yep, so while you're executing a userspace process, you get
an #MC raised which reports an error for which action is
required, i.e. look at all those MCE_AR_SEVERITY errors in
arch/x86/kernel/cpu/mcheck/mce-severity.c.

It happened within the context of current so we go and run the #MC
handler which decides that the process needs to be killed in order to
contain the error. So after we exit the handler and before we return to
try to sched in the process again on any core, we want to actually kill
it and poison all its memory.

> By that, I mean that I think that all userspace-killing MCEs have
> user_mode_vm(regs) and go through paranoid_exit.

Yes.

> If so, why do you need to jump through hoops at all?  You can't call
> do_exit, but it should be completely safe to force a fatal signal and
> let the scheduler and signal code take care of killing the process,
> right?  For that matter, you should also be able to poke at vm
> structures, etc.

Well, we do that already. memory-failure.c does kill the processes when
it decides to.

The only question is whether adding two new members to task_struct is
ok. It is nicely convenient and it all falls into place.

In the #MC handler we do:

 		if (worst == MCE_AR_SEVERITY) {
 			/* schedule action before return to userland */
+			current->paddr = m.addr;
+			current->restartable = !!(m.mcgstatus & MCG_STATUS_RIPV);
			set_thread_flag(TIF_MCE_NOTIFY);
		}

and then before we return to userspace we do:

+	if (!current->restartable)
 		flags |= MF_MUST_KILL;
 	if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {

and the MF_MUST_KILL makes sure memory_failure() does a force_sig().

So I think this is ok, I only think that people might oppose the two new
members to task_struct but it looks clean to me this way. IMHO at least.
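[Editor's note] As an aside, the `pfn = current->paddr >> PAGE_SHIFT` step quoted above is plain page arithmetic; a minimal sketch, assuming 4 KiB pages:

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* memory_failure() poisons at page granularity, so only the page
 * frame number of the faulting physical address matters; the offset
 * within the page is discarded. */
static uint64_t addr_to_pfn(uint64_t paddr)
{
	return paddr >> PAGE_SHIFT;
}

static uint64_t addr_to_offset(uint64_t paddr)
{
	return paddr & (PAGE_SIZE - 1);
}
```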

> Or is there a meaningful case where mce_notify_process needs to help
> with recovery but the original MCE happened with !user_mode_vm(regs)?

Well, for the !user_mode_vm(regs) case we panic anyway.

Thanks Andy.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 16:13       ` Borislav Petkov
@ 2014-11-11 16:22         ` Andy Lutomirski
  2014-11-11 16:30           ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2014-11-11 16:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 8:13 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Nov 11, 2014 at 07:42:48AM -0800, Andy Lutomirski wrote:
>> The last time I looked at the MCE code, I got a bit lost in the
>> control flow.  Is there ever a userspace-killing MCE that's delivered
>> from kernel mode?
>
> Yep, so while you're executing a userspace process, you get
> an #MC raised which reports an error for which action is
> required, i.e. look at all those MCE_AR_SEVERITY errors in
> arch/x86/kernel/cpu/mcheck/mce-severity.c.
>
> It happened within the context of current so we go and run the #MC
> handler which decides that the process needs to be killed in order to
> contain the error. So after we exit the handler and before we return to
> try to sched in the process again on any core, we want to actually kill
> it and poison all its memory.
>
>> By that, I mean that I think that all userspace-killing MCEs go have
>> user_mode_vm(regs) and go through paranoid_exit.
>
> Yes.
>
>> If so, why do you need to jump through hoops at all?  You can't call
>> do_exit, but it should be completely safe to force a fatal signal and
>> let the scheduler and signal code take care of killing the process,
>> right?  For that matter, you should also be able to poke at vm
>> structures, etc.
>
> Well, we do that already. memory-failure.c does kill the processes when
> it decides to.
>
> The only question is whether adding two new members to task_struct is
> ok. It is nicely convenient and it all falls into place.
>
> In the #MC handler we do:
>
>                 if (worst == MCE_AR_SEVERITY) {
>                         /* schedule action before return to userland */
> +                       current->paddr = m.addr;
> +                       current->restartable = !!(m.mcgstatus & MCG_STATUS_RIPV);
>                         set_thread_flag(TIF_MCE_NOTIFY);
>                 }
>
> and then before we return to userspace we do:
>
> +       if (!current->restartable)
>                 flags |= MF_MUST_KILL;
>         if (memory_failure(pfn, MCE_VECTOR, flags) < 0) {
>
> and the MF_MUST_KILL makes sure memory_failure() does a force_sig().
>
> So I think this is ok, I only think that people might oppose the two new
> members to task_struct but it looks clean to me this way. IMHO at least.
>

I think it's okay-ish, but only if it's necessary, and I still don't
see why it's necessary.

Can't you just remove TIF_MCE_NOTIFY entirely and just do all the
mce_notify_process work directly in do_machine_check?  IOW, why do you
need to store any state per-task when it's already on the stack
anyway.

Or am I missing something here?

--Andy

>> Or is there a meaningful case where mce_notify_process needs to help
>> with recovery but the original MCE happened with !user_mode_vm(regs)?
>
> Well, for the !user_mode_vm(regs) case we panic anyway.
>
> Thanks Andy.
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 16:22         ` Andy Lutomirski
@ 2014-11-11 16:30           ` Borislav Petkov
  2014-11-11 17:15             ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2014-11-11 16:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 08:22:45AM -0800, Andy Lutomirski wrote:
> I think it's okay-ish, but only if it's necessary, and I still don't
> see why it's necessary.
> 
> Can't you just remove TIF_MCE_NOTIFY entirely and just do all the
> mce_notify_process work directly in do_machine_check?  IOW, why do you
> need to store any state per-task when it's already on the stack
> anyway.

I wish, but memory_failure() can't run in #MC context as it noodles
quite a lot and grabs all kinds of locks and does a bunch of other
atomic-context-unsafe things.

And it needs to run *before* the process is killed as it looks at its
pages.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 16:30           ` Borislav Petkov
@ 2014-11-11 17:15             ` Andy Lutomirski
  2014-11-11 18:22               ` Borislav Petkov
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2014-11-11 17:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 8:30 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Nov 11, 2014 at 08:22:45AM -0800, Andy Lutomirski wrote:
>> I think it's okay-ish, but only if it's necessary, and I still don't
>> see why it's necessary.
>>
>> Can't you just remove TIF_MCE_NOTIFY entirely and just do all the
>> mce_notify_process work directly in do_machine_check?  IOW, why do you
>> need to store any state per-task when it's already on the stack
>> anyway.
>
> I wish, but memory_failure() can't run in #MC context as it noodles
> quite a lot and grabs all kinds of locks and does a bunch of other
> atomic-context-unsafe things.

Oh -- does it need to sleep?

I find myself wondering whether a much cleaner solution might be to
sync regs and switch stacks before invoking do_machine_check rather
than afterwards.  Then do_machine_check would really be completely
non-atomic.  It would add a few lines of asm, though.

--Andy

>
> And it needs to run *before* the process is killed as it looks at its
> pages.
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 17:15             ` Andy Lutomirski
@ 2014-11-11 18:22               ` Borislav Petkov
  2014-11-11 18:34                 ` Andy Lutomirski
  0 siblings, 1 reply; 19+ messages in thread
From: Borislav Petkov @ 2014-11-11 18:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 09:15:18AM -0800, Andy Lutomirski wrote:
> Oh -- does it need to sleep?

Oh yeah, iterates over all tasks and generally async access to VM stuff.
You can have a look if bored: mm/memory-failure.c

:-)

> I find myself wondering whether a much cleaner solution might be to
> sync regs and switch stacks before invoking do_machine_check rather
> than afterwards.  Then do_machine_check would really be completely
> non-atomic.  It would add a few lines of asm, though.

That's like wagging the dog.

I'd prefer much more to actually have a mechanism to do something to
tasks before scheduling them in.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 18:22               ` Borislav Petkov
@ 2014-11-11 18:34                 ` Andy Lutomirski
  2014-11-11 18:38                   ` Borislav Petkov
  2014-11-11 20:10                   ` Andy Lutomirski
  0 siblings, 2 replies; 19+ messages in thread
From: Andy Lutomirski @ 2014-11-11 18:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 10:22 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Nov 11, 2014 at 09:15:18AM -0800, Andy Lutomirski wrote:
>> Oh -- does it need to sleep?
>
> Oh yeah, iterates over all tasks and generally async access to VM stuff.
> You can have a look if bored: mm/memory-failure.c
>
> :-)
>
>> I find myself wondering whether a much cleaner solution might be to
>> sync regs and switch stacks before invoking do_machine_check rather
>> than afterwards.  Then do_machine_check would really be completely
>> non-atomic.  It would add a few lines of asm, though.
>
> That's like wagging the dog.

Wagging this particular dog might shake some loose fur off.  If we did
this for all paranoid entries, then paranoid_userspace could just be
deleted.  :)

>
> I'd prefer much more to actually have a mechanism to do something to
> tasks before scheduling them in.

Like task_work_add?

--Andy
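
[Editor's note] For readers unfamiliar with it: the task_work facility (task_work_add()/task_work_run() in kernel/task_work.c) queues callbacks on a task and drains them at a safe point before the task returns to userspace. A userspace sketch of that pattern — illustrative names, not the kernel API:

```c
#include <stddef.h>

/* Toy version of the task_work pattern: callbacks queued on a task,
 * run and drained at a safe point before the task resumes. */
struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

struct task {
	struct callback_head *works;	/* LIFO list of pending work */
};

static void my_task_work_add(struct task *t, struct callback_head *w,
			     void (*func)(struct callback_head *))
{
	w->func = func;
	w->next = t->works;	/* the real kernel uses cmpxchg here */
	t->works = w;
}

static int my_task_work_run(struct task *t)
{
	int n = 0;

	while (t->works) {
		struct callback_head *w = t->works;

		t->works = w->next;
		w->func(w);	/* e.g. this is where memory_failure()
				 * could safely run, in process context */
		n++;
	}
	return n;
}

static int ran;		/* demo callback just counts invocations */
static void count_cb(struct callback_head *w)
{
	(void)w;
	ran++;
}
```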


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 18:34                 ` Andy Lutomirski
@ 2014-11-11 18:38                   ` Borislav Petkov
  2014-11-11 20:10                   ` Andy Lutomirski
  1 sibling, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2014-11-11 18:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 10:34:46AM -0800, Andy Lutomirski wrote:
> Wagging this particular dog might shake some loose fur off.  If we did
> this for all paranoid entries, then paranoid_userspace could just be
> deleted.  :)

:-)

> > I'd prefer much more to actually have a mechanism to do something to
> > tasks before scheduling them in.
> 
> Like task_work_add?

Ah, there *is* something like that! I knew we would have something - we
always do :-)

Thanks.

/me goes to stare at it.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 18:34                 ` Andy Lutomirski
  2014-11-11 18:38                   ` Borislav Petkov
@ 2014-11-11 20:10                   ` Andy Lutomirski
  1 sibling, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2014-11-11 20:10 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Chen Gong, X86 ML, Peter Zijlstra, Oleg Nesterov, Tony Luck,
	linux-kernel

On Tue, Nov 11, 2014 at 10:34 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Tue, Nov 11, 2014 at 10:22 AM, Borislav Petkov <bp@alien8.de> wrote:
>> On Tue, Nov 11, 2014 at 09:15:18AM -0800, Andy Lutomirski wrote:
>>> Oh -- does it need to sleep?
>>
>> Oh yeah, iterates over all tasks and generally async access to VM stuff.
>> You can have a look if bored: mm/memory-failure.c
>>
>> :-)
>>
>>> I find myself wondering whether a much cleaner solution might be to
>>> sync regs and switch stacks before invoking do_machine_check rather
>>> than afterwards.  Then do_machine_check would really be completely
>>> non-atomic.  It would add a few lines of asm, though.
>>
>> That's like wagging the dog.
>
> Wagging this particular dog might shake some loose fur off.  If we did
> this for all paranoid entries, then paranoid_userspace could just be
> deleted.  :)

Andi is about to make this dog even messier.  I think I like my
suggestion more and more.

--Andy


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-11 11:42   ` Borislav Petkov
  2014-11-11 15:42     ` Andy Lutomirski
@ 2014-11-12 17:20     ` Oleg Nesterov
  2014-11-12 17:25       ` Andy Lutomirski
  2014-11-12 17:27       ` Borislav Petkov
  1 sibling, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-11-12 17:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Chen, Gong, tony.luck, x86, Peter Zijlstra,
	Andy Lutomirski

Sorry, I am a bit confused...

On 11/11, Borislav Petkov wrote:
>
> Roughly speaking, we want to be able to mark a task with the sign of
> death and to kill it, if needed.

"it" is current, yes?

So I agree with Andy, task_work_add() can work and you can also pass
paddr/restartable to the handler.

But,

> The important part is *before* it
> gets to run again.

But it is already running? Perhaps you meant "before it returns to
user-mode" ?

Oleg.
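
The deferred-work pattern being proposed here — queue a callback from the
#MC handler, then run it at the return-to-user boundary where sleeping and
killing are safe — can be sketched as a plain userspace analogue. Everything
below (the struct layouts, `memory_failure_cb()`, `simulate_mce_flow()`) is
invented for illustration and is not the kernel's actual task_work
implementation; it only mirrors the shape of the mechanism, including
passing paddr/restartable to the handler as suggested above:

```c
#include <stdio.h>

/* Userspace analogue of the task_work_add() pattern discussed in this
 * thread: the #MC handler runs in a restricted context where it must not
 * sleep, so it only queues a callback on the current task; the callback
 * runs later, at the return-to-user-mode boundary, where killing the
 * task is safe.  All names below are invented for illustration. */

typedef void (*work_fn)(unsigned long paddr, int restartable);

struct task_work {
	work_fn fn;
	unsigned long paddr;	/* physical address of the bad page */
	int restartable;	/* can the faulting instruction restart? */
	struct task_work *next;
};

struct task {
	struct task_work *works; /* pending callbacks, LIFO */
};

/* Queue-only, so safe from the (simulated) machine-check handler. */
static void task_work_add(struct task *t, struct task_work *tw)
{
	tw->next = t->works;
	t->works = tw;
}

/* Run at the simulated return-to-user boundary: safe context. */
static void task_work_run(struct task *t)
{
	struct task_work *tw = t->works;

	t->works = NULL;
	while (tw) {
		struct task_work *next = tw->next;

		tw->fn(tw->paddr, tw->restartable);
		tw = next;
	}
}

static int task_killed;

static void memory_failure_cb(unsigned long paddr, int restartable)
{
	/* The kernel would call memory_failure() / send SIGBUS here. */
	printf("deferred handling: paddr=%#lx restartable=%d\n",
	       paddr, restartable);
	if (!restartable)
		task_killed = 1;
}

/* Drives one fault: queue in "#MC context", drain at "return to user". */
int simulate_mce_flow(void)
{
	struct task current_task = { 0 };
	struct task_work tw = { memory_failure_cb, 0x1000, 0, NULL };

	task_killed = 0;
	task_work_add(&current_task, &tw);	/* nothing runs yet */
	task_work_run(&current_task);		/* deferred kill happens here */
	return task_killed;
}
```

Note how the paddr/restartable pair travels with the queued work item, so
the deferred handler has everything it needs without touching #MC state.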



* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-12 17:20     ` Oleg Nesterov
@ 2014-11-12 17:25       ` Andy Lutomirski
  2014-11-12 17:27       ` Borislav Petkov
  1 sibling, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2014-11-12 17:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Borislav Petkov, linux-kernel, Chen, Gong, Tony Luck, X86 ML,
	Peter Zijlstra

On Wed, Nov 12, 2014 at 9:20 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> Sorry, I am a bit confused...
>
> On 11/11, Borislav Petkov wrote:
>>
>> Roughly speaking, we want to be able to mark a task with the sign of
>> death and to kill it, if needed.
>
> "it" is current, yes?

This part I'm not sure about due to MCE broadcast.  "It" is current
for exactly one cpu that's in do_machine_check.

>
> So I agree with Andy, task_work_add() can work and you can also pass
> paddr/restartable to the handler.
>
> But,
>
>> The important part is *before* it
>> gets to run again.
>
> But it is already running? Perhaps you meant "before it returns to
> user-mode" ?
>

Right.  But killing it in the do_exit sense from do_machine_check is
currently impossible because do_machine_check runs on an IST stack.
With my patch, if do_machine_check sees user_mode_vm(regs) and
CONFIG_X86_64, then it will be running on the real task stack, so it
can enable interrupts, kill the task however it wants, etc.  This was
the original motivation for my patch.

--Andy


* Re: [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors
  2014-11-12 17:20     ` Oleg Nesterov
  2014-11-12 17:25       ` Andy Lutomirski
@ 2014-11-12 17:27       ` Borislav Petkov
  1 sibling, 0 replies; 19+ messages in thread
From: Borislav Petkov @ 2014-11-12 17:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Chen, Gong, tony.luck, x86, Peter Zijlstra,
	Andy Lutomirski

On Wed, Nov 12, 2014 at 06:20:58PM +0100, Oleg Nesterov wrote:
> Sorry, I am a bit confused...
> 
> On 11/11, Borislav Petkov wrote:
> >
> > Roughly speaking, we want to be able to mark a task with the sign of
> > death and to kill it, if needed.
> 
> "it" is current, yes?

Yes.

> So I agree with Andy, task_work_add() can work and you can also pass
> paddr/restartable to the handler.

Yes, and it is much simpler than changing the entry.S glue to switch to
the kernel stack for our purposes.

> But,
> 
> > The important part is *before* it
> > gets to run again.
> 
> But it is already running? Perhaps you meant "before it returns to
> user-mode" ?

Right, it runs; while it does so, it triggers an MCE because it touches
some corrupted memory. We jump to the #MC handler and, before we return
to user space, we kill that process.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


end of thread, other threads:[~2014-11-12 17:27 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-14  6:49 x86, MCE: MCE event ring management Chen, Gong
2014-08-14  6:49 ` [PATCH v2 1/5] x86, MCE: Provide a lock-less memory pool to save error record Chen, Gong
2014-08-14  6:49 ` [PATCH v2 2/5] x86, MCE: Don't use percpu for MCE workqueue/irq_work Chen, Gong
2014-08-14  6:49 ` [PATCH v2 3/5] x86, MCE: Remove mce_ring for SRAO error Chen, Gong
2014-08-14  6:49 ` [PATCH v2 4/5] x86/mce: Simplify flow when handling recoverable memory errors Chen, Gong
2014-11-11 11:42   ` Borislav Petkov
2014-11-11 15:42     ` Andy Lutomirski
2014-11-11 16:13       ` Borislav Petkov
2014-11-11 16:22         ` Andy Lutomirski
2014-11-11 16:30           ` Borislav Petkov
2014-11-11 17:15             ` Andy Lutomirski
2014-11-11 18:22               ` Borislav Petkov
2014-11-11 18:34                 ` Andy Lutomirski
2014-11-11 18:38                   ` Borislav Petkov
2014-11-11 20:10                   ` Andy Lutomirski
2014-11-12 17:20     ` Oleg Nesterov
2014-11-12 17:25       ` Andy Lutomirski
2014-11-12 17:27       ` Borislav Petkov
2014-08-14  6:49 ` [PATCH v2 5/5] x86, MCE: Avoid potential deadlock in MCE context Chen, Gong
