All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] x86 mce fixes
@ 2014-07-09 17:09 Havard Skinnemoen
  2014-07-09 17:09 ` [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values Havard Skinnemoen
                   ` (5 more replies)
  0 siblings, 6 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 17:09 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: linux-kernel, Havard Skinnemoen, Ewout van Bekkum

The following series contains a few fixes we came up with while testing the MCE
handling on our servers in the lab. These should fix the following symptoms:

  - Once entering CMCI storm mode, we would never exit. This was because we set
    the check_interval to be shorter than 30 seconds, so the condition to exit
    storm mode could never become true.
  - After a storm, the MCE banks previously handled by a CPU could not be
    reclaimed.
  - After a kexec reboot, none of the MCE banks could be claimed by any CPU.
  - Duplicate MCEs were being reported in some circumstances (e.g. with
    mce=no_cmci and/or mce=3).
  - Crashes because the polling timer was added multiple times.

We're not sure if these patches are the best way to fix these issues, and they
may introduce new, subtle bugs, but it's the best we managed to come up with.
Please take a good look and tell us what we got wrong.

Ewout did all the leg work in getting this implemented and tested, while I've
been providing advice and reviews.

Signed-off-by: Ewout van Bekkum <ewout@google.com>
Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>

Ewout van Bekkum (6):
  x86-mce: Modify CMCI poll interval to adjust for small check_interval
    values.
  x86-mce: Modify CMCI storm exit to reenable instead of rediscover
    banks.
  x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  x86-mce: check if no_way_out applies before deciding not to clear MCE
    banks.
  x86-mce: ensure the MCP timer is not already set in the mce_timer_fn.

 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +
 arch/x86/kernel/cpu/mcheck/mce.c          | 39 +++++++++++--
 arch/x86/kernel/cpu/mcheck/mce_intel.c    | 95 ++++++++++++++++++++++++-------
 3 files changed, 111 insertions(+), 25 deletions(-)

-- 
2.0.0.526.g5318336


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-09 17:09 [PATCH 0/6] x86 mce fixes Havard Skinnemoen
@ 2014-07-09 17:09 ` Havard Skinnemoen
  2014-07-09 19:17   ` Borislav Petkov
  2014-07-09 17:09 ` [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks Havard Skinnemoen
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 17:09 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: linux-kernel, Ewout van Bekkum, Havard Skinnemoen

From: Ewout van Bekkum <ewout@google.com>

The CMCI poll interval was updated to pick the minimum interval between
the original 30 seconds and the check_interval divided by 8 (minimum of
3 polls).

This resolves a bug where the CMCI storm handler is unable to return to
interrupt mode from polling mode, if the check_interval shorter than the
CMCI poll interval. This problem is caused by the mce_timer_fn function
which only allows the poll interval to be incremented up to the
check_interval, while the mce_intel_adjust_timer function requires the
poll interval to be greater than the CMCI poll interval before leaving
the CMCI_STORM_ACTIVE state.

Signed-off-by: Ewout van Bekkum <ewout@google.com>
Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
---
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  1 +
 arch/x86/kernel/cpu/mcheck/mce.c          |  5 +++++
 arch/x86/kernel/cpu/mcheck/mce_intel.c    | 15 +++++++++++----
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 09edd0b..2f0b1e8 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -40,6 +40,7 @@ static inline void cmci_disable_bank(int bank) { }
 #endif
 
 void mce_timer_kick(unsigned long interval);
+unsigned long current_check_interval(void);
 
 #ifdef CONFIG_ACPI_APEI
 int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bb92f38..1ebdd34 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1265,6 +1265,11 @@ void mce_log_therm_throt_event(__u64 status)
  */
 static unsigned long check_interval = 5 * 60; /* 5 minutes */
 
+unsigned long current_check_interval(void)
+{
+	return check_interval;
+}
+
 static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
 static DEFINE_PER_CPU(struct timer_list, mce_timer);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 9a316b2..26eb8d3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -45,10 +45,17 @@ static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
 static DEFINE_SPINLOCK(cmci_discover_lock);
 
 #define CMCI_THRESHOLD		1
-#define CMCI_POLL_INTERVAL	(30 * HZ)
 #define CMCI_STORM_INTERVAL	(1 * HZ)
 #define CMCI_STORM_THRESHOLD	15
 
+/*
+ * Poll every 30 seconds unless the current check_interval / 8 is smaller.
+ */
+static unsigned long cmci_poll_interval(void)
+{
+	return min(30UL * HZ, current_check_interval() * HZ / 8);
+}
+
 static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
 static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
 static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
@@ -101,7 +108,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
 {
 	int r;
 
-	if (interval < CMCI_POLL_INTERVAL)
+	if (interval < cmci_poll_interval())
 		return interval;
 
 	switch (__this_cpu_read(cmci_storm_state)) {
@@ -128,7 +135,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
 			cmci_reenable();
 			cmci_recheck();
 		}
-		return CMCI_POLL_INTERVAL;
+		return cmci_poll_interval();
 	default:
 		/*
 		 * We have shiny weather. Let the poll do whatever it
@@ -178,7 +185,7 @@ static bool cmci_storm_detect(void)
 	cmci_storm_disable_banks();
 	__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
 	r = atomic_add_return(1, &cmci_storm_on_cpus);
-	mce_timer_kick(CMCI_POLL_INTERVAL);
+	mce_timer_kick(cmci_poll_interval());
 
 	if (r == 1)
 		pr_notice("CMCI storm detected: switching to poll mode\n");
-- 
2.0.0.526.g5318336


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks.
  2014-07-09 17:09 [PATCH 0/6] x86 mce fixes Havard Skinnemoen
  2014-07-09 17:09 ` [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values Havard Skinnemoen
@ 2014-07-09 17:09 ` Havard Skinnemoen
  2014-07-09 20:20   ` Luck, Tony
  2014-07-09 17:09 ` [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot Havard Skinnemoen
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 17:09 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: linux-kernel, Ewout van Bekkum, Havard Skinnemoen

From: Ewout van Bekkum <ewout@google.com>

The CMCI storm handler previously called cmci_reenable() when exiting a
CMCI storm. However, when entering a CMCI storm the bank ownership was
not relinquished by the affected CPUs. The CMCIs were only disabled via
cmci_storm_disable_banks(). The handler was updated to instead call a
new function, cmci_storm_enable_banks(), to reenable CMCI on the already
owned banks instead of rediscovering CMCI banks (which were still owned
but disabled).

Signed-off-by: Ewout van Bekkum <ewout@google.com>
Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
---
 arch/x86/kernel/cpu/mcheck/mce_intel.c | 50 ++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 26eb8d3..d015daf 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -104,6 +104,38 @@ void mce_intel_hcpu_update(unsigned long cpu)
 	per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
 }
 
+static void cmci_storm_disable_banks(void)
+{
+	unsigned long flags, *owned;
+	int bank;
+	u64 val;
+
+	spin_lock_irqsave(&cmci_discover_lock, flags);
+	owned = __get_cpu_var(mce_banks_owned);
+	for_each_set_bit(bank, owned, MAX_NR_BANKS) {
+		rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+		val &= ~MCI_CTL2_CMCI_EN;
+		wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+	}
+	spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+static void cmci_storm_enable_banks(void)
+{
+	unsigned long flags, *owned;
+	int bank;
+	u64 val;
+
+	spin_lock_irqsave(&cmci_discover_lock, flags);
+	owned = __get_cpu_var(mce_banks_owned);
+	for_each_set_bit(bank, owned, MAX_NR_BANKS) {
+		rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+		val |= MCI_CTL2_CMCI_EN;
+		wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+	}
+	spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
 unsigned long mce_intel_adjust_timer(unsigned long interval)
 {
 	int r;
@@ -132,7 +164,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
 		 */
 		if (!atomic_read(&cmci_storm_on_cpus)) {
 			__this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
-			cmci_reenable();
+			cmci_storm_enable_banks();
 			cmci_recheck();
 		}
 		return cmci_poll_interval();
@@ -145,22 +177,6 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
 	}
 }
 
-static void cmci_storm_disable_banks(void)
-{
-	unsigned long flags, *owned;
-	int bank;
-	u64 val;
-
-	spin_lock_irqsave(&cmci_discover_lock, flags);
-	owned = __get_cpu_var(mce_banks_owned);
-	for_each_set_bit(bank, owned, MAX_NR_BANKS) {
-		rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
-		val &= ~MCI_CTL2_CMCI_EN;
-		wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
-	}
-	spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
 static bool cmci_storm_detect(void)
 {
 	unsigned int cnt = __this_cpu_read(cmci_storm_cnt);
-- 
2.0.0.526.g5318336


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-09 17:09 [PATCH 0/6] x86 mce fixes Havard Skinnemoen
  2014-07-09 17:09 ` [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values Havard Skinnemoen
  2014-07-09 17:09 ` [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks Havard Skinnemoen
@ 2014-07-09 17:09 ` Havard Skinnemoen
  2014-07-09 20:36   ` Luck, Tony
  2014-07-09 17:09 ` [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports Havard Skinnemoen
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 17:09 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: linux-kernel, Ewout van Bekkum, Havard Skinnemoen

From: Ewout van Bekkum <ewout@google.com>

The existing CMCI code did not call cmci_clear() on each CPU upon
reboot. This meant that a kexec would leave all the CMCI banks
unclaimable and with polling disabled as the cmci_discover() method
relies on the CMCI enable bit for each bank to determine if they can be
claimed.

Signed-off-by: Ewout van Bekkum <ewout@google.com>
Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
---
 arch/x86/kernel/cpu/mcheck/mce_intel.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index d015daf..36a1948 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -14,6 +14,8 @@
 #include <asm/processor.h>
 #include <asm/msr.h>
 #include <asm/mce.h>
+#include <linux/reboot.h>
+#include <linux/atomic.h>
 
 #include "mce-internal.h"
 
@@ -388,6 +390,31 @@ void cmci_disable_bank(int bank)
 	spin_unlock_irqrestore(&cmci_discover_lock, flags);
 }
 
+/*
+ * Fully disable CMCI on this CPU in case of shutdown.
+ */
+static void cmci_disable(void *dummy)
+{
+	cmci_clear();
+}
+
+static int cmci_reboot_event(struct notifier_block *this,
+			     unsigned long event, void *ptr)
+{
+	/*
+	 * Disable CMCI on this CPU for all banks it owns so CPUs after the
+	 * reboot can reclaim the banks through rediscovery.
+	 */
+	on_each_cpu(cmci_disable, NULL, 1);
+	return NOTIFY_OK;
+}
+
+static struct notifier_block cmci_reboot_notifier = {
+	.notifier_call = cmci_reboot_event,
+};
+
+static bool reboot_notifier_registered;
+
 static void intel_init_cmci(void)
 {
 	int banks;
@@ -395,6 +422,9 @@ static void intel_init_cmci(void)
 	if (!cmci_supported(&banks))
 		return;
 
+	if (!xchg(&reboot_notifier_registered, true))
+		register_reboot_notifier(&cmci_reboot_notifier);
+
 	mce_threshold_vector = intel_threshold_interrupt;
 	cmci_discover(banks);
 	/*
-- 
2.0.0.526.g5318336


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 17:09 [PATCH 0/6] x86 mce fixes Havard Skinnemoen
                   ` (2 preceding siblings ...)
  2014-07-09 17:09 ` [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot Havard Skinnemoen
@ 2014-07-09 17:09 ` Havard Skinnemoen
  2014-07-09 20:35   ` Andi Kleen
                     ` (2 more replies)
  2014-07-09 17:09 ` [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks Havard Skinnemoen
  2014-07-09 17:09 ` [PATCH 6/6] x86-mce: ensure the MCP timer is not already set in the mce_timer_fn Havard Skinnemoen
  5 siblings, 3 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 17:09 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: linux-kernel, Ewout van Bekkum, Havard Skinnemoen

From: Ewout van Bekkum <ewout@google.com>

machine_check_poll() was modified to use spin_lock_irqsave independently
per bank when a valid MCE is found to prevent duplicated MCE reports by
the CMCI and polling methods. In the common case no MCE will be found,
so the lock is not acquired until a valid MCE is found. The status is
reread after the lock is acquired in case the MCE was already handled by
a different thread. A unique spinlock is used per bank number, so
contention should be mostly limited to non-shared banks.

Signed-off-by: Ewout van Bekkum <ewout@google.com>
Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
---
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  1 +
 arch/x86/kernel/cpu/mcheck/mce.c          | 21 ++++++++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 2f0b1e8..aa6843a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -19,6 +19,7 @@ struct mce_bank {
 	unsigned char init;				/* initialise bank? */
 	struct device_attribute attr;			/* device attribute */
 	char			attrname[ATTR_LEN];	/* attribute name */
+	spinlock_t poll_spinlock;			/* lock for polling */
 };
 
 int mce_severity(struct mce *a, int tolerant, char **msg);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1ebdd34..64270d7 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -41,6 +41,8 @@
 #include <linux/debugfs.h>
 #include <linux/irq_work.h>
 #include <linux/export.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
 
 #include <asm/processor.h>
 #include <asm/mce.h>
@@ -596,6 +598,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 {
 	struct mce m;
 	int i;
+	unsigned long irq_flags;
 
 	this_cpu_inc(mce_poll_count);
 
@@ -617,14 +620,28 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 
 		this_cpu_write(mce_polled_error, 1);
 		/*
+		 * Optimize for the common case where no MCEs are found.
+		 */
+		spin_lock_irqsave(&mce_banks[i].poll_spinlock, irq_flags);
+		m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+		if (!(m.status & MCI_STATUS_VAL)) {
+			spin_unlock_irqrestore(&mce_banks[i].poll_spinlock,
+					       irq_flags);
+			continue;
+		}
+
+		/*
 		 * Uncorrected or signalled events are handled by the exception
 		 * handler when it is enabled, so don't process those here.
 		 *
 		 * TBD do the same check for MCI_STATUS_EN here?
 		 */
 		if (!(flags & MCP_UC) &&
-		    (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
+		    (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) {
+			spin_unlock_irqrestore(&mce_banks[i].poll_spinlock,
+					       irq_flags);
 			continue;
+		}
 
 		mce_read_aux(&m, i);
 
@@ -641,6 +658,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		 * Clear state for this bank.
 		 */
 		mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
+		spin_unlock_irqrestore(&mce_banks[i].poll_spinlock, irq_flags);
 	}
 
 	/*
@@ -1399,6 +1417,7 @@ static int __mcheck_cpu_mce_banks_init(void)
 
 		b->ctl = -1ULL;
 		b->init = 1;
+		spin_lock_init(&b->poll_spinlock);
 	}
 	return 0;
 }
-- 
2.0.0.526.g5318336


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks.
  2014-07-09 17:09 [PATCH 0/6] x86 mce fixes Havard Skinnemoen
                   ` (3 preceding siblings ...)
  2014-07-09 17:09 ` [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports Havard Skinnemoen
@ 2014-07-09 17:09 ` Havard Skinnemoen
  2014-07-09 21:00   ` Luck, Tony
  2014-07-09 17:09 ` [PATCH 6/6] x86-mce: ensure the MCP timer is not already set in the mce_timer_fn Havard Skinnemoen
  5 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 17:09 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: linux-kernel, Ewout van Bekkum, Havard Skinnemoen

From: Ewout van Bekkum <ewout@google.com>

The machine check handler, do_machine_check(), has a sanity check before
clearing the MCE banks in case the system has no_way_out and has to
crash.  However, this sanity check does not take into account the
configured MCE tolerant level as the system may still keep running. The
sanity check was updated to check if the system has no_way_out and that
no_way_out is relevant (tolerant level is less than 3).

Signed-off-by: Ewout van Bekkum <ewout@google.com>
Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 64270d7..1587b18 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1153,7 +1153,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 	/* mce_clear_state will clear *final, save locally for use later */
 	m = *final;
 
-	if (!no_way_out)
+	if (!(no_way_out && cfg->tolerant < 3))
 		mce_clear_state(toclear);
 
 	/*
-- 
2.0.0.526.g5318336


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 6/6] x86-mce: ensure the MCP timer is not already set in the mce_timer_fn.
  2014-07-09 17:09 [PATCH 0/6] x86 mce fixes Havard Skinnemoen
                   ` (4 preceding siblings ...)
  2014-07-09 17:09 ` [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks Havard Skinnemoen
@ 2014-07-09 17:09 ` Havard Skinnemoen
  2014-07-09 21:04   ` Luck, Tony
  5 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 17:09 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov
  Cc: linux-kernel, Ewout van Bekkum, Havard Skinnemoen

From: Ewout van Bekkum <ewout@google.com>

The machine check poll timer function, mce_timer_fn(), did not check
whether the MCP timer had already been set in case the function was
preempted by an interrupt. A CMCI interrupt could preempt this function
and enable the MCP timer through mce_timer_kick(). The race condition
was resolved by disabling interrupts during the mce_timer_fn() function
and by verifying the timer isn't already set before starting the timer.

Signed-off-by: Ewout van Bekkum <ewout@google.com>
Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 1587b18..c5e40cb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1310,10 +1310,13 @@ static void mce_timer_fn(unsigned long data)
 {
 	struct timer_list *t = &__get_cpu_var(mce_timer);
 	unsigned long iv;
+	unsigned long flags;
 	int notify;
 
 	WARN_ON(smp_processor_id() != data);
 
+	/* Ensure a CMCI interrupt can't preempt this. */
+	local_irq_save(flags);
 	if (mce_available(__this_cpu_ptr(&cpu_info))) {
 		machine_check_poll(MCP_TIMESTAMP,
 				&__get_cpu_var(mce_poll_banks));
@@ -1334,11 +1337,15 @@ static void mce_timer_fn(unsigned long data)
 		iv = mce_adjust_timer(iv);
 	}
 	__this_cpu_write(mce_next_interval, iv);
-	/* Might have become 0 after CMCI storm subsided */
-	if (iv) {
+
+	/* Ensure the timer isn't already set and the interval has not become
+	 * 0 after a CMCMI storm has subsided.
+	 */
+	if (!timer_pending(t) && iv) {
 		t->expires = jiffies + iv;
 		add_timer_on(t, smp_processor_id());
 	}
+	local_irq_restore(flags);
 }
 
 /*
-- 
2.0.0.526.g5318336


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-09 17:09 ` [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values Havard Skinnemoen
@ 2014-07-09 19:17   ` Borislav Petkov
  2014-07-09 21:24     ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-09 19:17 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, linux-kernel, Ewout van Bekkum

On Wed, Jul 09, 2014 at 10:09:21AM -0700, Havard Skinnemoen wrote:
> From: Ewout van Bekkum <ewout@google.com>
> 
> The CMCI poll interval was updated to pick the minimum interval between
> the original 30 seconds and the check_interval divided by 8 (minimum of
> 3 polls).

Why min 3 polls? How do you come up with exactly that frequency?

> This resolves a bug where the CMCI storm handler is unable to return to
> interrupt mode from polling mode, if the check_interval shorter than the
> CMCI poll interval. This problem is caused by the mce_timer_fn function
> which only allows the poll interval to be incremented up to the
> check_interval, while the mce_intel_adjust_timer function requires the
> poll interval to be greater than the CMCI poll interval before leaving
> the CMCI_STORM_ACTIVE state.

Interesting. So it seems you guys want to set the check_interval to
something < 30 secs.

Out of curiosity, what is your use case which requires such small
check_interval setting?

Maybe we need to redesign and simplify this intervals thing to make it
more user-friendly...

Btw, on a related note, we're working on a small mechanism which
collects correctable errors in the kernel and when a certain count for a
physical error address has been reached, we soft-offline that page. We'd
appreciate it if you guys took a look and told us whether it makes sense
to you:

http://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp@alien8.de

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks.
  2014-07-09 17:09 ` [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks Havard Skinnemoen
@ 2014-07-09 20:20   ` Luck, Tony
  2014-07-09 21:34     ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2014-07-09 20:20 UTC (permalink / raw)
  To: Havard Skinnemoen, Borislav Petkov; +Cc: linux-kernel, Ewout van Bekkum

> The CMCI storm handler previously called cmci_reenable() when exiting a
> CMCI storm. However, when entering a CMCI storm the bank ownership was
> not relinquished by the affected CPUs. The CMCIs were only disabled via
> cmci_storm_disable_banks(). The handler was updated to instead call a
> new function, cmci_storm_enable_banks(), to reenable CMCI on the already
> owned banks instead of rediscovering CMCI banks (which were still owned
> but disabled).

Won't this cause problems if we online a cpu during the storm. We will
re-run the discovery algorithm and some other cpu that shares the bank
will see MCi_CTL2{30} is zero and claim ownership.

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 17:09 ` [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports Havard Skinnemoen
@ 2014-07-09 20:35   ` Andi Kleen
  2014-07-09 21:51     ` Havard Skinnemoen
  2014-07-09 20:47   ` Luck, Tony
  2014-07-10 16:41   ` Borislav Petkov
  2 siblings, 1 reply; 61+ messages in thread
From: Andi Kleen @ 2014-07-09 20:35 UTC (permalink / raw)
  To: Havard Skinnemoen
  Cc: Tony Luck, Borislav Petkov, linux-kernel, Ewout van Bekkum

Havard Skinnemoen <hskinnemoen@google.com> writes:

> machine_check_poll() was modified to use spin_lock_irqsave independently
> per bank when a valid MCE is found to prevent duplicated MCE reports by
> the CMCI and polling methods. In the common case no MCE will be found,
> so the lock is not acquired until a valid MCE is found. The status is
> reread after the lock is acquired in case the MCE was already handled by
> a different thread. A unique spinlock is used per bank number, so
> contention should be mostly limited to non-shared banks.

This doesn't make sense. Banks are either owned by CMCI or by poll,
not by both. If you have true duplicates the bug must be somewhere else.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-09 17:09 ` [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot Havard Skinnemoen
@ 2014-07-09 20:36   ` Luck, Tony
  2014-07-09 21:40     ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2014-07-09 20:36 UTC (permalink / raw)
  To: Havard Skinnemoen, Borislav Petkov; +Cc: linux-kernel, Ewout van Bekkum

+	if (!xchg(&reboot_notifier_registered, true))
+		register_reboot_notifier(&cmci_reboot_notifier);

This is super-safe ... but isn't the xchg() overkill? I thought we serialized bringup
of other cpus.

Has this "do it once" caught on elsewhere in the kernel ... I suppose it is more concise than

	if (!reboot_notifier_registered) {
		reboot_notifier_registered = 1;
		register_reboot_notifier(&cmci_reboot_notifier);
	}

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 17:09 ` [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports Havard Skinnemoen
  2014-07-09 20:35   ` Andi Kleen
@ 2014-07-09 20:47   ` Luck, Tony
  2014-07-09 21:56     ` Havard Skinnemoen
  2014-07-10 16:41   ` Borislav Petkov
  2 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2014-07-09 20:47 UTC (permalink / raw)
  To: Havard Skinnemoen, Borislav Petkov; +Cc: linux-kernel, Ewout van Bekkum

		if (!(flags & MCP_UC) &&
-		    (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
+		    (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) {
+			spin_unlock_irqrestore(&mce_banks[i].poll_spinlock,
+					       irq_flags);
 			continue;
+		}

Perhaps move this test for UC up to before you grab the lock?

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks.
  2014-07-09 17:09 ` [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks Havard Skinnemoen
@ 2014-07-09 21:00   ` Luck, Tony
  2014-07-09 23:00     ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2014-07-09 21:00 UTC (permalink / raw)
  To: Havard Skinnemoen, Borislav Petkov; +Cc: linux-kernel, Ewout van Bekkum

+	if (!(no_way_out && cfg->tolerant < 3))
 		mce_clear_state(toclear);

Style - I think this is easier to grok:

	if (!no_way_out || cfg->tolerant >=3)
		mce_clear_state(toclear);

but not too strongly if other like !(a && b) form.

I'm never sure how to treat the crazy levels of "tolerant" though.  Do
we really want to clear the banks?  In one sense we do ... we are still
running and might see more UC errors. Since newer UC errors don't
overwrite older ones, clearing the banks allows us to see how many
errors are piling up and being ignored.

But running with tolerant==3 is likely to end in tears ... should we erase
the evidence on what bad things happened?

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 6/6] x86-mce: ensure the MCP timer is not already set in the mce_timer_fn.
  2014-07-09 17:09 ` [PATCH 6/6] x86-mce: ensure the MCP timer is not already set in the mce_timer_fn Havard Skinnemoen
@ 2014-07-09 21:04   ` Luck, Tony
  2014-07-09 23:01     ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2014-07-09 21:04 UTC (permalink / raw)
  To: Havard Skinnemoen, Borislav Petkov; +Cc: linux-kernel, Ewout van Bekkum

+	/* Ensure a CMCI interrupt can't preempt this. */
+	local_irq_save(flags);
 	if (mce_available(__this_cpu_ptr(&cpu_info))) {
 		machine_check_poll(MCP_TIMESTAMP,
 				&__get_cpu_var(mce_poll_banks));

Does this remove the problem that you fixed in part4?  If a CMCI can
no longer interrupt a poll ... do we really need the locks you added?

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-09 19:17   ` Borislav Petkov
@ 2014-07-09 21:24     ` Havard Skinnemoen
  2014-07-10  9:01       ` Chen, Gong
  2014-07-10 11:42       ` Borislav Petkov
  0 siblings, 2 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 21:24 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum

On Wed, Jul 9, 2014 at 12:17 PM, Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Jul 09, 2014 at 10:09:21AM -0700, Havard Skinnemoen wrote:
> > From: Ewout van Bekkum <ewout@google.com>
> >
> > The CMCI poll interval was updated to pick the minimum interval between
> > the original 30 seconds and the check_interval divided by 8 (minimum of
> > 3 polls).
>
> Why min 3 polls? How do you come up with exactly that frequency?

The idea is that if we make it equal to check_interval, it might
bounce back and forth a lot. So we need to divide by something, and 8
seems like a nice, safe value, and it seems to work well. We're not
opposed to considering other values, of course (e.g. 2 and 4 might
work well too, but with somewhat higher risk of ping-ponging).

> > This resolves a bug where the CMCI storm handler is unable to return to
> > interrupt mode from polling mode, if the check_interval shorter than the
> > CMCI poll interval. This problem is caused by the mce_timer_fn function
> > which only allows the poll interval to be incremented up to the
> > check_interval, while the mce_intel_adjust_timer function requires the
> > poll interval to be greater than the CMCI poll interval before leaving
> > the CMCI_STORM_ACTIVE state.
>
> Interesting. So it seems you guys want to set the check_interval to
> something < 30 secs.
>
> Out of curiosity, what is your use case which requires such small
> check_interval setting?

I'm not entirely sure. At some point, it ended up that way, and it
broke in non-obvious ways, so we wanted to fix it.

> Maybe we need to redesign and simplify this intervals thing to make it
> more user-friendly...
>
> Btw, on a related note, we're working on a small mechanism which
> collects correctable errors in the kernel and when a certain count for a
> physical error address has been reached, we soft-offline that page. We'd
> appreciate it if you guys took a look and told us whether it makes sense
> to you:
>
> http://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp@alien8.de

We will definitely take a look, thanks. Looks interesting, though it's
not always obvious what works for us until we actually go and try it.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks.
  2014-07-09 20:20   ` Luck, Tony
@ 2014-07-09 21:34     ` Havard Skinnemoen
  2014-07-10 15:51       ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 21:34 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, linux-kernel, Ewout van Bekkum

On Wed, Jul 9, 2014 at 1:20 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> The CMCI storm handler previously called cmci_reenable() when exiting a
>> CMCI storm. However, when entering a CMCI storm the bank ownership was
>> not relinquished by the affected CPUs. The CMCIs were only disabled via
>> cmci_storm_disable_banks(). The handler was updated to instead call a
>> new function, cmci_storm_enable_banks(), to reenable CMCI on the already
>> owned banks instead of rediscovering CMCI banks (which were still owned
>> but disabled).
>
> Won't this cause problems if we online a cpu during the storm. We will
> re-run the discovery algorithm and some other cpu that shares the bank
> will see MCi_CTL2{30} is zero and claim ownership.

Yes, I think you're right. We didn't test this with CPU hotplugging.

I'm at loss about how to fix it though. We need the CMCI bits to
detect shared banks, but they're not reflecting the actual state of
things at that point. If the CPU gives up ownership of the banks, then
we might just see the storm move from CPU to CPU, right?

We could keep a separate bitmask somewhere to indicate ownership, but
even if we can see that the bank is shared with some other CPU, we
don't know if it will be shared with a new CPU which we've never seen
before...

Perhaps we need to temporarily disable the storm handling when we're
bringing up a new CPU?

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-09 20:36   ` Luck, Tony
@ 2014-07-09 21:40     ` Havard Skinnemoen
  2014-07-10 16:24       ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 21:40 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, linux-kernel, Ewout van Bekkum

On Wed, Jul 9, 2014 at 1:36 PM, Luck, Tony <tony.luck@intel.com> wrote:
> +       if (!xchg(&reboot_notifier_registered, true))
> +               register_reboot_notifier(&cmci_reboot_notifier);
>
> This is super-safe ... but isn't the xchg() overkill? I thought we serialized bringup
> of other cpus.

Could be. There are spinlocks being taken elsewhere in that path
though, and I'm not sure if I see any downside with the xchg(), unless
this path is actually performance-critical.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 20:35   ` Andi Kleen
@ 2014-07-09 21:51     ` Havard Skinnemoen
  2014-07-09 23:32       ` Luck, Tony
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 21:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tony Luck, Borislav Petkov, Linux Kernel, Ewout van Bekkum

On Wed, Jul 9, 2014 at 1:35 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Havard Skinnemoen <hskinnemoen@google.com> writes:
>
>> machine_check_poll() was modified to use spin_lock_irqsave independently
>> per bank when a valid MCE is found to prevent duplicated MCE reports by
>> the CMCI and polling methods. In the common case no MCE will be found,
>> so the lock is not acquired until a valid MCE is found. The status is
>> reread after the lock is acquired in case the MCE was already handled by
>> a different thread. A unique spinlock is used per bank number, so
>> contention should be mostly limited to non-shared banks.
>
> This doesn't make sense. Banks are either owned by CMCI or by poll,
> not by both. If you have true duplicates the bug must be somewhere else.

I don't think we got the description right here. I think the real
issue here was machine check polls happening on multiple CPUs with
shared banks, all reporting the same MCEs. This is very reproducible
when booting with mce=no_cmci, since all CPUs will handle all banks,
and there's AFAICT no good way to identify shared banks without
enabling CMCI.

There may have been an interaction with CMCI here too at some point,
but it's possible that went away with the timer patch (which we did a
bit later).

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 20:47   ` Luck, Tony
@ 2014-07-09 21:56     ` Havard Skinnemoen
  0 siblings, 0 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 21:56 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, linux-kernel, Ewout van Bekkum

On Wed, Jul 9, 2014 at 1:47 PM, Luck, Tony <tony.luck@intel.com> wrote:
>                 if (!(flags & MCP_UC) &&
> -                   (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
> +                   (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) {
> +                       spin_unlock_irqrestore(&mce_banks[i].poll_spinlock,
> +                                              irq_flags);
>                         continue;
> +               }
>
> Perhaps move this test for UC up to before you grab the lock?

Will do, thanks.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks.
  2014-07-09 21:00   ` Luck, Tony
@ 2014-07-09 23:00     ` Havard Skinnemoen
  2014-07-09 23:27       ` Luck, Tony
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 23:00 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, linux-kernel, Ewout van Bekkum

On Wed, Jul 9, 2014 at 2:00 PM, Luck, Tony <tony.luck@intel.com> wrote:
> +       if (!(no_way_out && cfg->tolerant < 3))
>                 mce_clear_state(toclear);
>
> Style - I think this is easier to grok:
>
>         if (!no_way_out || cfg->tolerant >=3)
>                 mce_clear_state(toclear);
>
> but not too strongly if other like !(a && b) form.

I tend to agree with you. It came up during our internal review, and
others argued the other way. But since I'm in charge now, I'll change
it back ;-)

> I'm never sure how to treat the crazy levels of "tolerant" though.  Do
> we really want to clear the banks?  In one sense we do ... we are still
> running and might see more UC errors. Since newer UC errors don't
> overwrite older ones, clearing the banks allows us to see how many
> errors are piling up and being ignored.
>
> But running with tolerant==3 is likely to end in tears ... should we erase
> the evidence on what bad things happened?

It probably doesn't make a huge difference since you're not supposed
to run with tolerant=3, but I kind of understood the logic to be that
if we're going to keep running, we need to clear the banks, and if
we're going to crash, we need to leave them intact so whatever runs
next gets a chance to look at them. So with tolerant==3, we are going
to continue running, and I think for debugging purposes, it's useful
to see how many additional errors are happening.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 6/6] x86-mce: ensure the MCP timer is not already set in the mce_timer_fn.
  2014-07-09 21:04   ` Luck, Tony
@ 2014-07-09 23:01     ` Havard Skinnemoen
  0 siblings, 0 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-09 23:01 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, linux-kernel, Ewout van Bekkum

On Wed, Jul 9, 2014 at 2:04 PM, Luck, Tony <tony.luck@intel.com> wrote:
> +       /* Ensure a CMCI interrupt can't preempt this. */
> +       local_irq_save(flags);
>         if (mce_available(__this_cpu_ptr(&cpu_info))) {
>                 machine_check_poll(MCP_TIMESTAMP,
>                                 &__get_cpu_var(mce_poll_banks));
>
> Does this remove the problem that you fixed in part4?  If a CMCI can
> no longer interrupt a poll ... do we really need the locks you added?

It's possible this fixes part of the problem. But I still think
there's a race between multiple CPUs polling shared banks.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks.
  2014-07-09 23:00     ` Havard Skinnemoen
@ 2014-07-09 23:27       ` Luck, Tony
  2014-07-10 16:49         ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2014-07-09 23:27 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Borislav Petkov, linux-kernel, Ewout van Bekkum

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 493 bytes --]

> to run with tolerant=3, but I kind of understood the logic to be that
> if we're going to keep running, we need to clear the banks, and if
> we're going to crash, we need to leave them intact 

That makes sense ... fold some text like that into the commit
description, and this part is:

Acked-by: Tony Luck <tony.luck@intel.com>

ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 21:51     ` Havard Skinnemoen
@ 2014-07-09 23:32       ` Luck, Tony
  2014-07-10  8:16         ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Luck, Tony @ 2014-07-09 23:32 UTC (permalink / raw)
  To: Havard Skinnemoen, Andi Kleen
  Cc: Borislav Petkov, Linux Kernel, Ewout van Bekkum

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 712 bytes --]

> I don't think we got the description right here. I think the real
> issue here was machine check polls happening on multiple CPUs with
> shared banks, all reporting the same MCEs. This is very reproducible
> when booting with mce=no_cmci, since all CPUs will handle all banks,
> and there's AFAICT no good way to identify shared banks without
> enabling CMCI.

Ok - then update the description to enumerate the problem cases.
Probably:

*) mce=no_cmci
*) Intel cpus with MCG_CAP.MCG_CMCI_P=0
*) AMD cpus (they don't do CMCI at all, do they??)

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 23:32       ` Luck, Tony
@ 2014-07-10  8:16         ` Borislav Petkov
  0 siblings, 0 replies; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10  8:16 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Havard Skinnemoen, Andi Kleen, Linux Kernel, Ewout van Bekkum

On Wed, Jul 09, 2014 at 11:32:53PM +0000, Luck, Tony wrote:
> *) AMD cpus (they don't do CMCI at all, do they??)

They have something similar. See arch/x86/kernel/cpu/mcheck/mce_amd.c

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-09 21:24     ` Havard Skinnemoen
@ 2014-07-10  9:01       ` Chen, Gong
  2014-07-10 17:16         ` Havard Skinnemoen
  2014-07-10 11:42       ` Borislav Petkov
  1 sibling, 1 reply; 61+ messages in thread
From: Chen, Gong @ 2014-07-10  9:01 UTC (permalink / raw)
  To: Havard Skinnemoen
  Cc: Borislav Petkov, Tony Luck, Linux Kernel, Ewout van Bekkum

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

On Wed, Jul 09, 2014 at 02:24:31PM -0700, Havard Skinnemoen wrote:
> Date:	Wed, 9 Jul 2014 14:24:31 -0700
> From: Havard Skinnemoen <hskinnemoen@google.com>
> To: Borislav Petkov <bp@alien8.de>
> Cc: Tony Luck <tony.luck@intel.com>, Linux Kernel
>  <linux-kernel@vger.kernel.org>, Ewout van Bekkum <ewout@google.com>
> Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for
>  small check_interval values.
> 
> On Wed, Jul 9, 2014 at 12:17 PM, Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Wed, Jul 09, 2014 at 10:09:21AM -0700, Havard Skinnemoen wrote:
> > > From: Ewout van Bekkum <ewout@google.com>
> > >
> > > The CMCI poll interval was updated to pick the minimum interval between
> > > the original 30 seconds and the check_interval divided by 8 (minimum of
> > > 3 polls).
> >
> > Why min 3 polls? How do you come up with exactly that frequency?
> 
> The idea is that if we make it equal to check_interval, it might
> bounce back and forth a lot. So we need to divide by something, and 8
> seems like a nice, safe value, and it seems to work well. We're not
> opposed to considering other values, of course (e.g. 2 and 4 might
> work well too, but with somewhat higher risk of ping-ponging).
That value looks chosen a little bit at will. How about updating
CMCI_POLL_INTERVAL when check_interval is changed to ensure
CMCI_POLL_INTERVAL <= check_interval always.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-09 21:24     ` Havard Skinnemoen
  2014-07-10  9:01       ` Chen, Gong
@ 2014-07-10 11:42       ` Borislav Petkov
  2014-07-10 17:51         ` Havard Skinnemoen
  1 sibling, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 11:42 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

+ linux-edac.

On Wed, Jul 09, 2014 at 02:24:31PM -0700, Havard Skinnemoen wrote:
> > > The CMCI poll interval was updated to pick the minimum interval between
> > > the original 30 seconds and the check_interval divided by 8 (minimum of
> > > 3 polls).
> >
> > Why min 3 polls? How do you come up with exactly that frequency?
> 
> The idea is that if we make it equal to check_interval, it might
> bounce back and forth a lot. So we need to divide by something, and 8
> seems like a nice, safe value, and it seems to work well. We're not
> opposed to considering other values, of course (e.g. 2 and 4 might
> work well too, but with somewhat higher risk of ping-ponging).

Yep, this is exactly why I'm asking about your use case. Because if we
set it to any number, someone down the line will appear and say that
this doesn't suit her/his needs.

So, I'm thinking more in the direction of controlling it settings, maybe
even restricting check_interval and the CMCI poll interval, relative to
each other maybe, but still configurable with the max flexibility.

For that we'll need to answer questions like

* Which min value is sane?
* Do check_interval and CMCI poll interval need to be related at all?
* Which max value makes sense?
* What about check_interval, do we want to touch that too?
...

Just throwing out a bunch of questions, off the top of my head, to get
some opinions/rants, etc.

> I'm not entirely sure. At some point, it ended up that way, and it
> broke in non-obvious ways, so we wanted to fix it.

Right, so if we restrict it, the fix is even simpler. Unless you have a
more valid use than "[a]t some point, it ended up that way... " :-)

> We will definitely take a look, thanks. Looks interesting, though it's
> not always obvious what works for us until we actually go and try it.

Cool, thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks.
  2014-07-09 21:34     ` Havard Skinnemoen
@ 2014-07-10 15:51       ` Borislav Petkov
  2014-07-10 18:32         ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 15:51 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Luck, Tony, linux-kernel, Ewout van Bekkum

On Wed, Jul 09, 2014 at 02:34:39PM -0700, Havard Skinnemoen wrote:
> On Wed, Jul 9, 2014 at 1:20 PM, Luck, Tony <tony.luck@intel.com> wrote:
> >> The CMCI storm handler previously called cmci_reenable() when exiting a
> >> CMCI storm. However, when entering a CMCI storm the bank ownership was
> >> not relinquished by the affected CPUs. The CMCIs were only disabled via
> >> cmci_storm_disable_banks(). The handler was updated to instead call a
> >> new function, cmci_storm_enable_banks(), to reenable CMCI on the already
> >> owned banks instead of rediscovering CMCI banks (which were still owned
> >> but disabled).
> >
> > Won't this cause problems if we online a cpu during the storm. We will
> > re-run the discovery algorithm and some other cpu that shares the bank
> > will see MCi_CTL2{30} is zero and claim ownership.
> 
> Yes, I think you're right. We didn't test this with CPU hotplugging.
> 
> I'm at loss about how to fix it though. We need the CMCI bits to
> detect shared banks, but they're not reflecting the actual state of
> things at that point. If the CPU gives up ownership of the banks, then
> we might just see the storm move from CPU to CPU, right?
> 
> We could keep a separate bitmask somewhere to indicate ownership, but
> even if we can see that the bank is shared with some other CPU, we
> don't know if it will be shared with a new CPU which we've never seen
> before...
> 
> Perhaps we need to temporarily disable the storm handling when we're
> bringing up a new CPU?

Looking at this more, maybe cmci_storm_disable_banks() was a bad idea
after all. There's __cmci_disable_bank() which properly drops ownership
after having disabled CMCI.

Maybe we should kill cmci_storm_disable_banks() and do
__cmci_disable_bank by iterating over them all...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-09 21:40     ` Havard Skinnemoen
@ 2014-07-10 16:24       ` Borislav Petkov
  2014-07-10 16:33         ` Tony Luck
  2014-07-10 17:56         ` Havard Skinnemoen
  0 siblings, 2 replies; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 16:24 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Luck, Tony, linux-kernel, Ewout van Bekkum

On Wed, Jul 09, 2014 at 02:40:55PM -0700, Havard Skinnemoen wrote:
> On Wed, Jul 9, 2014 at 1:36 PM, Luck, Tony <tony.luck@intel.com> wrote:
> > +       if (!xchg(&reboot_notifier_registered, true))
> > +               register_reboot_notifier(&cmci_reboot_notifier);
> >
> > This is super-safe ... but isn't the xchg() overkill? I thought we serialized bringup
> > of other cpus.
> 
> Could be. There are spinlocks being taken elsewhere in that path
> though, and I'm not sure if I see any downside with the xchg(), unless
> this path is actually performance-critical.

Several things:

We don't need all that atomicity special fun if we register the reboot
notifier on the BSP, say from mcheck_init() which is done even pre-SMP.

If that's too early, we can add an initcall or whatever...

Also, since this is kexec-only, this all should be nicely enabled only
on KEXEC kernels. Please use the ifdeffery sparingly :-)

I'm saying that because I'm assuming BIOS will clear those MSRs upon
warm reset. If it doesn't, then we have a bigger problem.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-10 16:24       ` Borislav Petkov
@ 2014-07-10 16:33         ` Tony Luck
  2014-07-10 17:56         ` Havard Skinnemoen
  1 sibling, 0 replies; 61+ messages in thread
From: Tony Luck @ 2014-07-10 16:33 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Havard Skinnemoen, linux-kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 9:24 AM, Borislav Petkov <bp@alien8.de> wrote:
> I'm saying that because I'm assuming BIOS will clear those MSRs upon
> warm reset. If it doesn't, then we have a bigger problem.

We  don't have to trust BIOS to do that. SDM section 15.3.2.5
"IA32_MCi_CTL2 MSRs" says:

After processor reset, IA32_MCi_CTL2 MSRs are zero’ed.

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-09 17:09 ` [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports Havard Skinnemoen
  2014-07-09 20:35   ` Andi Kleen
  2014-07-09 20:47   ` Luck, Tony
@ 2014-07-10 16:41   ` Borislav Petkov
  2014-07-10 18:03     ` Havard Skinnemoen
  2 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 16:41 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, linux-kernel, Ewout van Bekkum

On Wed, Jul 09, 2014 at 10:09:24AM -0700, Havard Skinnemoen wrote:
> From: Ewout van Bekkum <ewout@google.com>
> 
> machine_check_poll() was modified to use spin_lock_irqsave independently
> per bank when a valid MCE is found to prevent duplicated MCE reports by
> the CMCI and polling methods. In the common case no MCE will be found,
> so the lock is not acquired until a valid MCE is found. The status is
> reread after the lock is acquired in case the MCE was already handled by
> a different thread. A unique spinlock is used per bank number, so
> contention should be mostly limited to non-shared banks.
> 
> Signed-off-by: Ewout van Bekkum <ewout@google.com>
> Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |  1 +
>  arch/x86/kernel/cpu/mcheck/mce.c          | 21 ++++++++++++++++++++-
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index 2f0b1e8..aa6843a 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -19,6 +19,7 @@ struct mce_bank {
>  	unsigned char init;				/* initialise bank? */
>  	struct device_attribute attr;			/* device attribute */
>  	char			attrname[ATTR_LEN];	/* attribute name */
> +	spinlock_t poll_spinlock;			/* lock for polling */
>  };
>  
>  int mce_severity(struct mce *a, int tolerant, char **msg);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 1ebdd34..64270d7 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -41,6 +41,8 @@
>  #include <linux/debugfs.h>
>  #include <linux/irq_work.h>
>  #include <linux/export.h>
> +#include <linux/mutex.h>
> +#include <linux/spinlock.h>
>  
>  #include <asm/processor.h>
>  #include <asm/mce.h>
> @@ -596,6 +598,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>  {
>  	struct mce m;
>  	int i;
> +	unsigned long irq_flags;
>  
>  	this_cpu_inc(mce_poll_count);
>  
> @@ -617,14 +620,28 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>  
>  		this_cpu_write(mce_polled_error, 1);
>  		/*
> +		 * Optimize for the common case where no MCEs are found.
> +		 */
> +		spin_lock_irqsave(&mce_banks[i].poll_spinlock, irq_flags);

This is pretty heavy - we're disabling interrupts for *every* bank and
with shorter polling intervals, this could become problematic fast.

What's wrong with doing this with cheap atomic_inc/dec_and_test?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks.
  2014-07-09 23:27       ` Luck, Tony
@ 2014-07-10 16:49         ` Borislav Petkov
  0 siblings, 0 replies; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 16:49 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Havard Skinnemoen, linux-kernel, Ewout van Bekkum

On Wed, Jul 09, 2014 at 11:27:07PM +0000, Luck, Tony wrote:
> > to run with tolerant=3, but I kind of understood the logic to be that
> > if we're going to keep running, we need to clear the banks, and if
> > we're going to crash, we need to leave them intact 
> 
> That makes sense ... fold some text like that into the commit
> description, and this part is:
> 
> Acked-by: Tony Luck <tony.luck@intel.com>

Yep, makes sense.

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-10  9:01       ` Chen, Gong
@ 2014-07-10 17:16         ` Havard Skinnemoen
  2014-07-11  2:12           ` Chen, Gong
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-10 17:16 UTC (permalink / raw)
  To: Chen, Gong; +Cc: Borislav Petkov, Tony Luck, Linux Kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 2:01 AM, Chen, Gong <gong.chen@linux.intel.com> wrote:
> On Wed, Jul 09, 2014 at 02:24:31PM -0700, Havard Skinnemoen wrote:
>> On Wed, Jul 9, 2014 at 12:17 PM, Borislav Petkov <bp@alien8.de> wrote:
>> > Why min 3 polls? How do you come up with exactly that frequency?
>>
>> The idea is that if we make it equal to check_interval, it might
>> bounce back and forth a lot. So we need to divide by something, and 8
>> seems like a nice, safe value, and it seems to work well. We're not
>> opposed to considering other values, of course (e.g. 2 and 4 might
>> work well too, but with somewhat higher risk of ping-ponging).
> That value looks chosen a little bit at will. How about updating
> CMCI_POLL_INTERVAL when check_interval is changed to ensure
> CMCI_POLL_INTERVAL <= check_interval always.

I guess that would work equally well, but we still need to determine
the magic number of how much less we want CMCI_POLL_INTERVAL to be.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-10 11:42       ` Borislav Petkov
@ 2014-07-10 17:51         ` Havard Skinnemoen
  2014-07-10 18:55           ` Tony Luck
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-10 17:51 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Thu, Jul 10, 2014 at 4:42 AM, Borislav Petkov <bp@alien8.de> wrote:
> + linux-edac.
>
> On Wed, Jul 09, 2014 at 02:24:31PM -0700, Havard Skinnemoen wrote:
>> > > The CMCI poll interval was updated to pick the minimum interval between
>> > > the original 30 seconds and the check_interval divided by 8 (minimum of
>> > > 3 polls).
>> >
>> > Why min 3 polls? How do you come up with exactly that frequency?
>>
>> The idea is that if we make it equal to check_interval, it might
>> bounce back and forth a lot. So we need to divide by something, and 8
>> seems like a nice, safe value, and it seems to work well. We're not
>> opposed to considering other values, of course (e.g. 2 and 4 might
>> work well too, but with somewhat higher risk of ping-ponging).
>
> Yep, this is exactly why I'm asking about your use case. Because if we
> set it to any number, someone down the line will appear and say that
> this doesn't suit her/his needs.

Note that if check_interval is left at its default setting, this patch
doesn't change anything. But I admit it might change things
unnecessarily for values of 1-4 minutes.

> So, I'm thinking more in the direction of controlling it settings, maybe
> even restricting check_interval and the CMCI poll interval, relative to
> each other maybe, but still configurable with the max flexibility.
>
> For that we'll need to answer questions like
>
> * Which min value is sane?

What's the typical interrupt rate during a storm? We should make it
significantly less frequent than that, otherwise there's no point
switching to polling.

IIRC we've seen at least several hundred CMCIs per second, so perhaps
100 ms would be a reasonable minimum? Or perhaps 10 ms, which is the
current minimum polling interval enforced by mce_timer_fn.

> * Do check_interval and CMCI poll interval need to be related at all?

CMCI poll interval can't be higher than check_interval, or the storm
handling will break. I don't recall if making them equal causes things
to break or if it's just suboptimal. Apart from that, we probably
shouldn't enforce any relationship.

If we export both knobs, we could make it the user's responsibility to
keep a sensible relationship between them, but it would require some
documentation work, I think.

> * Which max value makes sense?

I think it only needs to be limited by check_interval. Though I don't
think we'll hurt anyone by reducing the current 30-second cap --
you're still getting substantial savings vs a CMCI storm even if you
turn it down to a second or even less.

> * What about check_interval, do we want to touch that too?

If we're going to enforce a relationship between that and
CMCI_POLL_INTERVAL, I guess we'll have to?

> ...
>
> Just throwing out a bunch of questions, off the top of my head, to get
> some opinions/rants, etc.
>
>> I'm not entirely sure. At some point, it ended up that way, and it
>> broke in non-obvious ways, so we wanted to fix it.
>
> Right, so if we restrict it, the fix is even simpler. Unless you have a
> more valid use than "[a]t some point, it ended up that way... " :-)

Well, we've tried turning it all the way down to 5 seconds and nothing
broke except for the CMCI storm handling. We could probably turn it
down even further before it becomes prohibitively expensive. I don't
think the implementation-specific interaction with a hard-coded
30-second CMCI poll interval is a good reason to restrict it, though
I'll admit it would make things simpler.

A short polling interval is useful for detecting problems early, and
to see quicker results when experimenting.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-10 16:24       ` Borislav Petkov
  2014-07-10 16:33         ` Tony Luck
@ 2014-07-10 17:56         ` Havard Skinnemoen
  2014-07-10 18:27           ` Tony Luck
  2014-07-10 18:30           ` Borislav Petkov
  1 sibling, 2 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-10 17:56 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 9:24 AM, Borislav Petkov <bp@alien8.de> wrote:
> We don't need all that atomicity special fun if we register the reboot
> notifier on the BSP, say from mcheck_init() which is done even pre-SMP.
>
> If that's too early, we can add an initcall or whatever...

OK, will see if I can find a better place for it.

> Also, since this is kexec-only, this all should be nicely enabled only
> on KEXEC kernels. Please use the ifdeffery sparingly :-)
>
> I'm saying that because I'm assuming BIOS will clear those MSRs upon
> warm reset. If it doesn't, then we have a bigger problem.

Even on kexec-enabled kernels, we might be doing a regular BIOS boot.
Is there any way to detect that we're doing a kexec boot and only
clearing the MSRs in that case?

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-10 16:41   ` Borislav Petkov
@ 2014-07-10 18:03     ` Havard Skinnemoen
  2014-07-10 18:44       ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-10 18:03 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 9:41 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Jul 09, 2014 at 10:09:24AM -0700, Havard Skinnemoen wrote:
>> @@ -617,14 +620,28 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>>
>>               this_cpu_write(mce_polled_error, 1);
>>               /*
>> +              * Optimize for the common case where no MCEs are found.
>> +              */
>> +             spin_lock_irqsave(&mce_banks[i].poll_spinlock, irq_flags);
>
> This is pretty heavy - we're disabling interrupts for *every* bank and
> with shorter polling intervals, this could become problematic fast.

The lock is only taken if there are actual MCEs to be handled. The
following check is left in place above this:

        m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
        if (!(m.status & MCI_STATUS_VAL))
                continue;

But yeah, if there are lots of errors happening, it might get expensive.

> What's wrong with doing this with cheap atomic_inc/dec_and_test?

For non-shared banks, we might risk some CPUs not being able to poll
their banks in a long time if they happen to be more or less
synchronized with a different CPU. This will also get worse with
shorter polling intervals, and with larger numbers of CPUs.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-10 17:56         ` Havard Skinnemoen
@ 2014-07-10 18:27           ` Tony Luck
  2014-07-10 18:30           ` Borislav Petkov
  1 sibling, 0 replies; 61+ messages in thread
From: Tony Luck @ 2014-07-10 18:27 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Borislav Petkov, linux-kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 10:56 AM, Havard Skinnemoen
<hskinnemoen@google.com> wrote:
> Even on kexec-enabled kernels, we might be doing a regular BIOS boot.
> Is there any way to detect that we're doing a kexec boot and only
> clearing the MSRs in that case?

I'm not sure if we should worry about that ... not having the code bloat
for the functions is kexec is not configured is a good goal. Worrying
about a few extra microseconds in the shutdown path if kexec is configured
but we are actually doing a normal shutdown/reboot doesn't seem as
big of a worry.

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot.
  2014-07-10 17:56         ` Havard Skinnemoen
  2014-07-10 18:27           ` Tony Luck
@ 2014-07-10 18:30           ` Borislav Petkov
  1 sibling, 0 replies; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 18:30 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Luck, Tony, linux-kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 10:56:05AM -0700, Havard Skinnemoen wrote:
> Even on kexec-enabled kernels, we might be doing a regular BIOS boot.
> Is there any way to detect that we're doing a kexec boot and only
> clearing the MSRs in that case?

What Tony said: if the processor resets, those MSRs are cleared, forget
BIOS.

I dunno though whether the processor gets reset during kexec though. It
needs code staring...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks.
  2014-07-10 15:51       ` Borislav Petkov
@ 2014-07-10 18:32         ` Havard Skinnemoen
  0 siblings, 0 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-10 18:32 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Luck, Tony, linux-kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 8:51 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> On Wed, Jul 09, 2014 at 02:34:39PM -0700, Havard Skinnemoen wrote:
> > On Wed, Jul 9, 2014 at 1:20 PM, Luck, Tony <tony.luck@intel.com> wrote:
> > >> The CMCI storm handler previously called cmci_reenable() when exiting a
> > >> CMCI storm. However, when entering a CMCI storm the bank ownership was
> > >> not relinquished by the affected CPUs. The CMCIs were only disabled via
> > >> cmci_storm_disable_banks(). The handler was updated to instead call a
> > >> new function, cmci_storm_enable_banks(), to reenable CMCI on the already
> > >> owned banks instead of rediscovering CMCI banks (which were still owned
> > >> but disabled).
> > >
> > > Won't this cause problems if we online a cpu during the storm. We will
> > > re-run the discovery algorithm and some other cpu that shares the bank
> > > will see MCi_CTL2{30} is zero and claim ownership.
> >
> > Yes, I think you're right. We didn't test this with CPU hotplugging.
> >
> > I'm at loss about how to fix it though. We need the CMCI bits to
> > detect shared banks, but they're not reflecting the actual state of
> > things at that point. If the CPU gives up ownership of the banks, then
> > we might just see the storm move from CPU to CPU, right?
> >
> > We could keep a separate bitmask somewhere to indicate ownership, but
> > even if we can see that the bank is shared with some other CPU, we
> > don't know if it will be shared with a new CPU which we've never seen
> > before...
> >
> > Perhaps we need to temporarily disable the storm handling when we're
> > bringing up a new CPU?
>
> Looking at this more, maybe cmci_storm_disable_banks() was a bad idea
> after all. There's __cmci_disable_bank() which properly drops ownership
> after having disabled CMCI.
>
> Maybe we should kill cmci_storm_disable_banks() and do
> __cmci_disable_bank by iterating over them all...

But that will clear mce_banks_owned as well, so the bank won't get
polled while the storm is active...

If mce_banks_owned isn't cleared, cmci_reenable won't reclaim the
banks after the storm subsides.

Perhaps cmci_storm_enable_banks just needs to relinquish the bank if
CMCI_EN is already set? I don't think this is a perfect solution
though -- if a new CPU comes up while the current owner is in storm
mode, claims the bank, and then enters storm mode itself, they might
both end up claiming the bank, at least for a while.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-10 18:03     ` Havard Skinnemoen
@ 2014-07-10 18:44       ` Borislav Petkov
  2014-07-10 18:57         ` Tony Luck
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 18:44 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 11:03:43AM -0700, Havard Skinnemoen wrote:
> For non-shared banks, we might risk some CPUs not being able to
> poll their banks in a long time if they happen to be more or less
> synchronized with a different CPU. This will also get worse with
> shorter polling intervals, and with larger numbers of CPUs.

No, I meant to do something like

	if (atomic_dec_and_test(&mce_banks[i].poll))
		m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));

	atomic_add_unless(&mce_banks[i].poll, 1, 1);

so that you have only one CPU read the status register of mce_banks[i].

For non-shared banks, this will always work because no other CPU will
dec that variable anyway.

Or am I missing something...?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-10 17:51         ` Havard Skinnemoen
@ 2014-07-10 18:55           ` Tony Luck
  2014-07-10 22:45             ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2014-07-10 18:55 UTC (permalink / raw)
  To: Havard Skinnemoen
  Cc: Borislav Petkov, Linux Kernel, Ewout van Bekkum, linux-edac

On Thu, Jul 10, 2014 at 10:51 AM, Havard Skinnemoen
<hskinnemoen@google.com> wrote:
> What's the typical interrupt rate during a storm? We should make it
> significantly less frequent than that, otherwise there's no point
> switching to polling.
>
> IIRC we've seen at least several hundred CMCIs per second, so perhaps
> 100 ms would be a reasonable minimum? Or perhaps 10 ms, which is the
> current minimum polling interval enforced by mce_timer_fn.

I don't think we have a solid point to really declare "storm!".  The
CMCI rates between normal and abnormal rates are vast:

Normal rates are a few CMCI per year (or maybe per month ... if
you have a multi-terabyte machine perhaps even "per day" is normal).

So if you see two CMCI inside the same minute, you could declare
a storm.  Realistically we want the threshold a bit higher.

It then becomes a balance between seeing all the errors (so our PFA
mechanisms get enough data to spot bad pages and take action) and
processing so many interrupts that we begin to take a performamce
hit.

Once we do decide there is a storm - we know we have given up on
seeing all the errors ... the polling rate will only decide how fast we
can determine that the storm has ended.  I don't see a lot of value
in detecting the end at milli-second granularity. But we probably don't
want to give up minutes worth of PFA data if the storm does end.

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-10 18:44       ` Borislav Petkov
@ 2014-07-10 18:57         ` Tony Luck
  2014-07-10 19:12           ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2014-07-10 18:57 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 11:44 AM, Borislav Petkov <bp@alien8.de> wrote:
>         if (atomic_dec_and_test(&mce_banks[i].poll))
>                 m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
>
>         atomic_add_unless(&mce_banks[i].poll, 1, 1);
>
> so that you have only one CPU read the status register of mce_banks[i].
>
> For non-shared banks, this will always work because no other CPU will
> dec that variable anyway.

You don't know if the bank is shared or not.  If it is not shared, then the
above code might skip reading the bank because some other cpu is busy
reading that bank number - and seeing its own private bank.

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-10 18:57         ` Tony Luck
@ 2014-07-10 19:12           ` Borislav Petkov
  2014-07-11  9:24             ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-10 19:12 UTC (permalink / raw)
  To: Tony Luck; +Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 11:57:06AM -0700, Tony Luck wrote:
> You don't know if the bank is shared or not. If it is not shared, then
> the above code might skip reading the bank because some other cpu is
> busy reading that bank number - and seeing its own private bank.

Grrr, I see what you mean.

Hmm, can we serialize machine_check_poll per CPU in a cheaper way?
Disabling interrupts around the reads is certainly fine but still too
heavy-handed IMO.

I'll think about it more tomorrow - my brain is twisted enough for
today. :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-10 18:55           ` Tony Luck
@ 2014-07-10 22:45             ` Havard Skinnemoen
  2014-07-11 15:35               ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-10 22:45 UTC (permalink / raw)
  To: Tony Luck; +Cc: Borislav Petkov, Linux Kernel, Ewout van Bekkum, linux-edac

On Thu, Jul 10, 2014 at 11:55 AM, Tony Luck <tony.luck@gmail.com> wrote:
> On Thu, Jul 10, 2014 at 10:51 AM, Havard Skinnemoen
> <hskinnemoen@google.com> wrote:
>> What's the typical interrupt rate during a storm? We should make it
>> significantly less frequent than that, otherwise there's no point
>> switching to polling.
>>
>> IIRC we've seen at least several hundred CMCIs per second, so perhaps
>> 100 ms would be a reasonable minimum? Or perhaps 10 ms, which is the
>> current minimum polling interval enforced by mce_timer_fn.
>
> I don't think we have a solid point to really declare "storm!".  The
> CMCI rates between normal and abnormal rates are vast:

Right, I'm talking about "typical" abnormal rates, if you understand
what I mean. I probably shouldn't have used the word "typical".

To determine a minimum value, I think we need to consider machines
which are really bad, but not so bad that they cause non-correctable
errors. We use pushbutton DIMMs to simulate this in the lab.

So assuming the worst machines produce a few hundred CMCIs per second,
you're probably not going to see any performance improvement from the
CMCI storm handling if you set the polling interval to less than 10
ms. So that's what the minimum should be, I think. Or perhaps a second
if dealing with sub-second intervals make the userspace interface
ugly.

I'm not arguing that's a _sensible_ value, just that there's no point
in seting it to anything lower than that.

> Normal rates are a few CMCI per year (or maybe per month ... if
> you have a multi-terabyte machine perhaps even "per day" is normal).
>
> So if you see two CMCI inside the same minute, you could declare
> a storm.  Realistically we want the threshold a bit higher.
>
> It then becomes a balance between seeing all the errors (so our PFA
> mechanisms get enough data to spot bad pages and take action) and
> processing so many interrupts that we begin to take a performamce
> hit.
>
> Once we do decide there is a storm - we know we have given up on
> seeing all the errors ... the polling rate will only decide how fast we
> can determine that the storm has ended.  I don't see a lot of value
> in detecting the end at milli-second granularity. But we probably don't
> want to give up minutes worth of PFA data if the storm does end.

Right, and since we're talking about a balance, it may be best to give
the user as much room as possible to configure the rate according to
their system. I think the current defaults are sensible, but they're
not optimal for all machines.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-10 17:16         ` Havard Skinnemoen
@ 2014-07-11  2:12           ` Chen, Gong
  0 siblings, 0 replies; 61+ messages in thread
From: Chen, Gong @ 2014-07-11  2:12 UTC (permalink / raw)
  To: Havard Skinnemoen
  Cc: Borislav Petkov, Tony Luck, Linux Kernel, Ewout van Bekkum

[-- Attachment #1: Type: text/plain, Size: 1594 bytes --]

On Thu, Jul 10, 2014 at 10:16:56AM -0700, Havard Skinnemoen wrote:
> Date: Thu, 10 Jul 2014 10:16:56 -0700
> From: Havard Skinnemoen <hskinnemoen@google.com>
> To: "Chen, Gong" <gong.chen@linux.intel.com>
> Cc: Borislav Petkov <bp@alien8.de>, Tony Luck <tony.luck@intel.com>, Linux
>  Kernel <linux-kernel@vger.kernel.org>, Ewout van Bekkum <ewout@google.com>
> Subject: Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for
>  small check_interval values.
> 
> On Thu, Jul 10, 2014 at 2:01 AM, Chen, Gong <gong.chen@linux.intel.com> wrote:
> > On Wed, Jul 09, 2014 at 02:24:31PM -0700, Havard Skinnemoen wrote:
> >> On Wed, Jul 9, 2014 at 12:17 PM, Borislav Petkov <bp@alien8.de> wrote:
> >> > Why min 3 polls? How do you come up with exactly that frequency?
> >>
> >> The idea is that if we make it equal to check_interval, it might
> >> bounce back and forth a lot. So we need to divide by something, and 8
> >> seems like a nice, safe value, and it seems to work well. We're not
> >> opposed to considering other values, of course (e.g. 2 and 4 might
> >> work well too, but with somewhat higher risk of ping-ponging).
> > That value looks chosen a little bit at will. How about updating
> > CMCI_POLL_INTERVAL when check_interval is changed to ensure
> > CMCI_POLL_INTERVAL <= check_interval always.
> 
> I guess that would work equally well, but we still need to determine
> the magic number of how much less we want CMCI_POLL_INTERVAL to be.
> 
> Havard
I mean you can change CMCI_POLL_INTERVAL from a macro to a variable and
then do prove check.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-10 19:12           ` Borislav Petkov
@ 2014-07-11  9:24             ` Borislav Petkov
  2014-07-11 19:06               ` Tony Luck
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-11  9:24 UTC (permalink / raw)
  To: Tony Luck; +Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum

On Thu, Jul 10, 2014 at 09:12:24PM +0200, Borislav Petkov wrote:
> I'll think about it more tomorrow - my brain is twisted enough for
> today. :-)

Ok, new day, new luck. :-)

So, following yesterday's discussion, our problem is IMHO that shared
banks could be read multiple times before they're finally cleared,
leading to repeated MCE records.

Now, staring at machine_check_poll, the processing is controlled by one
bit - MCI_STATUS_VAL - which decides what happens next.

So how about we change processing around this one bit: we let only one
reader access MSR_IA32_MCx_STATUS(i) and clear it right afterwards by
saving its contents to m.status previously.

Concurrent callers of machine_check_poll will not read the MCI_STATUS
MSR and since they look at the local copy m.status which is 0, they'll
go to the next bank.

And this for the cost of a locked CMPXCHG when we have to inc
poll_reader which should be cheaper than disabling IRQs everytime.

I.e., something like that. Hmm...

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 09edd0b65fef..5483b507025a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -19,6 +19,7 @@ struct mce_bank {
 	unsigned char init;				/* initialise bank? */
 	struct device_attribute attr;			/* device attribute */
 	char			attrname[ATTR_LEN];	/* attribute name */
+	atomic_t		poll_reader;		/* sync for polled shared banks */
 };
 
 int mce_severity(struct mce *a, int tolerant, char **msg);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bb92f38153b2..443861da86e4 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -609,9 +609,20 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		m.addr = 0;
 		m.bank = i;
 		m.tsc = 0;
+		m.status = 0;
 
 		barrier();
-		m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+
+		if (atomic_add_unless(&mce_banks[i].poll_reader, 1, 1)) {
+			m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
+
+			if (m.status & MCI_STATUS_VAL)
+				/* clear status register for this bank */
+				mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
+
+			atomic_dec(&mce_banks[i].poll_reader);
+		}
+
 		if (!(m.status & MCI_STATUS_VAL))
 			continue;
 
@@ -637,17 +648,12 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
 			mce_log(&m);
 
-		/*
-		 * Clear state for this bank.
-		 */
-		mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
 	}
 
 	/*
 	 * Don't clear MCG_STATUS here because it's only defined for
 	 * exceptions.
 	 */
-
 	sync_core();
 }
 EXPORT_SYMBOL_GPL(machine_check_poll);



-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-10 22:45             ` Havard Skinnemoen
@ 2014-07-11 15:35               ` Borislav Petkov
  2014-07-11 18:56                 ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-11 15:35 UTC (permalink / raw)
  To: Havard Skinnemoen, Tony Luck; +Cc: Linux Kernel, Ewout van Bekkum, linux-edac

On Thu, Jul 10, 2014 at 03:45:22PM -0700, Havard Skinnemoen wrote:
> I'm not arguing that's a _sensible_ value, just that there's no point
> in seting it to anything lower than that.

Ok,

right now, during the CMCI interrupt, we increment the count of how
many times we fire. If during one CMCI_STORM_INTERVAL we fire more than
CMCI_STORM_THRESHOLD times, we declare storm.

And this is count-based and does not necessarily mean that with more
than CMCI_STORM_THRESHOLD CMCIs, we can't continue using CMCI instead of
switching to polling.

An IRQ->POLL switch, however, is normally done because the interrupt
fires too often and with an overhead where we just as well can simply
poll.

So how about we change the whole scheme a bit, maybe even simplify it in
the process:

So, with roughly few hundred CMCIs per second, we can be generous and
say we can handle 100 CMCIs per second just fine. Which would mean, if
the CMCI handler takes 10ms, with 100 CMCIs per second, we spend the
whole time handling CMCIs. And we don't want that so we better poll.
Those numbers are which tell us whether we should poll or not.

But since we're very cautious, we go an order of magnitude up and say,
if we get a second CMCI in under 100ms, we switch to polling. Or as Tony
says, we switch to polling if we see a second CMCI in the same minute.
Let's put the exact way of determining that aside for now.

Then, we start polling. We poll every min interval, say 10ms for, say,
a second. We do this relatively long so that we save us unnecessary
ping-ponging between CMCI and poll.

If during that second we have seen errors, we extend the polling
interval by another second. And so on...

After a second where we haven't seen any errors, we switch back to CMCI.
check_interval relaxes back to 5 min and all gets to its normal boring
existence. Otherwise, we enter storm mode quickly again.

This way we change the heuristic when we switch to storm mode from based
on the number of CMCIs per interval to closeness of occurrence of CMCIs.
They're similar but the second method will get us in storm mode pretty
quickly and get us polling.

The more important follow up from this is that if we can decide upon

* duration of CMCI, i.e. the 10ms above

* max number of CMCIs per second a system can sustain fine, i.e. the 100
above

* total polling duration during storm, i.e. the 1 second above

and if those are chosen generously for every system out there, then we
don't need to dynamically adjust the polling interval.

Basically the scheme becomes the following:

* We switch to polling if we detect a second CMCI under an interval X
* We poll Y times, each polling with a duration Z.
* If during those Y*Z msec of polling, we've encountered errors, we
enlarge the polling interval to additional Y*Z msec.


check_interval will be capped on the low end to something bigger than
the polling duration Y*Z and only the storm detection code will be
allowed to go to lower intervals and switch to polling.

At least something like that. In general, I'd like to make it more
robust for every system without the need for user interaction, i.e.
adjusting check_interval and where it just works.

I don't know whether any of the above makes sense - I hope that the
gist of it at least shows what IO think we should be doing: instead
of letting users configure the check_interval and influence the CMCI
polling interval, we should rely purely on machine characteristics to
set minimum values under which we poll and above which, we do the normal
duration enlarging dance.

So, flame away... :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 15:35               ` Borislav Petkov
@ 2014-07-11 18:56                 ` Havard Skinnemoen
  2014-07-11 20:10                   ` Borislav Petkov
                                     ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-11 18:56 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 8:35 AM, Borislav Petkov <bp@alien8.de> wrote:
> So, with roughly few hundred CMCIs per second, we can be generous and
> say we can handle 100 CMCIs per second just fine. Which would mean, if
> the CMCI handler takes 10ms, with 100 CMCIs per second, we spend the
> whole time handling CMCIs. And we don't want that so we better poll.
> Those numbers are which tell us whether we should poll or not.
>
> But since we're very cautious, we go an order of magnitude up and say,
> if we get a second CMCI in under 100ms, we switch to polling. Or as Tony
> says, we switch to polling if we see a second CMCI in the same minute.
> Let's put the exact way of determining that aside for now.

So a short burst of CMCIs would send us instantly into polling mode,
which would probably be suboptimal if things are quiet after that.
Counting is a lot more robust against this.

> Then, we start polling. We poll every min interval, say 10ms for, say,
> a second. We do this relatively long so that we save us unnecessary
> ping-ponging between CMCI and poll.
>
> If during that second we have seen errors, we extend the polling
> interval by another second. And so on...

If we see two errors every 2 seconds (for example due to a bug causing
us to see duplicate MCEs), we'd ping-pong back and forth between CMCI
and polling mode on every error, polling 51 times per second on
average. This seems a lot more expensive than just staying in CMCI
mode. And we risk losing information if there are instead, say, 4
errors every 2 seconds.

> After a second where we haven't seen any errors, we switch back to CMCI.
> check_interval relaxes back to 5 min and all gets to its normal boring
> existence. Otherwise, we enter storm mode quickly again.

Since the storm detection is now independent of check_interval, we
don't need to place any restrictions on it right?

> This way we change the heuristic when we switch to storm mode from based
> on the number of CMCIs per interval to closeness of occurrence of CMCIs.
> They're similar but the second method will get us in storm mode pretty
> quickly and get us polling.
>
> The more important follow up from this is that if we can decide upon
>
> * duration of CMCI, i.e. the 10ms above
>
> * max number of CMCIs per second a system can sustain fine, i.e. the 100
> above

What's the definition of "fine"? 1% performance hit? 10%? How can we
make that decision without knowing how hard the users are pushing
their systems?

> * total polling duration during storm, i.e. the 1 second above
>
> and if those are chosen generously for every system out there, then we
> don't need to dynamically adjust the polling interval.

I'm not sure how we can be generous when there's a tradeoff involved.
If we make the interval "generously low", we end up hurting
performance. if we make it "generously high", we'll lose information.

> Basically the scheme becomes the following:
>
> * We switch to polling if we detect a second CMCI under an interval X
> * We poll Y times, each polling with a duration Z.
> * If during those Y*Z msec of polling, we've encountered errors, we
> enlarge the polling interval to additional Y*Z msec.
>
>
> check_interval will be capped on the low end to something bigger than
> the polling duration Y*Z and only the storm detection code will be
> allowed to go to lower intervals and switch to polling.
>
> At least something like that. In general, I'd like to make it more
> robust for every system without the need for user interaction, i.e.
> adjusting check_interval and where it just works.

But at the same time, this scheme introduces even more variables that
need careful tuning, e.g. storm polling interval and storm duration,
while not really doing anything to make check_interval superfluous. Do
you really think we can tune these variables correctly for every
system out there?

Or if we want to be generous: How about we just hardcode
check_interval to 5 seconds. Would that be fine with everyone?

> I don't know whether any of the above makes sense - I hope that the
> gist of it at least shows what IO think we should be doing: instead
> of letting users configure the check_interval and influence the CMCI
> polling interval, we should rely purely on machine characteristics to
> set minimum values under which we poll and above which, we do the normal
> duration enlarging dance.

I think the scheme may work, although I'm worried about the burstiness
mentioned above.

But I don't really buy that pulling a handful of numbers out of thin
air and saying it should work for everyone is going to work. Either we
need solid data to back up those numbers, or we need to make them
configurable so people can experiment and find what works best for
them.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-11  9:24             ` Borislav Petkov
@ 2014-07-11 19:06               ` Tony Luck
  2014-07-11 19:52                 ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2014-07-11 19:06 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum

> +               if (atomic_add_unless(&mce_banks[i].poll_reader, 1, 1)) {
> +                       m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));

Same as yesterday. You may skip reading a bank because someone else
is reading the same bank number, even though you don't share that bank
with them.

If we are willing to be rather flexible amount when polling happens,
and not allow very fast poll rates. Then we could do something like
have the lowest numbered online cpu be the only one that sets a
timer. When it goes off, it scans its own banks, and then uses an
async cross-processor call to poke the next highest numbered
online cpu to have it scan banks and poke the next guy.

That way we know that two cpus can't be polling at the same time,
because we convoy them all one at a time.

Fast poll rates would be a problem on very large systems. Might
need to have the highest numbered cpu notice that it is at the
end of the chain and set some flag so the lowest one can tell
whether it is safe to begin the next ripple.

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-11 19:06               ` Tony Luck
@ 2014-07-11 19:52                 ` Borislav Petkov
  2014-07-11 21:15                   ` Havard Skinnemoen
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-11 19:52 UTC (permalink / raw)
  To: Tony Luck; +Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum

On Fri, Jul 11, 2014 at 12:06:40PM -0700, Tony Luck wrote:
> > +               if (atomic_add_unless(&mce_banks[i].poll_reader, 1, 1)) {
> > +                       m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
> 
> Same as yesterday. You may skip reading a bank because someone else
> is reading the same bank number, even though you don't share that bank
> with them.

Not if those banks are in a percpu variable. And this is what
machine_check_poll gets. The ->poll_reader thing is then per-cpu too.

For shared banks it should work also as expected since we want there
only one reader to see the MCE signature.

> If we are willing to be rather flexible amount when polling happens,
> and not allow very fast poll rates. Then we could do something like
> have the lowest numbered online cpu be the only one that sets a
> timer. When it goes off, it scans its own banks, and then uses an
> async cross-processor call to poke the next highest numbered
> online cpu to have it scan banks and poke the next guy.
> 
> That way we know that two cpus can't be polling at the same time,
> because we convoy them all one at a time.

See above - those banks are percpu. And besides, mce_timer_fn already
has the WARN_ON which otherwise be firing left and right.

It seems, Havard's issue is only with shared banks. I think they only
cause the repeated error records.

> Fast poll rates would be a problem on very large systems. Might
> need to have the highest numbered cpu notice that it is at the
> end of the chain and set some flag so the lowest one can tell
> whether it is safe to begin the next ripple.

Well, if fast polling rates will be a problem anyway, we probably should
talk about adjusting the polling alg. too.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 18:56                 ` Havard Skinnemoen
@ 2014-07-11 20:10                   ` Borislav Petkov
  2014-07-11 20:39                     ` Havard Skinnemoen
  2014-07-11 20:22                   ` Borislav Petkov
  2014-07-11 20:36                   ` Borislav Petkov
  2 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-11 20:10 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

I'm going to reply with multiple mails so that we can keep the things
separate and not let replies grow out of proportion.

On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
> So a short burst of CMCIs would send us instantly into polling mode,
> which would probably be suboptimal if things are quiet after that.
> Counting is a lot more robust against this.

Yes, but CMCI_STORM_THRESHOLD is arbitrary too. How is getting 15 CMCIs
per second an interrupt storm? Apparently boxes can handle couple of
hundred CMCIs per second just fine...

> If we see two errors every 2 seconds (for example due to a bug causing
> us to see duplicate MCEs), we'd ping-pong back and forth between CMCI
> and polling mode on every error, polling 51 times per second on
> average. This seems a lot more expensive than just staying in CMCI
> mode. And we risk losing information if there are instead, say, 4
> errors every 2 seconds.
> 
> > After a second where we haven't seen any errors, we switch back to CMCI.
> > check_interval relaxes back to 5 min and all gets to its normal boring
> > existence. Otherwise, we enter storm mode quickly again.
> 
> Since the storm detection is now independent of check_interval, we
> don't need to place any restrictions on it right?

Ok, so my initial storm detection was dumb, ok. Counting the way we do
it now is purely sucked out of thin air too.

Instead, the criteria should probably be something like: what is the
number of CMCIs per second which we can process while leaving system
operation relatively unaffected? Anything above that number constitutes
a CMCI storm.

Now, how we'll come up with an answer to that question is a whole
another story...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 18:56                 ` Havard Skinnemoen
  2014-07-11 20:10                   ` Borislav Petkov
@ 2014-07-11 20:22                   ` Borislav Petkov
  2014-07-12  0:10                     ` Havard Skinnemoen
  2014-07-11 20:36                   ` Borislav Petkov
  2 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-11 20:22 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
> > * max number of CMCIs per second a system can sustain fine, i.e. the 100
> > above
> 
> What's the definition of "fine"? 1% performance hit? 10%? How can we
> make that decision without knowing how hard the users are pushing
> their systems?

Those are definitely unchartered territories we're moving into so yes,
answering that won't be easy.

Let's try it: if the anount of time we spend per second in the CMCI
handler becomes higher than the amount of time we spend polling for that
same second, then we might just as well poll and save us the interrupt
load.

So, for example, say for 10ms poll rate and single poll duration of
2ms, if time spent in CMCI exceeds 200ms for that second, we switch to
polling. Hypothetical numbers, of course.

Can we measure it on every system? Probably not. So we need to
approximate it somehow.

> 
> > * total polling duration during storm, i.e. the 1 second above
> >
> > and if those are chosen generously for every system out there, then we
> > don't need to dynamically adjust the polling interval.
> 
> I'm not sure how we can be generous when there's a tradeoff involved.
> If we make the interval "generously low", we end up hurting
> performance. if we make it "generously high", we'll lose information.

Yeah, by "generous" I meant, choose values which fit all. But I realize
now that this is a dumb idea. Maybe we could measure it on each system,
read the TSC on CMCI entry and exit and thus get an average CMCI
duration...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 18:56                 ` Havard Skinnemoen
  2014-07-11 20:10                   ` Borislav Petkov
  2014-07-11 20:22                   ` Borislav Petkov
@ 2014-07-11 20:36                   ` Borislav Petkov
  2014-07-11 21:05                     ` Havard Skinnemoen
  2 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-11 20:36 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
> > Basically the scheme becomes the following:
> >
> > * We switch to polling if we detect a second CMCI under an interval X
> > * We poll Y times, each polling with a duration Z.
> > * If during those Y*Z msec of polling, we've encountered errors, we
> > enlarge the polling interval to additional Y*Z msec.
> >
> >
> > check_interval will be capped on the low end to something bigger than
> > the polling duration Y*Z and only the storm detection code will be
> > allowed to go to lower intervals and switch to polling.
> >
> > At least something like that. In general, I'd like to make it more
> > robust for every system without the need for user interaction, i.e.
> > adjusting check_interval and where it just works.
> 
> But at the same time, this scheme introduces even more variables that
> need careful tuning, e.g. storm polling interval and storm duration,
> while not really doing anything to make check_interval superfluous. Do

Oh, we can't make check_interval superfluous - it is API to userspace
for a long time now.

> you really think we can tune these variables correctly for every
> system out there?

Right, I was trying to figure out a scheme first where polling intervals
and thresholds would actually make sense and not be arbitrary.

We probably won't be able to have the exact values for each system but a
smart approximation could do the job nicely enough.

> Or if we want to be generous: How about we just hardcode
> check_interval to 5 seconds. Would that be fine with everyone?

We could but again, it is an API to userspace exported through sysfs.

Besides, on a healthy system, you see errors so seldomly that 5sec is
pure waste of energy.

> > I don't know whether any of the above makes sense - I hope that the
> > gist of it at least shows what IO think we should be doing: instead
> > of letting users configure the check_interval and influence the CMCI
> > polling interval, we should rely purely on machine characteristics to
> > set minimum values under which we poll and above which, we do the normal
> > duration enlarging dance.
> 
> I think the scheme may work, although I'm worried about the burstiness
> mentioned above.
>
> But I don't really buy that pulling a handful of numbers out of thin
> air and saying it should work for everyone is going to work.

No no, absolutely not. This is exactly what I think should be fixed as
the current numbers are likely pulled out of thin air. Simply because
figuring the optimal ones is a very hard task, as we come to realize.

> Either we need solid data to back up those numbers, or we need to make
> them configurable so people can experiment and find what works best
> for them.

..., or, we could measure them on each system and approximate them to
the ones close to optimal for that particular system, over the course of
its runtime.

Thanks for taking the time and humouring me with that crazy
brainstorming!

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 20:10                   ` Borislav Petkov
@ 2014-07-11 20:39                     ` Havard Skinnemoen
  2014-07-14 14:57                       ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-11 20:39 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 1:10 PM, Borislav Petkov <bp@alien8.de> wrote:
> I'm going to reply with multiple mails so that we can keep the things
> separate and not let replies grow out of proportion.
>
> On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
>> So a short burst of CMCIs would send us instantly into polling mode,
>> which would probably be suboptimal if things are quiet after that.
>> Counting is a lot more robust against this.
>
> Yes, but CMCI_STORM_THRESHOLD is arbitrary too. How is getting 15 CMCIs
> per second an interrupt storm? Apparently boxes can handle couple of
> hundred CMCIs per second just fine...

Sorry, I was being unclear. I was actually arguing the opposite:
Getting 15 CMCIs per second is fine and shouldn't cause any switch to
polling mode, especially if the polling will happen at 100 times per
second. But your proposal would switch to polling if we ever see 2
CMCIs within a period, which seems way too trigger-happy, even if the
period is short.

I do agree there are already a lot of arbitrary numbers in the code.

>> If we see two errors every 2 seconds (for example due to a bug causing
>> us to see duplicate MCEs), we'd ping-pong back and forth between CMCI
>> and polling mode on every error, polling 51 times per second on
>> average. This seems a lot more expensive than just staying in CMCI
>> mode. And we risk losing information if there are instead, say, 4
>> errors every 2 seconds.
>>
>> > After a second where we haven't seen any errors, we switch back to CMCI.
>> > check_interval relaxes back to 5 min and all gets to its normal boring
>> > existence. Otherwise, we enter storm mode quickly again.
>>
>> Since the storm detection is now independent of check_interval, we
>> don't need to place any restrictions on it right?
>
> Ok, so my initial storm detection was dumb, ok. Counting the way we do
> it now is purely sucked out of thin air too.
>
> Instead, the criteria should probably be something like: what is the
> number of CMCIs per second which we can process while leaving system
> operation relatively unaffected? Anything above that number constitutes
> a CMCI storm.

That sounds good to me. But now you're talking about CMCIs per second,
which seems to imply some form of counting right?

> Now, how we'll come up with an answer to that question is a whole
> another story...

Right. If we can come up with an answer, that's great, but if we
don't, I think we're better off exporting a nice knob and letting the
user tune his system according to his needs.

Just to throw another number out, how about doing CMCI storm polling
at a fixed interval of 100 ms? Since check_interval is an integer
representing a number of seconds, it can never get lower than 10x this
number, so we won't need to restrict it any further.

If we see more than X CMCIs in a second, we switch to polling. If less
than Y out of 10 polls see an error, we switch back to CMCI.

Now, we still leave 3 magic numbers to be figured out...but I think
their range is somewhat more limited.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 20:36                   ` Borislav Petkov
@ 2014-07-11 21:05                     ` Havard Skinnemoen
  0 siblings, 0 replies; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-11 21:05 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 1:36 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
>> > Basically the scheme becomes the following:
>> >
>> > * We switch to polling if we detect a second CMCI under an interval X
>> > * We poll Y times, each polling with a duration Z.
>> > * If during those Y*Z msec of polling, we've encountered errors, we
>> > enlarge the polling interval to additional Y*Z msec.
>> >
>> >
>> > check_interval will be capped on the low end to something bigger than
>> > the polling duration Y*Z and only the storm detection code will be
>> > allowed to go to lower intervals and switch to polling.
>> >
>> > At least something like that. In general, I'd like to make it more
>> > robust for every system without the need for user interaction, i.e.
>> > adjusting check_interval and where it just works.
>>
>> But at the same time, this scheme introduces even more variables that
>> need careful tuning, e.g. storm polling interval and storm duration,
>> while not really doing anything to make check_interval superfluous. Do
>
> Oh, we can't make check_interval superfluous - it is API to userspace
> for a long time now.

Oh, I guess I misunderstood. I thought you were actually talking about
removing that knob.

>> you really think we can tune these variables correctly for every
>> system out there?
>
> Right, I was trying to figure out a scheme first where polling intervals
> and thresholds would actually make sense and not be arbitrary.
>
> We probably won't be able to have the exact values for each system but a
> smart approximation could do the job nicely enough.

Sounds good, but we need to limit the complexity (which is why we
can't get exact values).

>> Or if we want to be generous: How about we just hardcode
>> check_interval to 5 seconds. Would that be fine with everyone?
>
> We could but again, it is an API to userspace exported through sysfs.
>
> Besides, on a healthy system, you see errors so seldomly that 5sec is
> pure waste of energy.

True, but it sometimes makes sense to turn it down to a seemingly
insane value, e.g. during hardware testing and qualification. Which is
why I want to make sure values in that range work.

But please disregard my suggestion to hardcode check_interval -- it's
a bad idea and we're not going to remove that knob anyway.

>> > I don't know whether any of the above makes sense - I hope that the
>> > gist of it at least shows what IO think we should be doing: instead
>> > of letting users configure the check_interval and influence the CMCI
>> > polling interval, we should rely purely on machine characteristics to
>> > set minimum values under which we poll and above which, we do the normal
>> > duration enlarging dance.
>>
>> I think the scheme may work, although I'm worried about the burstiness
>> mentioned above.
>>
>> But I don't really buy that pulling a handful of numbers out of thin
>> air and saying it should work for everyone is going to work.
>
> No no, absolutely not. This is exactly what I think should be fixed as
> the current numbers are likely pulled out of thin air. Simply because
> figuring the optimal ones is a very hard task, as we come to realize.
>
>> Either we need solid data to back up those numbers, or we need to make
>> them configurable so people can experiment and find what works best
>> for them.
>
> ..., or, we could measure them on each system and approximate them to
> the ones close to optimal for that particular system, over the course of
> its runtime.

I like the idea, but I'm worried about the complexity. Maybe what you
said elsewhere makes sense -- I'll have to look at it more closely.

> Thanks for taking the time and humouring me with that crazy
> brainstorming!

You're welcome, and likewise :)

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-11 19:52                 ` Borislav Petkov
@ 2014-07-11 21:15                   ` Havard Skinnemoen
  2014-07-17 10:50                     ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-11 21:15 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum

On Fri, Jul 11, 2014 at 12:52 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 11, 2014 at 12:06:40PM -0700, Tony Luck wrote:
>> > +               if (atomic_add_unless(&mce_banks[i].poll_reader, 1, 1)) {
>> > +                       m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
>>
>> Same as yesterday. You may skip reading a bank because someone else
>> is reading the same bank number, even though you don't share that bank
>> with them.
>
> Not if those banks are in a percpu variable. And this is what
> machine_check_poll gets. The ->poll_reader thing is then per-cpu too.
>
> For shared banks it should work also as expected since we want there
> only one reader to see the MCE signature.

If poll_reader is per-cpu, what will prevent multiple CPUs from
reading the same error from a shared bank?

>> If we are willing to be rather flexible amount when polling happens,
>> and not allow very fast poll rates. Then we could do something like
>> have the lowest numbered online cpu be the only one that sets a
>> timer. When it goes off, it scans its own banks, and then uses an
>> async cross-processor call to poke the next highest numbered
>> online cpu to have it scan banks and poke the next guy.
>>
>> That way we know that two cpus can't be polling at the same time,
>> because we convoy them all one at a time.
>
> See above - those banks are percpu. And besides, mce_timer_fn already
> has the WARN_ON which otherwise be firing left and right.

mce_banks isn't currently percpu.

> It seems, Havard's issue is only with shared banks. I think they only
> cause the repeated error records.

But the problem with shared banks is that multiple CPUs read them, and
we can't tell if the errors are duplicate unless we make sure that
only one CPU gets to read and clear it at a time. Any cpu-local
synchronization mechanism isn't going to work.

If you're worried about disabling interrupts, it's possible we don't
really need to make the spinlocks irqsafe. I'm not sure if we had any
reason for that other than "just to be safe".

Or we could keep the (irqsafe) spinlocks but move the clearing much
earlier. There may have been a reason why the current code clears the
bank status last though -- perhaps we also need to read out all the
state while we hold the lock, before we clear the status bit.

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 20:22                   ` Borislav Petkov
@ 2014-07-12  0:10                     ` Havard Skinnemoen
  2014-07-14 15:14                       ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Havard Skinnemoen @ 2014-07-12  0:10 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 1:22 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jul 11, 2014 at 11:56:11AM -0700, Havard Skinnemoen wrote:
>> > * max number of CMCIs per second a system can sustain fine, i.e. the 100
>> > above
>>
>> What's the definition of "fine"? 1% performance hit? 10%? How can we
>> make that decision without knowing how hard the users are pushing
>> their systems?
>
> Those are definitely unchartered territories we're moving into so yes,
> answering that won't be easy.
>
> Let's try it: if the anount of time we spend per second in the CMCI
> handler becomes higher than the amount of time we spend polling for that
> same second, then we might just as well poll and save us the interrupt
> load.
>
> So, for example, say for 10ms poll rate and single poll duration of
> 2ms, if time spent in CMCI exceeds 200ms for that second, we switch to
> polling. Hypothetical numbers, of course.

200ms per second means we're using 20% of that CPU. I'd say that's
definitely too much. But I like the general approach.

> Can we measure it on every system? Probably not. So we need to
> approximate it somehow.
>
>>
>> > * total polling duration during storm, i.e. the 1 second above
>> >
>> > and if those are chosen generously for every system out there, then we
>> > don't need to dynamically adjust the polling interval.
>>
>> I'm not sure how we can be generous when there's a tradeoff involved.
>> If we make the interval "generously low", we end up hurting
>> performance. if we make it "generously high", we'll lose information.
>
> Yeah, by "generous" I meant, choose values which fit all. But I realize
> now that this is a dumb idea. Maybe we could measure it on each system,
> read the TSC on CMCI entry and exit and thus get an average CMCI
> duration...

Sounds interesting. Some things that may need some more thought:

1. What percentage of CPU is OK to use before we consider it a storm?

2. How do we map that number to polling mode, where we may not see all
the errors? If we get it wrong, we may end up bouncing at a very high
rate.

3. If we go for a fixed polling rate, how do we make sure it doesn't
require more CPU than what we determined in (1)?

Havard

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-11 20:39                     ` Havard Skinnemoen
@ 2014-07-14 14:57                       ` Borislav Petkov
  0 siblings, 0 replies; 61+ messages in thread
From: Borislav Petkov @ 2014-07-14 14:57 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 01:39:19PM -0700, Havard Skinnemoen wrote:
> Sorry, I was being unclear. I was actually arguing the opposite:
> Getting 15 CMCIs per second is fine and shouldn't cause any switch to
> polling mode, especially if the polling will happen at 100 times per
> second. But your proposal would switch to polling if we ever see 2
> CMCIs within a period, which seems way too trigger-happy, even if the
> period is short.
>
> I do agree there are already a lot of arbitrary numbers in the code.

Yes, triggerhappy is no good either.

The thing is, even if we would come up with a correct number now, who's
to say that that same number would be correct on future uarches? I like
the idea of approximating storm entry point on each system and I, like
you, worry about complexity. This needs to be done really conservatively
and without rushing...

Thankfully, this thread has some nice starting ideas. :)

> > Instead, the criteria should probably be something like: what is the
> > number of CMCIs per second which we can process while leaving system
> > operation relatively unaffected? Anything above that number constitutes
> > a CMCI storm.
> 
> That sounds good to me. But now you're talking about CMCIs per second,
> which seems to imply some form of counting right?

<thinking out loud>

Well, I was thinking of measuring the average duration of the CMCI
interrupt handler (which basically is machine_check_poll) and then maybe
allowing x% of that per second. Any higher count above x% switches to
storm.

So we'll probably end up counting again but CMCI_STORM_THRESHOLD will be
determined dynamically by doing:

	CMCI_STORM_THRESHOLD = (1000ms / average duration of CMCI in ms) * x%

Then, making that x user-configurable would probably be fine too. It'll
basically allow users to say what percentage of time they'd want the
system to spend handling CMCIs before polling.

And, it'll have a sane, conservative default for the majority of people
who don't want to deal with this at all.

The usual conserns about exporting stuff to userspace apply, see below.

</thinking out loud>

> > Now, how we'll come up with an answer to that question is a whole
> > another story...
> 
> Right. If we can come up with an answer, that's great, but if we
> don't, I think we're better off exporting a nice knob and letting the
> user tune his system according to his needs.

Yeah, just remember that exporting all kinds of knobs means we're forced
to support it forever. So I'm very cautious with exposing anything to
userspace as it becomes an API and we're stuck with it.

> Just to throw another number out, how about doing CMCI storm polling
> at a fixed interval of 100 ms? Since check_interval is an integer
> representing a number of seconds, it can never get lower than 10x this
> number, so we won't need to restrict it any further.

Yep, this is basically the approach where we do find a static number
default for all machines out there. It could be a temporary solution ...

> If we see more than X CMCIs in a second, we switch to polling. If less
> than Y out of 10 polls see an error, we switch back to CMCI.
> 
> Now, we still leave 3 magic numbers to be figured out...but I think
> their range is somewhat more limited.

Makes sense.

So X will always be < 10, (== 10 means we automatically switch to
polling).

The Y could contain a historic aspect by setting it to some value and
decrementing it by one if we haven't seen an error and incrementing it
if we saw an error during the last poll. It will saturate at Y errors
and when it reaches 0, it will switch back to CMCI.

Hrrm, sounds interesting :)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values.
  2014-07-12  0:10                     ` Havard Skinnemoen
@ 2014-07-14 15:14                       ` Borislav Petkov
  0 siblings, 0 replies; 61+ messages in thread
From: Borislav Petkov @ 2014-07-14 15:14 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum, linux-edac

On Fri, Jul 11, 2014 at 05:10:07PM -0700, Havard Skinnemoen wrote:
> 200ms per second means we're using 20% of that CPU. I'd say that's
> definitely too much. But I like the general approach.

Right.

> > Yeah, by "generous" I meant, choose values which fit all. But I realize
> > now that this is a dumb idea. Maybe we could measure it on each system,
> > read the TSC on CMCI entry and exit and thus get an average CMCI
> > duration...
> 
> Sounds interesting. Some things that may need some more thought:
> 
> 1. What percentage of CPU is OK to use before we consider it a storm?

That is a very good question. Normally, when we don't know that answer,
we leave it user-configurable with a sane default :-)

But if we have to be realistic, anything above 20% of CPU time spent in
storm mode for prolonged periods of time would probably mean this system
needs to get scheduled for maintenance anyway.

The whole storm thing is basically showing that a system is about to
fail soon and we're trying to alleviate performance hit from too high
CMCI counts by switching to polling, i.e., prolonged, more graceful hw
fail. :-)

> 2. How do we map that number to polling mode, where we may not see all
> the errors? If we get it wrong, we may end up bouncing at a very high
> rate.

Well, with polling you're bound to miss some errors anyway.

> 3. If we go for a fixed polling rate, how do we make sure it doesn't
> require more CPU than what we determined in (1)?

Yeah, that's the disadvantage of fixed polling rate - we won't know.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-11 21:15                   ` Havard Skinnemoen
@ 2014-07-17 10:50                     ` Borislav Petkov
  2014-07-18 21:23                       ` Tony Luck
  0 siblings, 1 reply; 61+ messages in thread
From: Borislav Petkov @ 2014-07-17 10:50 UTC (permalink / raw)
  To: Havard Skinnemoen; +Cc: Tony Luck, Linux Kernel, Ewout van Bekkum

On Fri, Jul 11, 2014 at 02:15:49PM -0700, Havard Skinnemoen wrote:
> But the problem with shared banks is that multiple CPUs read them,
> and we can't tell if the errors are duplicate unless we make sure
> that only one CPU gets to read and clear it at a time. Any cpu-local
> synchronization mechanism isn't going to work.

Well, maybe it is about time we tracked shared banks.

> If you're worried about disabling interrupts, it's possible we don't
> really need to make the spinlocks irqsafe. I'm not sure if we had any
> reason for that other than "just to be safe".

Hmm, so machine_check_poll gets called mostly in irqs off context except
in the timer callback which runs in softirq context.

I hear hrtimer callbacks will be made to run *always* in interrupt
context so we could switch cmci to use an hrtimer at some point...

> Or we could keep the (irqsafe) spinlocks but move the clearing much
> earlier. There may have been a reason why the current code clears the
> bank status last though -- perhaps we also need to read out all the
> state while we hold the lock, before we clear the status bit.

... or, yeah, do what you do currently and disable IRQs but do all the
MSR accesses in there, after you've detected MCI_STATUS_VAL set. That
would make the critical section at least shorter even if we disable
interrupts for non-shared banks too.

The clearing of the MCI_STATUS_VAL bit is unrelated to the rest of the
MSRs in the MCA bank so you can read them all in one go in the critical
section.

We can evaluate later if the IRQs disabling is too heavy after all.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-17 10:50                     ` Borislav Petkov
@ 2014-07-18 21:23                       ` Tony Luck
  2014-07-18 21:31                         ` Borislav Petkov
  0 siblings, 1 reply; 61+ messages in thread
From: Tony Luck @ 2014-07-18 21:23 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum

On Thu, Jul 17, 2014 at 3:50 AM, Borislav Petkov <bp@alien8.de> wrote:
> Well, maybe it is about time we tracked shared banks.

For cpus that support CMCI and the MCi_CTL2 registers we do track
sharing. Only one cpu gets to be the "owner" of a bank that supports
CMCI (the first one to find it and set bit 30 in the CTL2 register).

The test_bit()  at the top of the loop in machine_check_poll() makes
sure only the owner of a bank actually looks at it.

        for (i = 0; i < mca_cfg.banks; i++) {
                if (!mce_banks[i].ctl || !test_bit(i, *b))
                        continue;

If we don't have CMCI, then we don't have the CTL2 registers, and
so have no way to find out which banks are shared.


> We can evaluate later if the IRQs disabling is too heavy after all.

I'd be surprised if it was a problem in practice. If we have  CMCI,
then we limit the banks that we look at (and if we see a high rate
of interrupts, then we turn off interrupts an poll).

If we don't have CMCI, then we are polling at a pretty low rate
(current code adjusts the rate higher if we are finding errors to
log, but we don't let that rate rise forever ... cap is ~ 1HZ).

-Tony

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports.
  2014-07-18 21:23                       ` Tony Luck
@ 2014-07-18 21:31                         ` Borislav Petkov
  0 siblings, 0 replies; 61+ messages in thread
From: Borislav Petkov @ 2014-07-18 21:31 UTC (permalink / raw)
  To: Tony Luck; +Cc: Havard Skinnemoen, Linux Kernel, Ewout van Bekkum

On Fri, Jul 18, 2014 at 02:23:04PM -0700, Tony Luck wrote:
> On Thu, Jul 17, 2014 at 3:50 AM, Borislav Petkov <bp@alien8.de> wrote:
> > Well, maybe it is about time we tracked shared banks.
> 
> For cpus that support CMCI and the MCi_CTL2 registers we do track
> sharing. Only one cpu gets to be the "owner" of a bank that supports
> CMCI (the first one to find it and set bit 30 in the CTL2 register).
> 
> The test_bit()  at the top of the loop in machine_check_poll() makes
> sure only the owner of a bank actually looks at it.
> 
>         for (i = 0; i < mca_cfg.banks; i++) {
>                 if (!mce_banks[i].ctl || !test_bit(i, *b))
>                         continue;
> 
> If we don't have CMCI, then we don't have the CTL2 registers, and
> so have no way to find out which banks are shared.

Ah, so Havard's corrected explanation was this:

"I don't think we got the description right here. I think the real issue
here was machine check polls happening on multiple CPUs with shared
banks, all reporting the same MCEs. This is very reproducible when
booting with mce=no_cmci, since all CPUs will handle all banks, and
there's AFAICT no good way to identify shared banks without enabling
CMCI."

Remind me, why would one boot with mce=no_cmci at all, on a CMCI
machine?

> I'd be surprised if it was a problem in practice. If we have  CMCI,
> then we limit the banks that we look at (and if we see a high rate
> of interrupts, then we turn off interrupts an poll).
>
> If we don't have CMCI, then we are polling at a pretty low rate
> (current code adjusts the rate higher if we are finding errors to
> log, but we don't let that rate rise forever ... cap is ~ 1HZ).

Right, it would be interesting to see how a huuge machine (4 sockets
with lotsa memory) behaves under a CMCI storm...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2014-07-18 21:32 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-09 17:09 [PATCH 0/6] x86 mce fixes Havard Skinnemoen
2014-07-09 17:09 ` [PATCH 1/6] x86-mce: Modify CMCI poll interval to adjust for small check_interval values Havard Skinnemoen
2014-07-09 19:17   ` Borislav Petkov
2014-07-09 21:24     ` Havard Skinnemoen
2014-07-10  9:01       ` Chen, Gong
2014-07-10 17:16         ` Havard Skinnemoen
2014-07-11  2:12           ` Chen, Gong
2014-07-10 11:42       ` Borislav Petkov
2014-07-10 17:51         ` Havard Skinnemoen
2014-07-10 18:55           ` Tony Luck
2014-07-10 22:45             ` Havard Skinnemoen
2014-07-11 15:35               ` Borislav Petkov
2014-07-11 18:56                 ` Havard Skinnemoen
2014-07-11 20:10                   ` Borislav Petkov
2014-07-11 20:39                     ` Havard Skinnemoen
2014-07-14 14:57                       ` Borislav Petkov
2014-07-11 20:22                   ` Borislav Petkov
2014-07-12  0:10                     ` Havard Skinnemoen
2014-07-14 15:14                       ` Borislav Petkov
2014-07-11 20:36                   ` Borislav Petkov
2014-07-11 21:05                     ` Havard Skinnemoen
2014-07-09 17:09 ` [PATCH 2/6] x86-mce: Modify CMCI storm exit to reenable instead of rediscover banks Havard Skinnemoen
2014-07-09 20:20   ` Luck, Tony
2014-07-09 21:34     ` Havard Skinnemoen
2014-07-10 15:51       ` Borislav Petkov
2014-07-10 18:32         ` Havard Skinnemoen
2014-07-09 17:09 ` [PATCH 3/6] x86-mce: Clear CMCI enable on all claimed CMCI banks before reboot Havard Skinnemoen
2014-07-09 20:36   ` Luck, Tony
2014-07-09 21:40     ` Havard Skinnemoen
2014-07-10 16:24       ` Borislav Petkov
2014-07-10 16:33         ` Tony Luck
2014-07-10 17:56         ` Havard Skinnemoen
2014-07-10 18:27           ` Tony Luck
2014-07-10 18:30           ` Borislav Petkov
2014-07-09 17:09 ` [PATCH 4/6] x86-mce: Add spinlocks to prevent duplicated MCP and CMCI reports Havard Skinnemoen
2014-07-09 20:35   ` Andi Kleen
2014-07-09 21:51     ` Havard Skinnemoen
2014-07-09 23:32       ` Luck, Tony
2014-07-10  8:16         ` Borislav Petkov
2014-07-09 20:47   ` Luck, Tony
2014-07-09 21:56     ` Havard Skinnemoen
2014-07-10 16:41   ` Borislav Petkov
2014-07-10 18:03     ` Havard Skinnemoen
2014-07-10 18:44       ` Borislav Petkov
2014-07-10 18:57         ` Tony Luck
2014-07-10 19:12           ` Borislav Petkov
2014-07-11  9:24             ` Borislav Petkov
2014-07-11 19:06               ` Tony Luck
2014-07-11 19:52                 ` Borislav Petkov
2014-07-11 21:15                   ` Havard Skinnemoen
2014-07-17 10:50                     ` Borislav Petkov
2014-07-18 21:23                       ` Tony Luck
2014-07-18 21:31                         ` Borislav Petkov
2014-07-09 17:09 ` [PATCH 5/6] x86-mce: check if no_way_out applies before deciding not to clear MCE banks Havard Skinnemoen
2014-07-09 21:00   ` Luck, Tony
2014-07-09 23:00     ` Havard Skinnemoen
2014-07-09 23:27       ` Luck, Tony
2014-07-10 16:49         ` Borislav Petkov
2014-07-09 17:09 ` [PATCH 6/6] x86-mce: ensure the MCP timer is not already set in the mce_timer_fn Havard Skinnemoen
2014-07-09 21:04   ` Luck, Tony
2014-07-09 23:01     ` Havard Skinnemoen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.