* Re: [patch 0/5] x86: mce: Bugfixes, cleanups and a new CMCI poll version
2012-06-08 7:49 ` Thomas Gleixner
@ 2012-06-11 5:46 ` Chen Gong
2012-06-11 6:09 ` Chen Gong
2012-06-14 13:49 ` [PATCH] tmp patch to fix hotplug issue in CMCI storm Chen Gong
2 siblings, 0 replies; 33+ messages in thread
From: Chen Gong @ 2012-06-11 5:46 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: LKML, Tony Luck, Borislav Petkov, x86, Peter Zijlstra
于 2012/6/8 15:49, Thomas Gleixner 写道:
> On Thu, 7 Jun 2012, Chen Gong wrote:
>>
>> But during the CPU online/offline test I found an issue. After *STORM*
>> mode is entered, it can't come back from *STORM* mode to normal
>> interrupt mode. At least there exists such an issue: when *STORM* is
>> entered, in the meanwhile, one CPU is offline during this period,
>> which means *cmci_storm_on_cpus* can't decrease to 0 because there
>> is one bit stuck on this offlined CPU. So we should detect such
>> situation and decrease on *cmci_storm_on_cpus* at proper time.
>
> Yes, we need to reset the storm state as well I think.
>
>> BTW, even I online the *CPU* in above situation, the normal CMCI
>> still doesn't come back, strange.
>
> That's weird.
Here is the reason. When CPU offline and then online, the CMCI is
reenabled, which means it opens a backdoor to enable *CMCI storm detect*
again. The result is the counter cmci_sotrm_cnt is increased again
before it decreases to zero. At last, this counter will never be back
to zero, irrelevant to CPU online. Obviously, we must stop CMCI enabling
again before CMCI storm ends, even during CPU online/offline. Besides
this, we still need to remove the count if one CPU is offlined. Like
below tmp patch I wrote (not guarantee it is OK :-))
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h
b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 2cd73ce..6a05c1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,9 +31,11 @@ extern struct mce_bank *mce_banks;
#ifdef CONFIG_X86_MCE_INTEL
unsigned long mce_intel_adjust_timer(unsigned long interval);
void mce_intel_cmci_poll(void);
+void mce_intel_hcpu_update(unsigned long cpu);
#else
# define mce_intel_adjust_timer mce_adjust_timer_default
static inline void mce_intel_cmci_poll(void) { }
+static inline void mce_intel_hcpu_update(unsigned long cpu) { }
#endif
void mce_timer_kick(unsigned long interval);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c
b/arch/x86/kernel/cpu/mcheck/mce.c
index e3f8b94..5e22d99 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2306,6 +2306,7 @@ mce_cpu_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
if (threshold_cpu_callback)
threshold_cpu_callback(action, cpu);
mce_device_remove(cpu);
+ mce_intel_hcpu_update(cpu);
break;
case CPU_DOWN_PREPARE:
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c
b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 92d8b5c..0051b2d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
+static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
enum {
CMCI_STORM_NONE,
@@ -47,6 +48,11 @@ enum {
CMCI_STORM_SUBSIDED,
};
+enum {
+ CMCI_STORM_HCPU_NONE,
+ CMCI_STORM_HCPU_ACTIVE,
+};
+
static atomic_t cmci_storm_on_cpus;
static int cmci_supported(int *banks)
@@ -77,6 +83,17 @@ void mce_intel_cmci_poll(void)
machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
}
+void mce_intel_hcpu_update(unsigned long cpu)
+{
+ unsigned long *status = &per_cpu(cmci_storm_hcpu_status, cpu);
+
+ if (*status == CMCI_STORM_HCPU_ACTIVE) {
+ *status = CMCI_STORM_HCPU_NONE;
+ atomic_dec(&cmci_storm_on_cpus);
+ per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
+ }
+}
+
unsigned long mce_intel_adjust_timer(unsigned long interval)
{
if (interval < CMCI_POLL_INTERVAL)
@@ -132,6 +149,7 @@ static bool cmci_storm_detect(void)
cmci_clear();
__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
+ __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_ACTIVE);
atomic_inc(&cmci_storm_on_cpus);
mce_timer_kick(CMCI_POLL_INTERVAL);
return true;
>
>> I still have another question: When we handle following case:
>> mce_cpu_callback(struct notifier_block *
>> mce_device_remove(cpu);
>> break;
>> case CPU_DOWN_PREPARE:
>> - del_timer_sync(t);
>> smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
>> + del_timer_sync(t);
>> break;
>>
>> Where we add this timer back? I can't find it in "case CPU_ONLINE".
>
> The timer gets added back via mcheck_cpu_init(), which is called on
> the newly onlined cpu from smp_callin().
Stupid me! I must too busy that day so that I lost my head. :-(.
Thanks Tomas for your explanation!
>
> Thanks,
>
> tglx
>
>
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [patch 0/5] x86: mce: Bugfixes, cleanups and a new CMCI poll version
2012-06-08 7:49 ` Thomas Gleixner
2012-06-11 5:46 ` Chen Gong
@ 2012-06-11 6:09 ` Chen Gong
2012-06-14 13:49 ` [PATCH] tmp patch to fix hotplug issue in CMCI storm Chen Gong
2 siblings, 0 replies; 33+ messages in thread
From: Chen Gong @ 2012-06-11 6:09 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: LKML, Tony Luck, Borislav Petkov, x86, Peter Zijlstra
I don't know why my reply format is changed in the previous mail, I have
to resend it again, sorry.
于 2012/6/8 15:49, Thomas Gleixner 写道:
> On Thu, 7 Jun 2012, Chen Gong wrote:
>>
>> But during the CPU online/offline test I found an issue. After *STORM*
>> mode is entered, it can't come back from *STORM* mode to normal
>> interrupt mode. At least there exists such an issue: when *STORM* is
>> entered, in the meanwhile, one CPU is offline during this period,
>> which means *cmci_storm_on_cpus* can't decrease to 0 because there
>> is one bit stuck on this offlined CPU. So we should detect such
>> situation and decrease on *cmci_storm_on_cpus* at proper time.
>
> Yes, we need to reset the storm state as well I think.
>
>> BTW, even I online the *CPU* in above situation, the normal CMCI
>> still doesn't come back, strange.
>
> That's weird.
Here is the reason. When CPU offline and then online, the CMCI is
reenabled, which means it opens a backdoor to enable *CMCI storm detect*
again. The result is the counter cmci_sotrm_cnt is increased again
before it decreases to zero. At last, this counter will never be back
to zero, irrelevant to CPU online. Obviously, we must stop CMCI enabling
again before CMCI storm ends, even during CPU online/offline. Besides
this, we still need to remove the count if one CPU is offlined. Like
below tmp patch I wrote (not guarantee it is OK :-))
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h
b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 2cd73ce..6a05c1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,9 +31,11 @@ extern struct mce_bank *mce_banks;
#ifdef CONFIG_X86_MCE_INTEL
unsigned long mce_intel_adjust_timer(unsigned long interval);
void mce_intel_cmci_poll(void);
+void mce_intel_hcpu_update(unsigned long cpu);
#else
# define mce_intel_adjust_timer mce_adjust_timer_default
static inline void mce_intel_cmci_poll(void) { }
+static inline void mce_intel_hcpu_update(unsigned long cpu) { }
#endif
void mce_timer_kick(unsigned long interval);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c
b/arch/x86/kernel/cpu/mcheck/mce.c
index e3f8b94..5e22d99 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2306,6 +2306,7 @@ mce_cpu_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
if (threshold_cpu_callback)
threshold_cpu_callback(action, cpu);
mce_device_remove(cpu);
+ mce_intel_hcpu_update(cpu);
break;
case CPU_DOWN_PREPARE:
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c
b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 92d8b5c..0051b2d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
+static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
enum {
CMCI_STORM_NONE,
@@ -47,6 +48,11 @@ enum {
CMCI_STORM_SUBSIDED,
};
+enum {
+ CMCI_STORM_HCPU_NONE,
+ CMCI_STORM_HCPU_ACTIVE,
+};
+
static atomic_t cmci_storm_on_cpus;
static int cmci_supported(int *banks)
@@ -77,6 +83,17 @@ void mce_intel_cmci_poll(void)
machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
}
+void mce_intel_hcpu_update(unsigned long cpu)
+{
+ unsigned long *status = &per_cpu(cmci_storm_hcpu_status, cpu);
+
+ if (*status == CMCI_STORM_HCPU_ACTIVE) {
+ *status = CMCI_STORM_HCPU_NONE;
+ atomic_dec(&cmci_storm_on_cpus);
+ per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
+ }
+}
+
unsigned long mce_intel_adjust_timer(unsigned long interval)
{
if (interval < CMCI_POLL_INTERVAL)
@@ -132,6 +149,7 @@ static bool cmci_storm_detect(void)
cmci_clear();
__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
+ __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_ACTIVE);
atomic_inc(&cmci_storm_on_cpus);
mce_timer_kick(CMCI_POLL_INTERVAL);
return true;
>
>> I still have another question: When we handle following case:
>> mce_cpu_callback(struct notifier_block *
>> mce_device_remove(cpu);
>> break;
>> case CPU_DOWN_PREPARE:
>> - del_timer_sync(t);
>> smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
>> + del_timer_sync(t);
>> break;
>>
>> Where we add this timer back? I can't find it in "case CPU_ONLINE".
>
> The timer gets added back via mcheck_cpu_init(), which is called on
> the newly onlined cpu from smp_callin().
>
Stupid me! I must too busy that day so that I lost my head. :-(.
Thanks Tomas for your explanation!
> Thanks,
>
> tglx
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH] tmp patch to fix hotplug issue in CMCI storm
2012-06-08 7:49 ` Thomas Gleixner
2012-06-11 5:46 ` Chen Gong
2012-06-11 6:09 ` Chen Gong
@ 2012-06-14 13:49 ` Chen Gong
2012-06-14 14:07 ` Thomas Gleixner
2 siblings, 1 reply; 33+ messages in thread
From: Chen Gong @ 2012-06-14 13:49 UTC (permalink / raw)
To: tglx; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel, Chen Gong
this patch is based on tip tree and previous 5 patches.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
---
arch/x86/kernel/cpu/mcheck/mce-internal.h | 2 +
arch/x86/kernel/cpu/mcheck/mce.c | 1 +
arch/x86/kernel/cpu/mcheck/mce_intel.c | 49 ++++++++++++++++++++++++++++-
3 files changed, 51 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 2cd73ce..6a05c1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,9 +31,11 @@ extern struct mce_bank *mce_banks;
#ifdef CONFIG_X86_MCE_INTEL
unsigned long mce_intel_adjust_timer(unsigned long interval);
void mce_intel_cmci_poll(void);
+void mce_intel_hcpu_update(unsigned long cpu);
#else
# define mce_intel_adjust_timer mce_adjust_timer_default
static inline void mce_intel_cmci_poll(void) { }
+static inline void mce_intel_hcpu_update(unsigned long cpu) { }
#endif
void mce_timer_kick(unsigned long interval);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e3f8b94..5e22d99 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2306,6 +2306,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
if (threshold_cpu_callback)
threshold_cpu_callback(action, cpu);
mce_device_remove(cpu);
+ mce_intel_hcpu_update(cpu);
break;
case CPU_DOWN_PREPARE:
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 92d8b5c..ef687df 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
+static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
enum {
CMCI_STORM_NONE,
@@ -47,6 +48,12 @@ enum {
CMCI_STORM_SUBSIDED,
};
+enum {
+ CMCI_STORM_HCPU_NONE,
+ CMCI_STORM_HCPU_ACTIVE,
+ CMCI_STORM_HCPU_SUBSIDED,
+};
+
static atomic_t cmci_storm_on_cpus;
static int cmci_supported(int *banks)
@@ -77,6 +84,17 @@ void mce_intel_cmci_poll(void)
machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
}
+void mce_intel_hcpu_update(unsigned long cpu)
+{
+ unsigned long *status = &per_cpu(cmci_storm_hcpu_status, cpu);
+
+ if (*status == CMCI_STORM_HCPU_ACTIVE) {
+ per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
+ *status = CMCI_STORM_HCPU_SUBSIDED;
+ atomic_dec(&cmci_storm_on_cpus);
+ }
+}
+
unsigned long mce_intel_adjust_timer(unsigned long interval)
{
if (interval < CMCI_POLL_INTERVAL)
@@ -90,6 +108,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
* timer interval is back to our poll interval.
*/
__this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
+ __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_NONE);
atomic_dec(&cmci_storm_on_cpus);
case CMCI_STORM_SUBSIDED:
@@ -109,6 +128,21 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
* We have shiny wheather, let the poll do whatever it
* thinks.
*/
+
+ /*
+ * If one CPU is offlined and onlined during CMCI storm, it
+ * has no chance to enable CMCI again. Here is the portal.
+ */
+ if ((!atomic_read(&cmci_storm_on_cpus)) &&
+ (CMCI_STORM_HCPU_SUBSIDED ==
+ __this_cpu_read(cmci_storm_hcpu_status))) {
+ __this_cpu_write(cmci_storm_hcpu_status,
+ CMCI_STORM_HCPU_NONE);
+ cmci_reenable();
+ apic_write(APIC_LVTCMCI,
+ THRESHOLD_APIC_VECTOR|APIC_DM_FIXED);
+ cmci_recheck();
+ }
return interval;
}
}
@@ -132,6 +166,7 @@ static bool cmci_storm_detect(void)
cmci_clear();
__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
+ __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_ACTIVE);
atomic_inc(&cmci_storm_on_cpus);
mce_timer_kick(CMCI_POLL_INTERVAL);
return true;
@@ -259,7 +294,9 @@ void cmci_rediscover(int dying)
int cpu;
cpumask_var_t old;
- if (!cmci_supported(&banks))
+ if (!cmci_supported(&banks) ||
+ /* if still in CMCI storm, don't enable it */
+ (0 != atomic_read(&cmci_storm_on_cpus)))
return;
if (!alloc_cpumask_var(&old, GFP_KERNEL))
return;
@@ -297,6 +334,16 @@ static void intel_init_cmci(void)
return;
mce_threshold_vector = intel_threshold_interrupt;
+ /* if still in CMCI storm, don't enable it */
+ if (0 != atomic_read(&cmci_storm_on_cpus))
+ return;
+ /*
+ * If one CPU is offlined during CMCI storm and onlined after
+ * CMCI storm, this *hCPU* status must be updated to avoid
+ * to reenable CMCI twice.
+ */
+ __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_NONE);
+
cmci_discover(banks, 1);
/*
* For CPU #0 this runs with still disabled APIC, but that's
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH] tmp patch to fix hotplug issue in CMCI storm
2012-06-14 13:49 ` [PATCH] tmp patch to fix hotplug issue in CMCI storm Chen Gong
@ 2012-06-14 14:07 ` Thomas Gleixner
2012-06-15 6:51 ` Chen Gong
0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2012-06-14 14:07 UTC (permalink / raw)
To: Chen Gong; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
On Thu, 14 Jun 2012, Chen Gong wrote:
> this patch is based on tip tree and previous 5 patches.
You really don't need all this complexity to handle that. The main
thing is that you clear the storm state and adjust the storm counter
when the cpu goes offline (in case the state is ACTIVE).
When it comes online again then you can simply let it restart cmci. If
the storm on this cpu (or node) still exists then it will notice and
everything falls in place.
Keep it simple, really!
Thanks,
tglx
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] tmp patch to fix hotplug issue in CMCI storm
2012-06-14 14:07 ` Thomas Gleixner
@ 2012-06-15 6:51 ` Chen Gong
2012-06-15 9:55 ` Thomas Gleixner
0 siblings, 1 reply; 33+ messages in thread
From: Chen Gong @ 2012-06-15 6:51 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
于 2012/6/14 22:07, Thomas Gleixner 写道:
> On Thu, 14 Jun 2012, Chen Gong wrote:
>> this patch is based on tip tree and previous 5 patches.
>
> You really don't need all this complexity to handle that. The main
> thing is that you clear the storm state and adjust the storm counter
> when the cpu goes offline (in case the state is ACTIVE).
>
> When it comes online again then you can simply let it restart cmci. If
> the storm on this cpu (or node) still exists then it will notice and
> everything falls in place.
I ever tested some different scenarios, if storm on this cpu still
exists, it triggers the CMCI and broadcast it on the sibling CPU,
which means the counter *cmci_storm_on_cpus* will increase beyond
the upper limit. E.g. on a 2 sockets SandyBridge-EP system (one socket
has 8 cores and 16 threads), inject one error on one socket, you can
watch *cmci_storm_on_cpus* = 16 becuase of CMCI broadcast, during
this time, offline and online one CPU on this socket, firstly
*cmci_storm_on_cpus* = 15 because of offline and ACTIVE status, and then
*cmci_storm_on_cpus* = 31 in that CMCI is actived because of
online.That's why I have to disable CMCI during whole online/offline
until CMCI storm is subsided. Frankly, the logic is a little bit
complex so that I write many comments to avoid I forget it after some
time :-)
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] tmp patch to fix hotplug issue in CMCI storm
2012-06-15 6:51 ` Chen Gong
@ 2012-06-15 9:55 ` Thomas Gleixner
2012-06-18 6:42 ` Chen Gong
2012-06-18 6:45 ` [PATCH V2] " Chen Gong
0 siblings, 2 replies; 33+ messages in thread
From: Thomas Gleixner @ 2012-06-15 9:55 UTC (permalink / raw)
To: Chen Gong; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 2778 bytes --]
On Fri, 15 Jun 2012, Chen Gong wrote:
> 于 2012/6/14 22:07, Thomas Gleixner 写道:
> > On Thu, 14 Jun 2012, Chen Gong wrote:
> > > this patch is based on tip tree and previous 5 patches.
> >
> > You really don't need all this complexity to handle that. The main
> > thing is that you clear the storm state and adjust the storm counter
> > when the cpu goes offline (in case the state is ACTIVE).
> >
> > When it comes online again then you can simply let it restart cmci. If
> > the storm on this cpu (or node) still exists then it will notice and
> > everything falls in place.
>
> I ever tested some different scenarios, if storm on this cpu still
> exists, it triggers the CMCI and broadcast it on the sibling CPU,
> which means the counter *cmci_storm_on_cpus* will increase beyond
> the upper limit. E.g. on a 2 sockets SandyBridge-EP system (one socket
> has 8 cores and 16 threads), inject one error on one socket, you can
> watch *cmci_storm_on_cpus* = 16 becuase of CMCI broadcast, during
> this time, offline and online one CPU on this socket, firstly
> *cmci_storm_on_cpus* = 15 because of offline and ACTIVE status, and then
> *cmci_storm_on_cpus* = 31 in that CMCI is actived because of
> online.That's why I have to disable CMCI during whole online/offline
> until CMCI storm is subsided. Frankly, the logic is a little bit
> complex so that I write many comments to avoid I forget it after some
> time :-)
This does not make any sense at all.
What you are saying is that even if CPU0 run cmci_clear() the CMCI
raised on CPU1 will cause the CMCI vector to be triggered on CPU0.
So how does the whole storm machinery work in the following case:
CPU0 CPU1
cmci incoming cmci incoming
storm detected no storm detected yet
cmci_clear()
switch to poll
cmci raised
So according to your explanation that would cause the cmci vector to
be broadcasted to CPU0 as well. Now that would cause the counter to
get a bogus increment, right ?
So instead of hacking insane crap into the code, we have simply to do
the obvious Right Thing:
Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -119,6 +119,9 @@ static bool cmci_storm_detect(void)
unsigned long ts = __this_cpu_read(cmci_time_stamp);
unsigned long now = jiffies;
+ if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
+ return true;
+
if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
cnt++;
} else {
That will prevent damage under all circumstances, cpu hotplug
included.
But that's too simple and comprehensible I fear.
Thanks,
tglx
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH] tmp patch to fix hotplug issue in CMCI storm
2012-06-15 9:55 ` Thomas Gleixner
@ 2012-06-18 6:42 ` Chen Gong
2012-06-18 6:45 ` [PATCH V2] " Chen Gong
1 sibling, 0 replies; 33+ messages in thread
From: Chen Gong @ 2012-06-18 6:42 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
[...]
>
> So according to your explanation that would cause the cmci vector to
> be broadcasted to CPU0 as well. Now that would cause the counter to
> get a bogus increment, right ?
>
> So instead of hacking insane crap into the code, we have simply to do
> the obvious Right Thing:
>
> Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -119,6 +119,9 @@ static bool cmci_storm_detect(void)
> unsigned long ts = __this_cpu_read(cmci_time_stamp);
> unsigned long now = jiffies;
>
> + if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
> + return true;
> +
> if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
> cnt++;
> } else {
>
> That will prevent damage under all circumstances, cpu hotplug
> included.
>
> But that's too simple and comprehensible I fear.
>
Oh, yes, your suggestion can simplify my logic, but not all. Obviously,
when hotplug is happened, it can be considered quitting CMCI storm in
another way, so the corresponding counter and status should be
decreased from that path. And in my original patch, I defined three
status:
enum {
CMCI_STORM_HCPU_NONE,
CMCI_STORM_HCPU_ACTIVE,
CMCI_STORM_HCPU_SUBSIDED,
};
I use CMCI_STORM_HCPU_SUBSIDED to descirbe stalled CPU in hotplug
progressI to stop CMCI enable during whole hotplog status, but
according to your suggestion, now the CMCI_STORM_HCPU_SUBSIDED can
be removed (FIXME: because CMCI can be enabled, if CPU offline and
then online again during CMCI storm, it will enter CMCI storm status
right now. It can simplify the logic, but a little bit performance
degradation).
I will send the 2nd patch based on previous 5 patches in a separate
mail.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH V2] tmp patch to fix hotplug issue in CMCI storm
2012-06-15 9:55 ` Thomas Gleixner
2012-06-18 6:42 ` Chen Gong
@ 2012-06-18 6:45 ` Chen Gong
2012-06-18 8:00 ` Thomas Gleixner
1 sibling, 1 reply; 33+ messages in thread
From: Chen Gong @ 2012-06-18 6:45 UTC (permalink / raw)
To: tglx; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel, Chen Gong
This patch is based on previous 5 patches and tuned based on
Thomas' suggestion.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
---
arch/x86/kernel/cpu/mcheck/mce-internal.h | 2 ++
arch/x86/kernel/cpu/mcheck/mce.c | 1 +
arch/x86/kernel/cpu/mcheck/mce_intel.c | 22 ++++++++++++++++++++++
3 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 2cd73ce..6a05c1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,9 +31,11 @@ extern struct mce_bank *mce_banks;
#ifdef CONFIG_X86_MCE_INTEL
unsigned long mce_intel_adjust_timer(unsigned long interval);
void mce_intel_cmci_poll(void);
+void mce_intel_hcpu_update(unsigned long cpu);
#else
# define mce_intel_adjust_timer mce_adjust_timer_default
static inline void mce_intel_cmci_poll(void) { }
+static inline void mce_intel_hcpu_update(unsigned long cpu) { }
#endif
void mce_timer_kick(unsigned long interval);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e3f8b94..5e22d99 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2306,6 +2306,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
if (threshold_cpu_callback)
threshold_cpu_callback(action, cpu);
mce_device_remove(cpu);
+ mce_intel_hcpu_update(cpu);
break;
case CPU_DOWN_PREPARE:
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 92d8b5c..0493525 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
+static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
enum {
CMCI_STORM_NONE,
@@ -47,6 +48,11 @@ enum {
CMCI_STORM_SUBSIDED,
};
+enum {
+ CMCI_STORM_HCPU_NONE,
+ CMCI_STORM_HCPU_ACTIVE,
+};
+
static atomic_t cmci_storm_on_cpus;
static int cmci_supported(int *banks)
@@ -77,6 +83,17 @@ void mce_intel_cmci_poll(void)
machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
}
+void mce_intel_hcpu_update(unsigned long cpu)
+{
+ unsigned long *status = &per_cpu(cmci_storm_hcpu_status, cpu);
+
+ if (*status == CMCI_STORM_HCPU_ACTIVE) {
+ per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
+ *status = CMCI_STORM_HCPU_NONE;
+ atomic_dec(&cmci_storm_on_cpus);
+ }
+}
+
unsigned long mce_intel_adjust_timer(unsigned long interval)
{
if (interval < CMCI_POLL_INTERVAL)
@@ -90,6 +107,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
* timer interval is back to our poll interval.
*/
__this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
+ __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_NONE);
atomic_dec(&cmci_storm_on_cpus);
case CMCI_STORM_SUBSIDED:
@@ -119,6 +137,9 @@ static bool cmci_storm_detect(void)
unsigned long ts = __this_cpu_read(cmci_time_stamp);
unsigned long now = jiffies;
+ if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
+ return true;
+
if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
cnt++;
} else {
@@ -132,6 +153,7 @@ static bool cmci_storm_detect(void)
cmci_clear();
__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
+ __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_ACTIVE);
atomic_inc(&cmci_storm_on_cpus);
mce_timer_kick(CMCI_POLL_INTERVAL);
return true;
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH V2] tmp patch to fix hotplug issue in CMCI storm
2012-06-18 6:45 ` [PATCH V2] " Chen Gong
@ 2012-06-18 8:00 ` Thomas Gleixner
2012-06-18 10:13 ` Chen Gong
0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2012-06-18 8:00 UTC (permalink / raw)
To: Chen Gong; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
On Mon, 18 Jun 2012, Chen Gong wrote:
> index 92d8b5c..0493525 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
> static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
> static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
> static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
> +static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
Why do you insist on having another status variable, which does
actually nothing than obfuscate the code?
Look at the usage sites:
> __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
> + __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_NONE);
> __this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
> + __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_ACTIVE);
So it's a shadow variable of cmci_storm_state for no value.
And all you do with it is:
> +void mce_intel_hcpu_update(unsigned long cpu)
> +{
> + unsigned long *status = &per_cpu(cmci_storm_hcpu_status, cpu);
> +
> + if (*status == CMCI_STORM_HCPU_ACTIVE) {
This can be checked with the existing variable as well. And your check
leaves CMCI_STORM_SUBSIDED as a stale value around.
This simply wants to check
if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
atomic_dec(&cmci_storm_on_cpus);
and unconditionally clear the state
per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
Right?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH V2] tmp patch to fix hotplug issue in CMCI storm
2012-06-18 8:00 ` Thomas Gleixner
@ 2012-06-18 10:13 ` Chen Gong
2012-06-18 12:23 ` Thomas Gleixner
0 siblings, 1 reply; 33+ messages in thread
From: Chen Gong @ 2012-06-18 10:13 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
于 2012/6/18 16:00, Thomas Gleixner 写道:
> On Mon, 18 Jun 2012, Chen Gong wrote:
>> index 92d8b5c..0493525 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
>> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
>> @@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
>> static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
>> static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
>> static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
>> +static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
>
> Why do you insist on having another status variable, which does
> actually nothing than obfuscate the code?
>
> Look at the usage sites:
>
>> __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
>> + __this_cpu_write(cmci_storm_hcpu_status, CMCI_STORM_HCPU_NONE);
Because here I can't substitute CMCI_STORM_HCPU_NONE with
CMCI_STORM_SUBSIDED. If so, the status is chaos.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH V2] tmp patch to fix hotplug issue in CMCI storm
2012-06-18 10:13 ` Chen Gong
@ 2012-06-18 12:23 ` Thomas Gleixner
2012-06-19 6:05 ` Chen Gong
2012-06-19 6:09 ` [PATCH V3] " Chen Gong
0 siblings, 2 replies; 33+ messages in thread
From: Thomas Gleixner @ 2012-06-18 12:23 UTC (permalink / raw)
To: Chen Gong; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1098 bytes --]
On Mon, 18 Jun 2012, Chen Gong wrote:
> 于 2012/6/18 16:00, Thomas Gleixner 写道:
> > On Mon, 18 Jun 2012, Chen Gong wrote:
> > > index 92d8b5c..0493525 100644
> > > --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> > > +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> > > @@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
> > > static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
> > > static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
> > > static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
> > > +static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
> >
> > Why do you insist on having another status variable, which does
> > actually nothing than obfuscate the code?
> >
> > Look at the usage sites:
> >
> > > __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
> > > + __this_cpu_write(cmci_storm_hcpu_status,
> > > CMCI_STORM_HCPU_NONE);
>
> Because here I can't substitute CMCI_STORM_HCPU_NONE with
> CMCI_STORM_SUBSIDED. If so, the status is chaos.
Have you actually read what I wrote later? You do not neeed that extra
state, period.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH V2] tmp patch to fix hotplug issue in CMCI storm
2012-06-18 12:23 ` Thomas Gleixner
@ 2012-06-19 6:05 ` Chen Gong
2012-06-19 6:09 ` [PATCH V3] " Chen Gong
1 sibling, 0 replies; 33+ messages in thread
From: Chen Gong @ 2012-06-19 6:05 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: tony.luck, borislav.petkov, x86, peterz, linux-kernel
于 2012/6/18 20:23, Thomas Gleixner 写道:
> On Mon, 18 Jun 2012, Chen Gong wrote:
>> 于 2012/6/18 16:00, Thomas Gleixner 写道:
>>> On Mon, 18 Jun 2012, Chen Gong wrote:
>>>> index 92d8b5c..0493525 100644
>>>> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
>>>> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
>>>> @@ -40,6 +40,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
>>>> static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
>>>> static DEFINE_PER_CPU(unsigned int, cmci_storm_cnt);
>>>> static DEFINE_PER_CPU(unsigned int, cmci_storm_state);
>>>> +static DEFINE_PER_CPU(unsigned long, cmci_storm_hcpu_status);
>>>
>>> Why do you insist on having another status variable, which does
>>> actually nothing than obfuscate the code?
>>>
>>> Look at the usage sites:
>>>
>>>> __this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
>>>> + __this_cpu_write(cmci_storm_hcpu_status,
>>>> CMCI_STORM_HCPU_NONE);
>>
>> Because here I can't substitute CMCI_STORM_HCPU_NONE with
>> CMCI_STORM_SUBSIDED. If so, the status is chaos.
>
> Have you actually read what I wrote later? You do not neeed that extra
> state, period.
>
Oh, yes, you are right. I really don't take enough time on your reply.
I apology for it. Unconditionally clearing the state can avoid stale
CMCI_STORM_SUBSIDED status. Your logic is fine and simply my previous
unnecessarily complicated logic. Thanks a lot.
I will send the V3 tmp patch based on your previous 5 patches right now.
If OK, you can merge it into your 5th patch and I can write the comment
for this patch if you are OK.
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH V3] tmp patch to fix hotplug issue in CMCI storm
2012-06-18 12:23 ` Thomas Gleixner
2012-06-19 6:05 ` Chen Gong
@ 2012-06-19 6:09 ` Chen Gong
2012-07-04 8:12 ` Chen Gong
1 sibling, 1 reply; 33+ messages in thread
From: Chen Gong @ 2012-06-19 6:09 UTC (permalink / raw)
To: tglx; +Cc: tony.luck, borislav.petkov, peterz, x86, linux-kernel, Chen Gong
v3->v1
Thanks very much for Thomas' suggestion to simply the whole logic.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
---
arch/x86/kernel/cpu/mcheck/mce-internal.h | 2 ++
arch/x86/kernel/cpu/mcheck/mce.c | 1 +
arch/x86/kernel/cpu/mcheck/mce_intel.c | 11 +++++++++++
3 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 2cd73ce..6a05c1d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,9 +31,11 @@ extern struct mce_bank *mce_banks;
#ifdef CONFIG_X86_MCE_INTEL
unsigned long mce_intel_adjust_timer(unsigned long interval);
void mce_intel_cmci_poll(void);
+void mce_intel_hcpu_update(unsigned long cpu);
#else
# define mce_intel_adjust_timer mce_adjust_timer_default
static inline void mce_intel_cmci_poll(void) { }
+static inline void mce_intel_hcpu_update(unsigned long cpu) { }
#endif
void mce_timer_kick(unsigned long interval);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e3f8b94..5e22d99 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2306,6 +2306,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
if (threshold_cpu_callback)
threshold_cpu_callback(action, cpu);
mce_device_remove(cpu);
+ mce_intel_hcpu_update(cpu);
break;
case CPU_DOWN_PREPARE:
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index 92d8b5c..693bc7d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -77,6 +77,14 @@ void mce_intel_cmci_poll(void)
machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
}
+void mce_intel_hcpu_update(unsigned long cpu)
+{
+ if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
+ atomic_dec(&cmci_storm_on_cpus);
+
+ per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
+}
+
unsigned long mce_intel_adjust_timer(unsigned long interval)
{
if (interval < CMCI_POLL_INTERVAL)
@@ -119,6 +127,9 @@ static bool cmci_storm_detect(void)
unsigned long ts = __this_cpu_read(cmci_time_stamp);
unsigned long now = jiffies;
+ if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
+ return true;
+
if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
cnt++;
} else {
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH V3] tmp patch to fix hotplug issue in CMCI storm
2012-06-19 6:09 ` [PATCH V3] " Chen Gong
@ 2012-07-04 8:12 ` Chen Gong
2012-07-16 3:16 ` Chen Gong
0 siblings, 1 reply; 33+ messages in thread
From: Chen Gong @ 2012-07-04 8:12 UTC (permalink / raw)
To: Chen Gong; +Cc: tglx, tony.luck, borislav.petkov, peterz, x86, linux-kernel
于 2012/6/19 14:09, Chen Gong 写道:
> v3->v1
> Thanks very much for Thomas' suggestion to simply the whole logic.
>
> Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
> ---
> arch/x86/kernel/cpu/mcheck/mce-internal.h | 2 ++
> arch/x86/kernel/cpu/mcheck/mce.c | 1 +
> arch/x86/kernel/cpu/mcheck/mce_intel.c | 11 +++++++++++
> 3 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index 2cd73ce..6a05c1d 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -31,9 +31,11 @@ extern struct mce_bank *mce_banks;
> #ifdef CONFIG_X86_MCE_INTEL
> unsigned long mce_intel_adjust_timer(unsigned long interval);
> void mce_intel_cmci_poll(void);
> +void mce_intel_hcpu_update(unsigned long cpu);
> #else
> # define mce_intel_adjust_timer mce_adjust_timer_default
> static inline void mce_intel_cmci_poll(void) { }
> +static inline void mce_intel_hcpu_update(unsigned long cpu) { }
> #endif
>
> void mce_timer_kick(unsigned long interval);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index e3f8b94..5e22d99 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -2306,6 +2306,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
> if (threshold_cpu_callback)
> threshold_cpu_callback(action, cpu);
> mce_device_remove(cpu);
> + mce_intel_hcpu_update(cpu);
> break;
> case CPU_DOWN_PREPARE:
> smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> index 92d8b5c..693bc7d 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -77,6 +77,14 @@ void mce_intel_cmci_poll(void)
> machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
> }
>
> +void mce_intel_hcpu_update(unsigned long cpu)
> +{
> + if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
> + atomic_dec(&cmci_storm_on_cpus);
> +
> + per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
> +}
> +
> unsigned long mce_intel_adjust_timer(unsigned long interval)
> {
> if (interval < CMCI_POLL_INTERVAL)
> @@ -119,6 +127,9 @@ static bool cmci_storm_detect(void)
> unsigned long ts = __this_cpu_read(cmci_time_stamp);
> unsigned long now = jiffies;
>
> + if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
> + return true;
> +
> if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
> cnt++;
> } else {
>
Hi, Thomas
How is going on for this patch and whole patch series? I don't know
if you have updated it or not.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH V3] tmp patch to fix hotplug issue in CMCI storm
2012-07-04 8:12 ` Chen Gong
@ 2012-07-16 3:16 ` Chen Gong
2012-07-16 8:22 ` Thomas Gleixner
0 siblings, 1 reply; 33+ messages in thread
From: Chen Gong @ 2012-07-16 3:16 UTC (permalink / raw)
To: Chen Gong; +Cc: tglx, tony.luck, borislav.petkov, peterz, x86, linux-kernel
于 2012/7/4 16:12, Chen Gong 写道:
> 于 2012/6/19 14:09, Chen Gong 写道:
>> v3->v1
>> Thanks very much for Thomas' suggestion to simply the whole logic.
>>
>> Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
>> ---
>> arch/x86/kernel/cpu/mcheck/mce-internal.h | 2 ++
>> arch/x86/kernel/cpu/mcheck/mce.c | 1 +
>> arch/x86/kernel/cpu/mcheck/mce_intel.c | 11 +++++++++++
>> 3 files changed, 14 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h
>> b/arch/x86/kernel/cpu/mcheck/mce-internal.h
>> index 2cd73ce..6a05c1d 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
>> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
>> @@ -31,9 +31,11 @@ extern struct mce_bank *mce_banks;
>> #ifdef CONFIG_X86_MCE_INTEL
>> unsigned long mce_intel_adjust_timer(unsigned long interval);
>> void mce_intel_cmci_poll(void);
>> +void mce_intel_hcpu_update(unsigned long cpu);
>> #else
>> # define mce_intel_adjust_timer mce_adjust_timer_default
>> static inline void mce_intel_cmci_poll(void) { }
>> +static inline void mce_intel_hcpu_update(unsigned long cpu) { }
>> #endif
>>
>> void mce_timer_kick(unsigned long interval);
>> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c
>> b/arch/x86/kernel/cpu/mcheck/mce.c
>> index e3f8b94..5e22d99 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce.c
>> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
>> @@ -2306,6 +2306,7 @@ mce_cpu_callback(struct notifier_block *nfb,
>> unsigned long action, void *hcpu)
>> if (threshold_cpu_callback)
>> threshold_cpu_callback(action, cpu);
>> mce_device_remove(cpu);
>> + mce_intel_hcpu_update(cpu);
>> break;
>> case CPU_DOWN_PREPARE:
>> smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
>> diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c
>> b/arch/x86/kernel/cpu/mcheck/mce_intel.c
>> index 92d8b5c..693bc7d 100644
>> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
>> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
>> @@ -77,6 +77,14 @@ void mce_intel_cmci_poll(void)
>> machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
>> }
>>
>> +void mce_intel_hcpu_update(unsigned long cpu)
>> +{
>> + if (per_cpu(cmci_storm_state, cpu) == CMCI_STORM_ACTIVE)
>> + atomic_dec(&cmci_storm_on_cpus);
>> +
>> + per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
>> +}
>> +
>> unsigned long mce_intel_adjust_timer(unsigned long interval)
>> {
>> if (interval < CMCI_POLL_INTERVAL)
>> @@ -119,6 +127,9 @@ static bool cmci_storm_detect(void)
>> unsigned long ts = __this_cpu_read(cmci_time_stamp);
>> unsigned long now = jiffies;
>>
>> + if (__this_cpu_read(cmci_storm_state) != CMCI_STORM_NONE)
>> + return true;
>> +
>> if (time_before_eq(now, ts + CMCI_STORM_INTERVAL)) {
>> cnt++;
>> } else {
>>
>
>
> Hi, Thomas
>
> How is going on for this patch and whole patch series? I don't know
> if you have updated it or not.
Hi, Thomas
Are you still care about this thread any more? Any plan to update it?
Hope to get your feedback ASAP.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH V3] tmp patch to fix hotplug issue in CMCI storm
2012-07-16 3:16 ` Chen Gong
@ 2012-07-16 8:22 ` Thomas Gleixner
2012-07-17 21:47 ` Chen Gong
0 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2012-07-16 8:22 UTC (permalink / raw)
To: Chen Gong; +Cc: tony.luck, borislav.petkov, peterz, x86, linux-kernel
On Mon, 16 Jul 2012, Chen Gong wrote:
>
> Are you still care about this thread any more? Any plan to update it?
> Hope to get your feedback ASAP.
Can you please collect the latest series and send it to lkml, Tony and
Boris. I think it's ok as is now.
Thanks,
tglx
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH V3] tmp patch to fix hotplug issue in CMCI storm
2012-07-16 8:22 ` Thomas Gleixner
@ 2012-07-17 21:47 ` Chen Gong
0 siblings, 0 replies; 33+ messages in thread
From: Chen Gong @ 2012-07-17 21:47 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: tony.luck, borislav.petkov, peterz, x86, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 819 bytes --]
On Mon, Jul 16, 2012 at 10:22:16AM +0200, Thomas Gleixner wrote:
> Date: Mon, 16 Jul 2012 10:22:16 +0200 (CEST)
> From: Thomas Gleixner <tglx@linutronix.de>
> To: Chen Gong <gong.chen@linux.intel.com>
> cc: tony.luck@intel.com, borislav.petkov@amd.com, peterz@infradead.org,
> x86@kernel.org, linux-kernel@vger.kernel.org
> Subject: Re: [PATCH V3] tmp patch to fix hotplug issue in CMCI storm
> User-Agent: Alpine 2.02 (LFD 1266 2009-07-14)
>
> On Mon, 16 Jul 2012, Chen Gong wrote:
> >
> > Are you still care about this thread any more? Any plan to update it?
> > Hope to get your feedback ASAP.
>
> Can you please collect the latest series and send it to lkml, Tony and
> Boris. I think it's ok as is now.
>
> Thanks,
>
> tglx
Fine, but what base I should use, -tip tree or Linus' tree?
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply [flat|nested] 33+ messages in thread