[patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
@ 2008-12-22  3:54 Ke, Liping
  2008-12-22 12:06 ` Keir Fraser
  0 siblings, 1 reply; 14+ messages in thread
From: Ke, Liping @ 2008-12-22  3:54 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Hi, All

According to Keir's suggestion, I re-organize those patches and make sure that each intermediate patch could be compiled and running.

Since remove old p4 and p6 files are hard to be split, I now put the remove and replace in the one patch. Is it ok?

Any comment, just let me know.
Thanks a lot for your help!
Criping

Following 3 patches are for enabling CMCI of Intel CPUs in XEN.
---------------------------------------------------------------------------
Patch 1: change_stop_machine_run.patch changes stopmachine_run interface so that we can designate the callback function running on the cpu_map instead of single one. We do this change because *cmci owner change* callback (please refer to note 3 below) needs to be executed on each of online cpus when do CPU hotplug. And also, we add a callback function for CMCI use before taking CPU down.
Patch 2: apic_cmci.patch adds the new CMCI LVT entry. And also it did small clean up jobs for thermal since thermal is not only P4 specific, but also later Intel CPUs common features. The old thermal removal will also be finished in patch 3 combined with old P4/p6 removals.
Patch 3: clean_and_cmci.patch contains the main CMCI support logic including removing old duplicated P4/P6 files, add new MCA/CMCI common init/interrupt processing, *CMCI owner judge algorithms* when (bring up CPUs or do CPU hotplug), polling mechanism, etc.
----------------------------------------------------------------------------
About Test 
We wrote CMCI injection tool to test the patch. Also we wrote DOM0 test patches to see whether logs are accepted by DOM0.
We test the patches on both CMCI-support and NO-CMCI-support machine.
We test the patches combined with CPU online/offline ops and S3&S5 ops which cause banks owner changing
----------------------------------------------------------------------------
Below notes might be helpful
1) CMCI use another apic_lvt entry (now max_lvt could be 6 in newer Intel platform)

2) CMCI is combined with polling mechanism since some CPUs don't support CMCI. And for supporting CPUs, still some banks don't support CMCI. So we keep polling mechanism. Also add simple policy, when find errors, shorten polling interval. Otherwise, lengthen intervals. 

3) MCA banks shares between cores/threads. For avoiding mis-ops, we need to judge *owner* for each bank. Each bank only has one owner, the owner manages the bank. The owner will be set up when cpu_up. Owners might be changed when doing cpu-hotplug.

4) When one CMCI interrupt is accepted and logged by XEN, XEN will notify DOM0. We don't plan to check in DOM0 code now.

For more info, please refer to latest Intel software development manual, 
Chapter 14.
--------------------------------------------------------------------------
Machine check (Uncorrectable Error) support will be sent in later patches, so we keep old machine check handling code. 

Any comment, just let us know. 
Thanks a lot for your help!
Regards,
Criping

_______________________________________________

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-22  3:54 [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs Ke, Liping
@ 2008-12-22 12:06 ` Keir Fraser
  2008-12-23  5:17   ` Ke, Liping
  0 siblings, 1 reply; 14+ messages in thread
From: Keir Fraser @ 2008-12-22 12:06 UTC (permalink / raw)
  To: Ke, Liping; +Cc: xen-devel

On 22/12/2008 03:54, "Ke, Liping" <liping.ke@intel.com> wrote:

> Hi, All
> 
> According to Keir's suggestion, I re-organize those patches and make sure that
> each intermediate patch could be compiled and running.
> 
> Since remove old p4 and p6 files are hard to be split, I now put the remove
> and replace in the one patch. Is it ok?
> 
> Any comment, just let me know.

I checked in the patches as c/s 18938 and then cleaned up significantly as
c/s 18939. Let me know if this stops things working for you. Notice I
deleted more than I added. ;-)

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-22 12:06 ` Keir Fraser
@ 2008-12-23  5:17   ` Ke, Liping
  2008-12-23  8:40     ` Keir Fraser
  0 siblings, 1 reply; 14+ messages in thread
From: Ke, Liping @ 2008-12-23  5:17 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1582 bytes --]

Hi, Keir

Thanks a lot for the clean-up and format refine! I will pay more attention to this too.And also I tested on the related platform. It's fine.

As for moving *cmci_owner_set* out of stopmachine_run is basically ok for us. 
Just one thing:  
CMCI might happen and lost during the very small window (old owner is cleared while new owner is not set). In order to make sure that CMCI could be triggered an on the new owner, we need to clear MSR Bank(i) status register [Corrected Error Counter] field ( We normally do this @ CMCI interrupt handler, according to spec, if the counter is not cleared, CMCI will not be triggered any more).
I made a small patch for it in the attachment. How do you think?

Thanks a lot!
Criping

-----Original Message-----
From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
Sent: 2008年12月22日 20:07
To: Ke, Liping
Cc: xen-devel@lists.xensource.com
Subject: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs

On 22/12/2008 03:54, "Ke, Liping" <liping.ke@intel.com> wrote:

> Hi, All
> 
> According to Keir's suggestion, I re-organize those patches and make sure that
> each intermediate patch could be compiled and running.
> 
> Since remove old p4 and p6 files are hard to be split, I now put the remove
> and replace in the one patch. Is it ok?
> 
> Any comment, just let me know.

I checked in the patches as c/s 18938 and then cleaned up significantly as
c/s 18939. Let me know if this stops things working for you. Notice I
deleted more than I added. ;-)

 -- Keir



[-- Attachment #2: clear_error_count.patch --]
[-- Type: application/octet-stream, Size: 1529 bytes --]

Clear Error counter field when set new cmci owner

Since cmci might happened when cpu is taking down (cpu hotplug) before
setting new cmci owner while old owner is down. We need to clear the corrected
error counter field to make sure CMCI could be triggered on the new owner.

Signed-off-by Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by Liping Ke <liping.ke@intel.com>

diff -r a574f4c17b34 xen/arch/x86/cpu/mcheck/mce_intel.c
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c	Tue Dec 23 11:25:24 2008 +0800
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c	Tue Dec 23 13:00:23 2008 +0800
@@ -358,6 +358,12 @@
         return 0;
     }
     set_bit(i, __get_cpu_var(mce_banks_owned));
+    /* Clear Corected Error Counter field, make sure CMCI could 
+     * be triggered on the new owner
+     */
+    msr = MSR_IA32_MC0_STATUS + 4 * i;
+    rdmsrl(msr, val);
+    wrmsrl(msr, val & ~MCi_STATUS_ERRCOUNT);
 out:
     clear_bit(i, __get_cpu_var(no_cmci_banks));
     return 1;
diff -r a574f4c17b34 xen/arch/x86/cpu/mcheck/x86_mca.h
--- a/xen/arch/x86/cpu/mcheck/x86_mca.h	Tue Dec 23 11:25:24 2008 +0800
+++ b/xen/arch/x86/cpu/mcheck/x86_mca.h	Tue Dec 23 13:00:23 2008 +0800
@@ -46,6 +46,8 @@
 #define MCi_STATUS_MSEC         0x00000000ffff0000ULL
 /* Other information */
 #define MCi_STATUS_OTHER        0x01ffffff00000000ULL
+/*Corrected Error Count*/
+#define MCi_STATUS_ERRCOUNT     0x001FFFC0000000000ULL 
 /* processor context corrupt */
 #define MCi_STATUS_PCC          0x0200000000000000ULL
 /* MSR_K8_MCi_ADDR register valid */

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-23  5:17   ` Ke, Liping
@ 2008-12-23  8:40     ` Keir Fraser
  2008-12-23  9:00       ` Keir Fraser
  0 siblings, 1 reply; 14+ messages in thread
From: Keir Fraser @ 2008-12-23  8:40 UTC (permalink / raw)
  To: Ke, Liping; +Cc: xen-devel

On 23/12/2008 05:17, "Ke, Liping" <liping.ke@intel.com> wrote:

> Thanks a lot for the clean-up and format refine! I will pay more attention to
> this too.And also I tested on the related platform. It's fine.
> 
> As for moving *cmci_owner_set* out of stopmachine_run is basically ok for us.
> Just one thing:  
> CMCI might happen and lost during the very small window (old owner is cleared
> while new owner is not set). In order to make sure that CMCI could be
> triggered an on the new owner, we need to clear MSR Bank(i) status register
> [Corrected Error Counter] field ( We normally do this @ CMCI interrupt
> handler, according to spec, if the counter is not cleared, CMCI will not be
> triggered any more).
> I made a small patch for it in the attachment. How do you think?

I don't know very much about CMCI. If you think this is required I will
certainly check it in.

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-23  8:40     ` Keir Fraser
@ 2008-12-23  9:00       ` Keir Fraser
  2008-12-23 12:32         ` Ke, Liping
  2008-12-24  2:57         ` Jiang, Yunhong
  0 siblings, 2 replies; 14+ messages in thread
From: Keir Fraser @ 2008-12-23  9:00 UTC (permalink / raw)
  To: Ke, Liping; +Cc: xen-devel

On 23/12/2008 08:40, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

>> As for moving *cmci_owner_set* out of stopmachine_run is basically ok for us.
>> Just one thing: 
>> CMCI might happen and lost during the very small window (old owner is cleared
>> while new owner is not set). In order to make sure that CMCI could be
>> triggered an on the new owner, we need to clear MSR Bank(i) status register
>> [Corrected Error Counter] field ( We normally do this @ CMCI interrupt
>> handler, according to spec, if the counter is not cleared, CMCI will not be
>> triggered any more).
>> I made a small patch for it in the attachment. How do you think?
> 
> I don't know very much about CMCI. If you think this is required I will
> certainly check it in.

Actually I think this is a good idea, even if we'd stayed with your original
CMCI patches. I will apply it.

One thing -- if you want to reduce the window between release of a band by
its old owner and acquisition by a new owner, we could do the whole lot
before stop_machine_run()? Maybe cmci_cpu_down(cpu) which would IPI 'cpu' to
clear its CMCI state and then IPI all other CPUs to pick up the released
banks. This would be neatly hooked off CPU_DOWN_PREPARE or similar in Linux,
but Xen doesn't have cpu notifiers. :-) You'd have to call cmci_cpu_down()
explicitly in cpu_down(). Or perhaps we should have cpu notifier chains in
Xen too...

If we do the above I don't think we need to re-introduce your rollback
logic. If you think about it, there's no reason to prefer the old owner over
the new owner, so no reason to roll back. I believe?

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-23  9:00       ` Keir Fraser
@ 2008-12-23 12:32         ` Ke, Liping
  2008-12-24  2:57         ` Jiang, Yunhong
  1 sibling, 0 replies; 14+ messages in thread
From: Ke, Liping @ 2008-12-23 12:32 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

> 
> One thing -- if you want to reduce the window between release of a
> band by its old owner and acquisition by a new owner, we could do the
> whole lot before stop_machine_run()? Maybe cmci_cpu_down(cpu) which
> would IPI 'cpu' to clear its CMCI state and then IPI all other CPUs
> to pick up the released banks. This would be neatly hooked off
> CPU_DOWN_PREPARE or similar in Linux, but Xen doesn't have cpu
> notifiers. :-) You'd have to call cmci_cpu_down() explicitly in
> cpu_down(). Or perhaps we should have cpu notifier chains in Xen
> too... 
Hi, Keir
When we wrote the patch, yunhong also mentioned similar thoughts, 
I will have some discussion with him tomorrow.
Thanks a lot!
Criping
> 
> If we do the above I don't think we need to re-introduce your rollback
> logic. If you think about it, there's no reason to prefer the old
> owner over the new owner, so no reason to roll back. I believe?
> 
>  -- Keir
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-23  9:00       ` Keir Fraser
  2008-12-23 12:32         ` Ke, Liping
@ 2008-12-24  2:57         ` Jiang, Yunhong
  2008-12-24  8:01           ` Keir Fraser
  1 sibling, 1 reply; 14+ messages in thread
From: Jiang, Yunhong @ 2008-12-24  2:57 UTC (permalink / raw)
  To: Keir Fraser, Ke, Liping; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 3511 bytes --]



>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com 
>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
>Sent: 2008年12月23日 17:00
>To: Ke, Liping
>Cc: xen-devel@lists.xensource.com
>Subject: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected 
>Machine Check Error Interrupt) for Intel CPUs
>
>On 23/12/2008 08:40, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:
>
>>> As for moving *cmci_owner_set* out of stopmachine_run is 
>basically ok for us.
>>> Just one thing: 
>>> CMCI might happen and lost during the very small window 
>(old owner is cleared
>>> while new owner is not set). In order to make sure that 
>CMCI could be
>>> triggered an on the new owner, we need to clear MSR Bank(i) 
>status register
>>> [Corrected Error Counter] field ( We normally do this @ 
>CMCI interrupt
>>> handler, according to spec, if the counter is not cleared, 
>CMCI will not be
>>> triggered any more).
>>> I made a small patch for it in the attachment. How do you think?
>> 
>> I don't know very much about CMCI. If you think this is 
>required I will
>> certainly check it in.
>
>Actually I think this is a good idea, even if we'd stayed with 
>your original
>CMCI patches. I will apply it.
>
>One thing -- if you want to reduce the window between release 
>of a band by
>its old owner and acquisition by a new owner, we could do the whole lot
>before stop_machine_run()? Maybe cmci_cpu_down(cpu) which 
>would IPI 'cpu' to
>clear its CMCI state and then IPI all other CPUs to pick up 
>the released
>banks. This would be neatly hooked off CPU_DOWN_PREPARE or 
>similar in Linux,
>but Xen doesn't have cpu notifiers. :-) You'd have to call 
>cmci_cpu_down()
>explicitly in cpu_down(). Or perhaps we should have cpu 
>notifier chains in
>Xen too...

Yes, we discussed this when working on the patch. 
Two target for CMCI when CPU offline: a) We'd better not lost CMCI interrupt; b)if we do lost CMCI interrupt, we should not block further CMCI interrupt. Since CMCI is correctaed error, so target a) is not so strong.

When we place the __cpu_clear_cmci in the stop_machine_run() logic, because the interrupt is disabled when the __cpu_clear_cmci() called, it is sure no CMCI interrupt is lost and no blocking will happen. In current Xen implementation, cpu_mcheck_distribute_cmci() is called after stop_machine_run(), so although there may be CMCI interrupt lost, but with Criping's patch, the CMCI will not be blocked anymore, and the solution is very clear.

As for your proposal of do it before the stop_machine_run(), it may reduce the window, but still can't eliminate the window of lost CMCI interrupt, unless we do similar thing in the cmci_cpu_down() (i.e. all CPU is irq_disabled before update the CMCI status). It is the same if we pull the notifier chain to Xen. How is your idea?


>
>If we do the above I don't think we need to re-introduce your rollback
>logic. If you think about it, there's no reason to prefer the 
>old owner over
>the new owner, so no reason to roll back. I believe?

Yes, currently we don't need rollback anymore. But if we do it before stop_machine_run(), we may still need the rollback for bank owned by the down cpu exclusively (not all back is shared between CPUs).

Thanks
Yunhong Jiang

>
> -- Keir
>
>
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@lists.xensource.com
>http://lists.xensource.com/xen-devel
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-24  2:57         ` Jiang, Yunhong
@ 2008-12-24  8:01           ` Keir Fraser
  2008-12-24  8:46             ` Jiang, Yunhong
  0 siblings, 1 reply; 14+ messages in thread
From: Keir Fraser @ 2008-12-24  8:01 UTC (permalink / raw)
  To: Jiang, Yunhong, Ke, Liping; +Cc: xen-devel

On 24/12/2008 02:57, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> As for your proposal of do it before the stop_machine_run(), it may reduce the
> window, but still can't eliminate the window of lost CMCI interrupt, unless we
> do similar thing in the cmci_cpu_down() (i.e. all CPU is irq_disabled before
> update the CMCI status). It is the same if we pull the notifier chain to Xen.
> How is your idea?

You only need hardirqs disabled, so rendezvous within an on_each_cpu()
callback is sufficient for you. You don't need to get intimate with
stop_machine_run()!

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-24  8:01           ` Keir Fraser
@ 2008-12-24  8:46             ` Jiang, Yunhong
  2008-12-24  8:54               ` Keir Fraser
  0 siblings, 1 reply; 14+ messages in thread
From: Jiang, Yunhong @ 2008-12-24  8:46 UTC (permalink / raw)
  To: Keir Fraser, Ke, Liping; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1087 bytes --]



>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
>Sent: 2008年12月24日 16:02
>To: Jiang, Yunhong; Ke, Liping
>Cc: xen-devel@lists.xensource.com
>Subject: Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected 
>Machine Check Error Interrupt) for Intel CPUs
>
>On 24/12/2008 02:57, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>
>> As for your proposal of do it before the stop_machine_run(), 
>it may reduce the
>> window, but still can't eliminate the window of lost CMCI 
>interrupt, unless we
>> do similar thing in the cmci_cpu_down() (i.e. all CPU is 
>irq_disabled before
>> update the CMCI status). It is the same if we pull the 
>notifier chain to Xen.
>> How is your idea?
>
>You only need hardirqs disabled, so rendezvous within an on_each_cpu()
>callback is sufficient for you. You don't need to get intimate with
>stop_machine_run()!

Yes, that's what I mean of "do similar thing in the cmci_cpu_down()". We placed it in stop_machine_run() because it has rendezvous all CPU already :$

>
> -- Keir
>
>
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-24  8:46             ` Jiang, Yunhong
@ 2008-12-24  8:54               ` Keir Fraser
  2008-12-24 14:51                 ` Jiang, Yunhong
  0 siblings, 1 reply; 14+ messages in thread
From: Keir Fraser @ 2008-12-24  8:54 UTC (permalink / raw)
  To: Jiang, Yunhong, Ke, Liping; +Cc: xen-devel

On 24/12/2008 08:46, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> You only need hardirqs disabled, so rendezvous within an on_each_cpu()
>> callback is sufficient for you. You don't need to get intimate with
>> stop_machine_run()!
> 
> Yes, that's what I mean of "do similar thing in the cmci_cpu_down()". We
> placed it in stop_machine_run() because it has rendezvous all CPU already :$

Yeah, on_each_cpu() rendezvous is pretty cheap. May as well have cleaner
partitioned code than avoid an extra rendezvous.

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-24  8:54               ` Keir Fraser
@ 2008-12-24 14:51                 ` Jiang, Yunhong
  2008-12-24 14:55                   ` Keir Fraser
  0 siblings, 1 reply; 14+ messages in thread
From: Jiang, Yunhong @ 2008-12-24 14:51 UTC (permalink / raw)
  To: Keir Fraser, Ke, Liping; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]

 

>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com 
>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
>Sent: 2008年12月24日 16:54
>To: Jiang, Yunhong; Ke, Liping
>Cc: xen-devel@lists.xensource.com
>Subject: Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected 
>Machine Check Error Interrupt) for Intel CPUs
>
>On 24/12/2008 08:46, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>
>>> You only need hardirqs disabled, so rendezvous within an 
>on_each_cpu()
>>> callback is sufficient for you. You don't need to get intimate with
>>> stop_machine_run()!
>> 
>> Yes, that's what I mean of "do similar thing in the 
>cmci_cpu_down()". We
>> placed it in stop_machine_run() because it has rendezvous 
>all CPU already :$
>
>Yeah, on_each_cpu() rendezvous is pretty cheap. May as well 
>have cleaner
>partitioned code than avoid an extra rendezvous.

I'd have a bit clarification here. Simply harirqs disabled in the on_each_cpu() is not enough. What we need is, firstly make sure every CPU has irq disabled, then every CPU begin update the CMCI status. If this is what you mean, we will take a patch for it later. 

Thanks
Yunhong Jiang

>
> -- Keir
>
>
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@lists.xensource.com
>http://lists.xensource.com/xen-devel
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-24 14:51                 ` Jiang, Yunhong
@ 2008-12-24 14:55                   ` Keir Fraser
  2008-12-24 15:08                     ` Jiang, Yunhong
  0 siblings, 1 reply; 14+ messages in thread
From: Keir Fraser @ 2008-12-24 14:55 UTC (permalink / raw)
  To: Jiang, Yunhong, Ke, Liping; +Cc: xen-devel

On 24/12/2008 14:51, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>>> Yes, that's what I mean of "do similar thing in the
>> cmci_cpu_down()". We
>>> placed it in stop_machine_run() because it has rendezvous
>> all CPU already :$
>> 
>> Yeah, on_each_cpu() rendezvous is pretty cheap. May as well
>> have cleaner
>> partitioned code than avoid an extra rendezvous.
> 
> I'd have a bit clarification here. Simply harirqs disabled in the
> on_each_cpu() is not enough. What we need is, firstly make sure every CPU has
> irq disabled, then every CPU begin update the CMCI status. If this is what you
> mean, we will take a patch for it later.

Every CPU will have IRQs disabled on entry to the on_each_cpu() IPI
function. I couldn't mean much else -- for example, it's not even valid for
the caller of on_each_cpu() to enter it with IRQs disabled.

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-24 14:55                   ` Keir Fraser
@ 2008-12-24 15:08                     ` Jiang, Yunhong
  2008-12-24 15:38                       ` Keir Fraser
  0 siblings, 1 reply; 14+ messages in thread
From: Jiang, Yunhong @ 2008-12-24 15:08 UTC (permalink / raw)
  To: Keir Fraser, Ke, Liping; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1432 bytes --]

 

>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
>Sent: 2008年12月24日 22:55
>To: Jiang, Yunhong; Ke, Liping
>Cc: xen-devel@lists.xensource.com
>Subject: Re: [Xen-devel] Re: [patch 0/3]Enable CMCI (Corrected 
>Machine Check Error Interrupt) for Intel CPUs
>
>On 24/12/2008 14:51, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>
>>>> Yes, that's what I mean of "do similar thing in the
>>> cmci_cpu_down()". We
>>>> placed it in stop_machine_run() because it has rendezvous
>>> all CPU already :$
>>> 
>>> Yeah, on_each_cpu() rendezvous is pretty cheap. May as well
>>> have cleaner
>>> partitioned code than avoid an extra rendezvous.
>> 
>> I'd have a bit clarification here. Simply harirqs disabled in the
>> on_each_cpu() is not enough. What we need is, firstly make 
>sure every CPU has
>> irq disabled, then every CPU begin update the CMCI status. 
>If this is what you
>> mean, we will take a patch for it later.
>
>Every CPU will have IRQs disabled on entry to the on_each_cpu() IPI
>function. I couldn't mean much else -- for example, it's not 
>even valid for
>the caller of on_each_cpu() to enter it with IRQs disabled.

Currently on_each_cpu() has no gurantee that the fn() will be called with all CPU has entered the IPI. For example, maybe on CPU has been working on the fn() while the IPI is still pending on other CPU.

>
> -- Keir
>
>
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Re: [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs
  2008-12-24 15:08                     ` Jiang, Yunhong
@ 2008-12-24 15:38                       ` Keir Fraser
  0 siblings, 0 replies; 14+ messages in thread
From: Keir Fraser @ 2008-12-24 15:38 UTC (permalink / raw)
  To: Jiang, Yunhong, Ke, Liping; +Cc: xen-devel

On 24/12/2008 15:08, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

>> Every CPU will have IRQs disabled on entry to the on_each_cpu() IPI
>> function. I couldn't mean much else -- for example, it's not
>> even valid for
>> the caller of on_each_cpu() to enter it with IRQs disabled.
> 
> Currently on_each_cpu() has no gurantee that the fn() will be called with all
> CPU has entered the IPI. For example, maybe on CPU has been working on the
> fn() while the IPI is still pending on other CPU.

In which case you can explicitly rendezvous in the handler, as we do for
time_calibration_rendezvous(). Notice there that we snapshot cpu_online_map
and use that as cpumask argument to on_selected_cpus() and to count CPUs
into a barrier.

If you absolutely need all CPUs to have IRQs disabled before you start your
work on any CPU, that's the way to do it.

 -- Keir

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-12-24 15:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-22  3:54 [patch 0/3]Enable CMCI (Corrected Machine Check Error Interrupt) for Intel CPUs Ke, Liping
2008-12-22 12:06 ` Keir Fraser
2008-12-23  5:17   ` Ke, Liping
2008-12-23  8:40     ` Keir Fraser
2008-12-23  9:00       ` Keir Fraser
2008-12-23 12:32         ` Ke, Liping
2008-12-24  2:57         ` Jiang, Yunhong
2008-12-24  8:01           ` Keir Fraser
2008-12-24  8:46             ` Jiang, Yunhong
2008-12-24  8:54               ` Keir Fraser
2008-12-24 14:51                 ` Jiang, Yunhong
2008-12-24 14:55                   ` Keir Fraser
2008-12-24 15:08                     ` Jiang, Yunhong
2008-12-24 15:38                       ` Keir Fraser

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.