* vMCE vs migration
@ 2012-01-23 11:08 Jan Beulich
  2012-01-24 10:29 ` George Dunlap
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-01-23 11:08 UTC (permalink / raw)
  To: xen-devel; +Cc: Olaf Hering

x86's vMCE implementation lets a guest know of as many MCE reporting
banks as there are in the host. While a PV guest could be expected to
deal with this number changing (particularly decreasing) during migration
(not currently handled anywhere afaict), for HVM guests this is certainly
wrong.

It isn't clear to me, however, how to handle this properly. The
easiest approach would appear to be to save and restore the number of
banks the guest was made to believe it can access, making
vmce_{rd,wr}msr() silently tolerate accesses between the host and
guest values.

Any thoughts appreciated,
Jan
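The tolerance policy proposed here can be sketched as a small classifier over the bank MSR range. This is an illustrative simplification, not Xen's actual code: the enum, function name, and the three-way outcome are invented for exposition; only the four-MSRs-per-bank layout starting at IA32_MC0_CTL (0x400) is architectural.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_MC0_CTL 0x400u

enum msr_access { ACCESS_HANDLE, ACCESS_TOLERATE, ACCESS_FAULT };

/* Each MCA bank owns four consecutive MSRs starting at IA32_MC0_CTL.
 * guest_banks is the saved/restored count the guest was told about;
 * host_banks is what the current host actually has. */
enum msr_access classify_bank_msr(uint32_t msr,
                                  unsigned int guest_banks,
                                  unsigned int host_banks)
{
    unsigned int bank;

    if ( msr < MSR_IA32_MC0_CTL )
        return ACCESS_FAULT;
    bank = (msr - MSR_IA32_MC0_CTL) / 4;
    if ( bank >= guest_banks )
        return ACCESS_FAULT;     /* beyond what the guest was told: #GP */
    if ( bank >= host_banks )
        return ACCESS_TOLERATE;  /* guest value exceeds current host:
                                  * silently read 0 / drop the write */
    return ACCESS_HANDLE;        /* backed by a real host bank */
}
```

After migrating to a host with fewer banks (say host_banks = 4 while the guest still believes guest_banks = 6), accesses to banks 4 and 5 fall in the tolerated window instead of faulting.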


* Re: vMCE vs migration
  2012-01-23 11:08 vMCE vs migration Jan Beulich
@ 2012-01-24 10:29 ` George Dunlap
  2012-01-24 11:08   ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: George Dunlap @ 2012-01-24 10:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Olaf Hering, xen-devel

On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
> x86's vMCE implementation lets a guest know of as many MCE reporting
> banks as there are in the host. While a PV guest could be expected to
> deal with this number changing (particularly decreasing) during migration
> (not currently handled anywhere afaict), for HVM guests this is certainly
> wrong.
>
> At least to me it isn't, however, clear how to properly handle this. The
> easiest would appear to be to save and restore the number of banks
> the guest was made believe it can access, making vmce_{rd,wr}msr()
> silently tolerate accesses between the host and guest values.

We ran into this in the XS 6.0 release as well.  I think that the
ideal thing to do would be to have a parameter that can be set at
boot, to say how many vMCE banks a guest has, defaulting to the number
of MCE banks on the host.  This parameter would be preserved across
migration.  Ideally, a pool-aware toolstack like xapi would then set
this value to be the value of the host in the pool with the largest
number of banks, allowing a guest to access all the banks on any host
to which it migrates.

What do you think?
 -George
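The toolstack-side choice George describes — a pool-aware toolstack picking the maximum bank count of any host in the pool, so the guest can use every bank wherever it lands — reduces to a simple maximum. A sketch under that assumption (the function name is illustrative, not xapi's actual API):

```c
#include <assert.h>
#include <stddef.h>

/* Pick the vMCE bank count for a new guest: the largest MCE bank
 * count among all hosts in the pool, so the guest can access all
 * banks on any host it may migrate to. Returns 0 for an empty pool. */
unsigned int pool_vmce_banks(const unsigned int *host_banks, size_t n)
{
    unsigned int max = 0;
    size_t i;

    for ( i = 0; i < n; i++ )
        if ( host_banks[i] > max )
            max = host_banks[i];
    return max;
}
```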


* Re: vMCE vs migration
  2012-01-24 10:29 ` George Dunlap
@ 2012-01-24 11:08   ` Jan Beulich
  2012-01-26 16:54     ` George Dunlap
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-01-24 11:08 UTC (permalink / raw)
  To: George Dunlap; +Cc: Olaf Hering, xen-devel

>>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> x86's vMCE implementation lets a guest know of as many MCE reporting
>> banks as there are in the host. While a PV guest could be expected to
>> deal with this number changing (particularly decreasing) during migration
>> (not currently handled anywhere afaict), for HVM guests this is certainly
>> wrong.
>>
>> At least to me it isn't, however, clear how to properly handle this. The
>> easiest would appear to be to save and restore the number of banks
>> the guest was made believe it can access, making vmce_{rd,wr}msr()
>> silently tolerate accesses between the host and guest values.
> 
> We ran into this in the XS 6.0 release as well.  I think that the
> ideal thing to do would be to have a parameter that can be set at
> boot, to say how many vMCE banks a guest has, defaulting to the number
> of MCE banks on the host.  This parameter would be preserved across
> migration.  Ideally, a pool-aware toolstack like xapi would then set
> this value to be the value of the host in the pool with the largest
> number of banks, allowing a guest to access all the banks on any host
> to which it migrates.
> 
> What do you think?

That sounds like the way to go.

Jan


* Re: vMCE vs migration
  2012-01-24 11:08   ` Jan Beulich
@ 2012-01-26 16:54     ` George Dunlap
  2012-01-30  7:52       ` Jan Beulich
  2012-01-30 13:47       ` Jan Beulich
  0 siblings, 2 replies; 31+ messages in thread
From: George Dunlap @ 2012-01-26 16:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Ian Campbell

On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
> >> x86's vMCE implementation lets a guest know of as many MCE reporting
> >> banks as there are in the host. While a PV guest could be expected to
> >> deal with this number changing (particularly decreasing) during migration
> >> (not currently handled anywhere afaict), for HVM guests this is certainly
> >> wrong.
> >>
> >> At least to me it isn't, however, clear how to properly handle this. The
> >> easiest would appear to be to save and restore the number of banks
> >> the guest was made believe it can access, making vmce_{rd,wr}msr()
> >> silently tolerate accesses between the host and guest values.
> > 
> > We ran into this in the XS 6.0 release as well.  I think that the
> > ideal thing to do would be to have a parameter that can be set at
> > boot, to say how many vMCE banks a guest has, defaulting to the number
> > of MCE banks on the host.  This parameter would be preserved across
> > migration.  Ideally, a pool-aware toolstack like xapi would then set
> > this value to be the value of the host in the pool with the largest
> > number of banks, allowing a guest to access all the banks on any host
> > to which it migrates.
> > 
> > What do you think?
> 
> That sounds like the way to go.

So should we put this on IanC's to-do-be-done list?  Are you going to
put it on your to-do list? :-)

 -George


* Re: vMCE vs migration
  2012-01-26 16:54     ` George Dunlap
@ 2012-01-30  7:52       ` Jan Beulich
  2012-01-30 13:47       ` Jan Beulich
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2012-01-30  7:52 UTC (permalink / raw)
  To: George Dunlap, George.Dunlap; +Cc: xen-devel, Ian Campbell

>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
>> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> x86's vMCE implementation lets a guest know of as many MCE reporting
>> >> banks as there are in the host. While a PV guest could be expected to
>> >> deal with this number changing (particularly decreasing) during migration
>> >> (not currently handled anywhere afaict), for HVM guests this is certainly
>> >> wrong.
>> >>
>> >> At least to me it isn't, however, clear how to properly handle this. The
>> >> easiest would appear to be to save and restore the number of banks
>> >> the guest was made believe it can access, making vmce_{rd,wr}msr()
>> >> silently tolerate accesses between the host and guest values.
>> > 
>> > We ran into this in the XS 6.0 release as well.  I think that the
>> > ideal thing to do would be to have a parameter that can be set at
>> > boot, to say how many vMCE banks a guest has, defaulting to the number
>> > of MCE banks on the host.  This parameter would be preserved across
>> > migration.  Ideally, a pool-aware toolstack like xapi would then set
>> > this value to be the value of the host in the pool with the largest
>> > number of banks, allowing a guest to access all the banks on any host
>> > to which it migrates.
>> > 
>> > What do you think?
>> 
>> That sounds like the way to go.
> 
> So should we put this on IanC's to-do-be-done list?

Would certainly be desirable to get fixed.

> Are you going to put it on your to-do list? :-)

I'll first check with Intel whether the engineers who wrote that code
could make an attempt.

Jan


* Re: vMCE vs migration
  2012-01-26 16:54     ` George Dunlap
  2012-01-30  7:52       ` Jan Beulich
@ 2012-01-30 13:47       ` Jan Beulich
  2012-01-31 11:27         ` George Dunlap
                           ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Jan Beulich @ 2012-01-30 13:47 UTC (permalink / raw)
  To: George Dunlap, George.Dunlap; +Cc: Olaf Hering, xen-devel, Ian Campbell

>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
>> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> x86's vMCE implementation lets a guest know of as many MCE reporting
>> >> banks as there are in the host. While a PV guest could be expected to
>> >> deal with this number changing (particularly decreasing) during migration
>> >> (not currently handled anywhere afaict), for HVM guests this is certainly
>> >> wrong.
>> >>
>> >> At least to me it isn't, however, clear how to properly handle this. The
>> >> easiest would appear to be to save and restore the number of banks
>> >> the guest was made believe it can access, making vmce_{rd,wr}msr()
>> >> silently tolerate accesses between the host and guest values.
>> > 
>> > We ran into this in the XS 6.0 release as well.  I think that the
>> > ideal thing to do would be to have a parameter that can be set at
>> > boot, to say how many vMCE banks a guest has, defaulting to the number
>> > of MCE banks on the host.  This parameter would be preserved across
>> > migration.  Ideally, a pool-aware toolstack like xapi would then set
>> > this value to be the value of the host in the pool with the largest
>> > number of banks, allowing a guest to access all the banks on any host
>> > to which it migrates.
>> > 
>> > What do you think?
>> 
>> That sounds like the way to go.
> 
> So should we put this on IanC's to-do-be-done list?  Are you going to
> put it on your to-do list? :-)

Below/attached a draft patch (compile tested only), handling save/
restore of the bank count, but not allowing for a config setting to
specify its initial value (yet).

Jan

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in
 
         domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
         domctl.domain = dom;
+        memset(&domctl.u, 0, sizeof(domctl.u));
         domctl.u.ext_vcpucontext.vcpu = i;
         if ( xc_domctl(xch, &domctl) < 0 )
         {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);           
 }
 
+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: nr_banks %u\n", p.nr_banks);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default:
             printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H
 
 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern int firstbank;
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);
 
 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban
 
 extern int vmce_init(struct cpuinfo_x86 *c);
 
-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
           return 1;
     return 0;
 }
 
-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-        mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.nr_vmce_banks)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }
 
 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                  "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                  "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -61,21 +62,26 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }
 
-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.nr_vmce_banks = nr_mce_banks;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;
 
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-            (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
+                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +132,7 @@ static int bank_mce_rdmsr(struct domain 
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +151,13 @@ static int bank_mce_rdmsr(struct domain 
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     *val = 0;
 
-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);
 
     switch ( msr )
     {
@@ -173,28 +179,26 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }
 
-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }
 
-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +206,9 @@ static int bank_mce_wrmsr(struct domain 
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +232,7 @@ static int bank_mce_wrmsr(struct domain 
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +251,9 @@ static int bank_mce_wrmsr(struct domain 
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     if ( !g_mcg_cap )
@@ -266,7 +270,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +297,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }
 
@@ -301,6 +305,47 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }
 
+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .nr_banks = v->arch.nr_vmce_banks
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        gdprintk(XENLOG_ERR, "HVM restore: domain has no vcpu %u\n", vcpuid);
+        return -EINVAL;
+    }
+
+    err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+    if ( !err )
+        v->arch.nr_vmce_banks = ctxt.nr_banks;
+
+    return err;
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -468,6 +468,8 @@ int vcpu_initialise(struct vcpu *v)
 
     v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
 
+    vmce_init_vcpu(v);
+
     rc = is_pv_32on64_vcpu(v) ? setup_compat_l4(v) : 0;
  done:
     if ( rc )
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip    = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->nr_mce_banks = v->arch.nr_vmce_banks;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), nr_mce_banks) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,16 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mbz[ARRAY_SIZE(evc->mbz)]) )
+            {
+                unsigned int i;
+
+                for ( i = 0; i < ARRAY_SIZE(evc->mbz); ++i )
+                    if ( evc->mbz[i] )
+                        goto ext_vcpucontext_out;
+                v->arch.nr_vmce_banks = evc->nr_mce_banks;
+            }
         }
 
         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint8_t nr_vmce_banks;
     
     struct paging_vcpu paging;
 
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -28,6 +28,7 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
 
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -585,9 +585,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint8_t nr_banks;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /* 
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t         vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t         size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,10 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t         sysenter_callback_cs;
     uint8_t          syscall32_disables_events;
     uint8_t          sysenter_disables_events;
+    uint8_t          nr_mce_banks;
+    /* This array can be split and used for future extension. */
+    /* It is there just to grow the structure beyond 4.1's size. */
+    uint8_t          mbz[9];
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;
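The range checks the patch switches to (`MSR_IA32_MCx_CTL(v->arch.nr_vmce_banks)` and `MSR_IA32_MCx_CTL2(...)`) lean on the architectural MSR layout: four MSRs per bank starting at IA32_MC0_CTL (0x400), plus a second one-MSR-per-bank CMCI range starting at IA32_MC0_CTL2 (0x280). A sketch of that arithmetic (illustrative macro expansions, not copied from Xen's headers):

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_MC0_CTL      0x400u
#define MSR_IA32_MC0_CTL2     0x280u
/* Upper bounds as used in the patch's "msr < MSR_IA32_MCx_CTL(n)" tests: */
#define MSR_IA32_MCx_CTL(x)   (MSR_IA32_MC0_CTL + 4 * (x))
#define MSR_IA32_MCx_CTL2(x)  (MSR_IA32_MC0_CTL2 + (x))

/* Same formula bank_mce_rdmsr()/bank_mce_wrmsr() use to recover the
 * bank index from an MSR in the IA32_MCi_{CTL,STATUS,ADDR,MISC} range. */
unsigned int bank_of(uint32_t msr)
{
    return (msr - MSR_IA32_MC0_CTL) / 4;
}
```

With nr_vmce_banks = 6, for example, the last in-range CTL-family MSR is bank 5's (0x414 < MSR_IA32_MCx_CTL(6) = 0x418), which is why dropping the old `bank >= nr_mce_banks` early return is safe once the callers bound-check against the per-vCPU count.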


+            .nr_banks = v->arch.nr_vmce_banks
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        gdprintk(XENLOG_ERR, "HVM restore: domain has no vcpu %u\n", vcpuid);
+        return -EINVAL;
+    }
+
+    err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+    if ( !err )
+        v->arch.nr_vmce_banks = ctxt.nr_banks;
+
+    return err;
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -468,6 +468,8 @@ int vcpu_initialise(struct vcpu *v)
 
     v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
 
+    vmce_init_vcpu(v);
+
     rc = is_pv_32on64_vcpu(v) ? setup_compat_l4(v) : 0;
  done:
     if ( rc )
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip    = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->nr_mce_banks = v->arch.nr_vmce_banks;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), nr_mce_banks) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,16 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mbz[ARRAY_SIZE(evc->mbz)]) )
+            {
+                unsigned int i;
+
+                for ( i = 0; i < ARRAY_SIZE(evc->mbz); ++i )
+                    if ( evc->mbz[i] )
+                        goto ext_vcpucontext_out;
+                v->arch.nr_vmce_banks = evc->nr_mce_banks;
+            }
         }
 
         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint8_t nr_vmce_banks;
     
     struct paging_vcpu paging;
 
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -28,6 +28,7 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
 
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -585,9 +585,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint8_t nr_banks;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /* 
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t         vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t         size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,10 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t         sysenter_callback_cs;
     uint8_t          syscall32_disables_events;
     uint8_t          sysenter_disables_events;
+    uint8_t          nr_mce_banks;
+    /* This array can be split and used for future extension. */
+    /* It is there just to grow the structure beyond 4.1's size. */
+    uint8_t          mbz[9];
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

* Re: vMCE vs migration
  2012-01-30 13:47       ` Jan Beulich
@ 2012-01-31 11:27         ` George Dunlap
  2012-01-31 11:28           ` George Dunlap
  2012-01-31 13:17           ` Jan Beulich
  2012-02-03  7:18         ` Liu, Jinsong
  2012-02-09 18:02         ` Olaf Hering
  2 siblings, 2 replies; 31+ messages in thread
From: George Dunlap @ 2012-01-31 11:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Olaf Hering, xen-devel, Ian Campbell

On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
> >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> > On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> >> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
> >> >> x86's vMCE implementation lets a guest know of as many MCE reporting
> >> >> banks as there are in the host. While a PV guest could be expected to
> >> >> deal with this number changing (particularly decreasing) during migration
> >> >> (not currently handled anywhere afaict), for HVM guests this is certainly
> >> >> wrong.
> >> >>
> >> >> At least to me it isn't, however, clear how to properly handle this. The
> >> >> easiest would appear to be to save and restore the number of banks
> >> >> the guest was made believe it can access, making vmce_{rd,wr}msr()
> >> >> silently tolerate accesses between the host and guest values.
> >> >
> >> > We ran into this in the XS 6.0 release as well.  I think that the
> >> > ideal thing to do would be to have a parameter that can be set at
> >> > boot, to say how many vMCE banks a guest has, defaulting to the number
> >> > of MCE banks on the host.  This parameter would be preserved across
> >> > migration.  Ideally, a pool-aware toolstack like xapi would then set
> >> > this value to be the value of the host in the pool with the largest
> >> > number of banks, allowing a guest to access all the banks on any host
> >> > to which it migrates.
> >> >
> >> > What do you think?
> >>
> >> That sounds like the way to go.
> >
> > So should we put this on IanC's to-do-be-done list?  Are you going to
> > put it on your to-do list? :-)
> 
> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).

Looks pretty good at first blush.  Just one question: Why is the vmce
count made on a per-vcpu basis, rather than on a per-domain basis, like
the actual banks are?  Is the host MCE stuff per-vcpu?

 -George

> +static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
> +{
> +    int ret = 1;
> +    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
> +    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
>      struct bank_entry *entry;
> 
> -    bank = (msr - MSR_IA32_MC0_CTL) / 4;
> -    if ( bank >= nr_mce_banks )
> -        return -1;
> +    *val = 0;
> 
>      switch ( msr & (MSR_IA32_MC0_CTL | 3) )
>      {
>      case MSR_IA32_MC0_CTL:
> -        *val = vmce->mci_ctl[bank] &
> -            (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
> +        if ( bank < nr_mce_banks )
> +            *val = vmce->mci_ctl[bank] &
> +                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
>          mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
>                     bank, *val);
>          break;
> @@ -126,7 +132,7 @@ static int bank_mce_rdmsr(struct domain
>          switch ( boot_cpu_data.x86_vendor )
>          {
>          case X86_VENDOR_INTEL:
> -            ret = intel_mce_rdmsr(msr, val);
> +            ret = intel_mce_rdmsr(v, msr, val);
>              break;
>          default:
>              ret = 0;
> @@ -145,13 +151,13 @@ static int bank_mce_rdmsr(struct domain
>   */
>  int vmce_rdmsr(uint32_t msr, uint64_t *val)
>  {
> -    struct domain *d = current->domain;
> -    struct domain_mca_msrs *vmce = dom_vmce(d);
> +    const struct vcpu *cur = current;
> +    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
>      int ret = 1;
> 
>      *val = 0;
> 
> -    spin_lock(&dom_vmce(d)->lock);
> +    spin_lock(&vmce->lock);
> 
>      switch ( msr )
>      {
> @@ -173,28 +179,26 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
>                     *val);
>          break;
>      default:
> -        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
> +        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
>          break;
>      }
> 
> -    spin_unlock(&dom_vmce(d)->lock);
> +    spin_unlock(&vmce->lock);
>      return ret;
>  }
> 
> -static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
> +static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
>  {
> -    int bank, ret = 1;
> -    struct domain_mca_msrs *vmce = dom_vmce(d);
> +    int ret = 1;
> +    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
> +    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
>      struct bank_entry *entry = NULL;
> 
> -    bank = (msr - MSR_IA32_MC0_CTL) / 4;
> -    if ( bank >= nr_mce_banks )
> -        return -EINVAL;
> -
>      switch ( msr & (MSR_IA32_MC0_CTL | 3) )
>      {
>      case MSR_IA32_MC0_CTL:
> -        vmce->mci_ctl[bank] = val;
> +        if ( bank < nr_mce_banks )
> +            vmce->mci_ctl[bank] = val;
>          break;
>      case MSR_IA32_MC0_STATUS:
>          /* Give the first entry of the list, it corresponds to current
> @@ -202,9 +206,9 @@ static int bank_mce_wrmsr(struct domain
>           * the guest, this node will be deleted.
>           * Only error bank is written. Non-error banks simply return.
>           */
> -        if ( !list_empty(&dom_vmce(d)->impact_header) )
> +        if ( !list_empty(&vmce->impact_header) )
>          {
> -            entry = list_entry(dom_vmce(d)->impact_header.next,
> +            entry = list_entry(vmce->impact_header.next,
>                                 struct bank_entry, list);
>              if ( entry->bank == bank )
>                  entry->mci_status = val;
> @@ -228,7 +232,7 @@ static int bank_mce_wrmsr(struct domain
>          switch ( boot_cpu_data.x86_vendor )
>          {
>          case X86_VENDOR_INTEL:
> -            ret = intel_mce_wrmsr(msr, val);
> +            ret = intel_mce_wrmsr(v, msr, val);
>              break;
>          default:
>              ret = 0;
> @@ -247,9 +251,9 @@ static int bank_mce_wrmsr(struct domain
>   */
>  int vmce_wrmsr(u32 msr, u64 val)
>  {
> -    struct domain *d = current->domain;
> +    struct vcpu *cur = current;
>      struct bank_entry *entry = NULL;
> -    struct domain_mca_msrs *vmce = dom_vmce(d);
> +    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
>      int ret = 1;
> 
>      if ( !g_mcg_cap )
> @@ -266,7 +270,7 @@ int vmce_wrmsr(u32 msr, u64 val)
>          vmce->mcg_status = val;
>          mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
>          /* For HVM guest, this is the point for deleting vMCE injection node */
> -        if ( d->is_hvm && (vmce->nr_injection > 0) )
> +        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
>          {
>              vmce->nr_injection--; /* Should be 0 */
>              if ( !list_empty(&vmce->impact_header) )
> @@ -293,7 +297,7 @@ int vmce_wrmsr(u32 msr, u64 val)
>          ret = -1;
>          break;
>      default:
> -        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
> +        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
>          break;
>      }
> 
> @@ -301,6 +305,47 @@ int vmce_wrmsr(u32 msr, u64 val)
>      return ret;
>  }
> 
> +static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
> +{
> +    struct vcpu *v;
> +    int err = 0;
> +
> +    for_each_vcpu( d, v ) {
> +        struct hvm_vmce_vcpu ctxt = {
> +            .nr_banks = v->arch.nr_vmce_banks
> +        };
> +
> +        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
> +        if ( err )
> +            break;
> +    }
> +
> +    return err;
> +}
> +
> +static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
> +{
> +    unsigned int vcpuid = hvm_load_instance(h);
> +    struct vcpu *v;
> +    struct hvm_vmce_vcpu ctxt;
> +    int err;
> +
> +    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
> +    {
> +        gdprintk(XENLOG_ERR, "HVM restore: domain has no vcpu %u\n", vcpuid);
> +        return -EINVAL;
> +    }
> +
> +    err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
> +    if ( !err )
> +        v->arch.nr_vmce_banks = ctxt.nr_banks;
> +
> +    return err;
> +}
> +
> +HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
> +                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
> +
>  int inject_vmce(struct domain *d)
>  {
>      int cpu = smp_processor_id();
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -468,6 +468,8 @@ int vcpu_initialise(struct vcpu *v)
> 
>      v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> 
> +    vmce_init_vcpu(v);
> +
>      rc = is_pv_32on64_vcpu(v) ? setup_compat_l4(v) : 0;
>   done:
>      if ( rc )
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -1027,11 +1027,12 @@ long arch_do_domctl(
>                  evc->syscall32_callback_eip    = 0;
>                  evc->syscall32_disables_events = 0;
>              }
> +            evc->nr_mce_banks = v->arch.nr_vmce_banks;
>          }
>          else
>          {
>              ret = -EINVAL;
> -            if ( evc->size != sizeof(*evc) )
> +            if ( evc->size < offsetof(typeof(*evc), nr_mce_banks) )
>                  goto ext_vcpucontext_out;
>  #ifdef __x86_64__
>              if ( !is_hvm_domain(d) )
> @@ -1059,6 +1060,16 @@ long arch_do_domctl(
>                   (evc->syscall32_callback_cs & ~3) ||
>                   evc->syscall32_callback_eip )
>                  goto ext_vcpucontext_out;
> +
> +            if ( evc->size >= offsetof(typeof(*evc), mbz[ARRAY_SIZE(evc->mbz)]) )
> +            {
> +                unsigned int i;
> +
> +                for ( i = 0; i < ARRAY_SIZE(evc->mbz); ++i )
> +                    if ( evc->mbz[i] )
> +                        goto ext_vcpucontext_out;
> +                v->arch.nr_vmce_banks = evc->nr_mce_banks;
> +            }
>          }
> 
>          ret = 0;
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -488,6 +488,8 @@ struct arch_vcpu
>      /* This variable determines whether nonlazy extended state has been used,
>       * and thus should be saved/restored. */
>      bool_t nonlazy_xstate_used;
> +
> +    uint8_t nr_vmce_banks;
> 
>      struct paging_vcpu paging;
> 
> --- a/xen/include/asm-x86/mce.h
> +++ b/xen/include/asm-x86/mce.h
> @@ -28,6 +28,7 @@ struct domain_mca_msrs
>  /* Guest vMCE MSRs virtualization */
>  extern int vmce_init_msr(struct domain *d);
>  extern void vmce_destroy_msr(struct domain *d);
> +extern void vmce_init_vcpu(struct vcpu *);
>  extern int vmce_wrmsr(uint32_t msr, uint64_t val);
>  extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
> 
> --- a/xen/include/public/arch-x86/hvm/save.h
> +++ b/xen/include/public/arch-x86/hvm/save.h
> @@ -585,9 +585,15 @@ struct hvm_viridian_vcpu_context {
> 
>  DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
> 
> +struct hvm_vmce_vcpu {
> +    uint8_t nr_banks;
> +};
> +
> +DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
> +
>  /*
>   * Largest type-code in use
>   */
> -#define HVM_SAVE_CODE_MAX 17
> +#define HVM_SAVE_CODE_MAX 18
> 
>  #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
>      uint32_t         vcpu;
>      /*
>       * SET: Size of struct (IN)
> -     * GET: Size of struct (OUT)
> +     * GET: Size of struct (OUT, up to 128 bytes)
>       */
>      uint32_t         size;
>  #if defined(__i386__) || defined(__x86_64__)
> @@ -571,6 +571,10 @@ struct xen_domctl_ext_vcpucontext {
>      uint16_t         sysenter_callback_cs;
>      uint8_t          syscall32_disables_events;
>      uint8_t          sysenter_disables_events;
> +    uint8_t          nr_mce_banks;
> +    /* This array can be split and used for future extension. */
> +    /* It is there just to grow the structure beyond 4.1's size. */
> +    uint8_t          mbz[9];
>  #endif
>  };
>  typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;
> 


* Re: vMCE vs migration
  2012-01-31 11:27         ` George Dunlap
@ 2012-01-31 11:28           ` George Dunlap
  2012-01-31 13:17           ` Jan Beulich
  1 sibling, 0 replies; 31+ messages in thread
From: George Dunlap @ 2012-01-31 11:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Olaf Hering, xen-devel, Ian Campbell

On Tue, 2012-01-31 at 11:27 +0000, George Dunlap wrote:
> On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
> > >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> > > On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> > >> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> > >> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
> > >> >> x86's vMCE implementation lets a guest know of as many MCE reporting
> > >> >> banks as there are in the host. While a PV guest could be expected to
> > >> >> deal with this number changing (particularly decreasing) during migration
> > >> >> (not currently handled anywhere afaict), for HVM guests this is certainly
> > >> >> wrong.
> > >> >>
> > >> >> At least to me it isn't, however, clear how to properly handle this. The
> > >> >> easiest would appear to be to save and restore the number of banks
> > >> >> the guest was made believe it can access, making vmce_{rd,wr}msr()
> > >> >> silently tolerate accesses between the host and guest values.
> > >> >
> > >> > We ran into this in the XS 6.0 release as well.  I think that the
> > >> > ideal thing to do would be to have a parameter that can be set at
> > >> > boot, to say how many vMCE banks a guest has, defaulting to the number
> > >> > of MCE banks on the host.  This parameter would be preserved across
> > >> > migration.  Ideally, a pool-aware toolstack like xapi would then set
> > >> > this value to be the value of the host in the pool with the largest
> > >> > number of banks, allowing a guest to access all the banks on any host
> > >> > to which it migrates.
> > >> >
> > >> > What do you think?
> > >>
> > >> That sounds like the way to go.
> > >
> > > So should we put this on IanC's to-do-be-done list?  Are you going to
> > > put it on your to-do list? :-)
> > 
> > Below/attached a draft patch (compile tested only), handling save/
> > restore of the bank count, but not allowing for a config setting to
> > specify its initial value (yet).
> 
> Looks pretty good for a first blush.  Just one question: Why is the vmce
> count made on a per-vcpu basis, rather than on a per-domain basis, like
> the actual banks are?  Is the host MCE stuff per-vcpu?

(Per-pcpu, that is...)



* Re: vMCE vs migration
  2012-01-31 11:27         ` George Dunlap
  2012-01-31 11:28           ` George Dunlap
@ 2012-01-31 13:17           ` Jan Beulich
  2012-01-31 14:34             ` George Dunlap
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-01-31 13:17 UTC (permalink / raw)
  To: George Dunlap, George.Dunlap; +Cc: Olaf Hering, xen-devel, Ian Campbell

>>> On 31.01.12 at 12:27, George Dunlap <george.dunlap@citrix.com> wrote:
> On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
>> >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
>> > On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
>> >> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> >> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> >> x86's vMCE implementation lets a guest know of as many MCE reporting
>> >> >> banks as there are in the host. While a PV guest could be expected to
>> >> >> deal with this number changing (particularly decreasing) during migration
>> >> >> (not currently handled anywhere afaict), for HVM guests this is certainly
>> >> >> wrong.
>> >> >>
>> >> >> At least to me it isn't, however, clear how to properly handle this. The
>> >> >> easiest would appear to be to save and restore the number of banks
>> >> >> the guest was made believe it can access, making vmce_{rd,wr}msr()
>> >> >> silently tolerate accesses between the host and guest values.
>> >> >
>> >> > We ran into this in the XS 6.0 release as well.  I think that the
>> >> > ideal thing to do would be to have a parameter that can be set at
>> >> > boot, to say how many vMCE banks a guest has, defaulting to the number
>> >> > of MCE banks on the host.  This parameter would be preserved across
>> >> > migration.  Ideally, a pool-aware toolstack like xapi would then set
>> >> > this value to be the value of the host in the pool with the largest
>> >> > number of banks, allowing a guest to access all the banks on any host
>> >> > to which it migrates.
>> >> >
>> >> > What do you think?
>> >>
>> >> That sounds like the way to go.
>> >
>> > So should we put this on IanC's to-do-be-done list?  Are you going to
>> > put it on your to-do list? :-)
>> 
>> Below/attached a draft patch (compile tested only), handling save/
>> restore of the bank count, but not allowing for a config setting to
>> specify its initial value (yet).
> 
> Looks pretty good for a first blush.  Just one question: Why is the vmce
> count made on a per-vcpu basis, rather than on a per-domain basis, like
> the actual banks are?  Is the host MCE stuff per-vcpu?

The question should probably be the other way around - why is the
vMCE implementation using global (fake) MSRs rather than per-vCPU
ones (as they would be on hardware). If the change here was
implemented as per-domain MSRs, a future move of the vMCE
implementation to a more natural model would be impossible. Also,
for the PV case the save/restore logic is much simpler this way.

Jan


* Re: vMCE vs migration
  2012-01-31 13:17           ` Jan Beulich
@ 2012-01-31 14:34             ` George Dunlap
  0 siblings, 0 replies; 31+ messages in thread
From: George Dunlap @ 2012-01-31 14:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Olaf Hering, xen-devel, Ian Campbell

On Tue, 2012-01-31 at 13:17 +0000, Jan Beulich wrote:
> >>> On 31.01.12 at 12:27, George Dunlap <george.dunlap@citrix.com> wrote:
> > On Mon, 2012-01-30 at 13:47 +0000, Jan Beulich wrote:
> >> >>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com> wrote:
> >> > On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
> >> >> >>> On 24.01.12 at 11:29, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> >> >> > On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com> wrote:
> >> >> >> x86's vMCE implementation lets a guest know of as many MCE reporting
> >> >> >> banks as there are in the host. While a PV guest could be expected to
> >> >> >> deal with this number changing (particularly decreasing) during migration
> >> >> >> (not currently handled anywhere afaict), for HVM guests this is certainly
> >> >> >> wrong.
> >> >> >>
> >> >> >> At least to me it isn't, however, clear how to properly handle this. The
> >> >> >> easiest would appear to be to save and restore the number of banks
> >> >> >> the guest was made believe it can access, making vmce_{rd,wr}msr()
> >> >> >> silently tolerate accesses between the host and guest values.
> >> >> >
> >> >> > We ran into this in the XS 6.0 release as well.  I think that the
> >> >> > ideal thing to do would be to have a parameter that can be set at
> >> >> > boot, to say how many vMCE banks a guest has, defaulting to the number
> >> >> > of MCE banks on the host.  This parameter would be preserved across
> >> >> > migration.  Ideally, a pool-aware toolstack like xapi would then set
> >> >> > this value to be the value of the host in the pool with the largest
> >> >> > number of banks, allowing a guest to access all the banks on any host
> >> >> > to which it migrates.
> >> >> >
> >> >> > What do you think?
> >> >>
> >> >> That sounds like the way to go.
> >> >
> >> > So should we put this on IanC's to-do-be-done list?  Are you going to
> >> > put it on your to-do list? :-)
> >> 
> >> Below/attached a draft patch (compile tested only), handling save/
> >> restore of the bank count, but not allowing for a config setting to
> >> specify its initial value (yet).
> > 
> > Looks pretty good for a first blush.  Just one question: Why is the vmce
> > count made on a per-vcpu basis, rather than on a per-domain basis, like
> > the actual banks are?  Is the host MCE stuff per-vcpu?
> 
> The question should probably be the other way around - why is the
> vMCE implementation using global (fake) MSRs rather than per-vCPU
> ones (as they would be on hardware). If the change here was
> implemented as per-domain MSRs, a future move of the vMCE
> implementation to a more natural model would be impossible. Also,
> for the PV case the save/restore logic is much simpler this way.

Ah, that makes sense.  Good spot on the impossibility of changing to
per-vcpu later.

 -George


* Re: vMCE vs migration
  2012-01-30 13:47       ` Jan Beulich
  2012-01-31 11:27         ` George Dunlap
@ 2012-02-03  7:18         ` Liu, Jinsong
  2012-02-03  8:08           ` Jan Beulich
  2012-02-09 18:02         ` Olaf Hering
  2 siblings, 1 reply; 31+ messages in thread
From: Liu, Jinsong @ 2012-02-03  7:18 UTC (permalink / raw)
  To: Jan Beulich, George Dunlap, George.Dunlap
  Cc: Olaf Hering, xen-devel, Ian Campbell, Jiang, Yunhong, Shan,
	Haitao, Dugger, Donald D

Jan Beulich wrote:
>>>> On 26.01.12 at 17:54, George Dunlap <george.dunlap@citrix.com>
>>>> wrote: 
>> On Tue, 2012-01-24 at 11:08 +0000, Jan Beulich wrote:
>>>>>> On 24.01.12 at 11:29, George Dunlap
>>>>>> <George.Dunlap@eu.citrix.com> wrote: 
>>>> On Mon, Jan 23, 2012 at 11:08 AM, Jan Beulich <JBeulich@suse.com>
>>>> wrote: 
>>>>> x86's vMCE implementation lets a guest know of as many MCE
>>>>> reporting banks as there are in the host. While a PV guest could
>>>>> be expected to deal with this number changing (particularly
>>>>> decreasing) during migration (not currently handled anywhere
>>>>> afaict), for HVM guests this is certainly wrong. 
>>>>> 
>>>>> At least to me it isn't, however, clear how to properly handle
>>>>> this. The easiest would appear to be to save and restore the
>>>>> number of banks 
>>>>> the guest was made believe it can access, making vmce_{rd,wr}msr()
>>>>> silently tolerate accesses between the host and guest values.
>>>> 
>>>> We ran into this in the XS 6.0 release as well.  I think that the
>>>> ideal thing to do would be to have a parameter that can be set at
>>>> boot, to say how many vMCE banks a guest has, defaulting to the
>>>> number of MCE banks on the host.  This parameter would be
>>>> preserved across migration.  Ideally, a pool-aware toolstack like
>>>> xapi would then set this value to be the value of the host in the
>>>> pool with the largest number of banks, allowing a guest to access
>>>> all the banks on any host to which it migrates. 
>>>> 
>>>> What do you think?
>>> 
>>> That sounds like the way to go.
>> 
>> So should we put this on IanC's to-do-be-done list?  Are you going to
>> put it on your to-do list? :-)
> 
> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).
> 
> Jan
> 
> --- a/tools/libxc/xc_domain_save.c
> +++ b/tools/libxc/xc_domain_save.c
> @@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in
> 
>          domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
>          domctl.domain = dom;
> +        memset(&domctl.u, 0, sizeof(domctl.u));
>          domctl.u.ext_vcpucontext.vcpu = i;
>          if ( xc_domctl(xch, &domctl) < 0 )
>          {
> --- a/tools/misc/xen-hvmctx.c
> +++ b/tools/misc/xen-hvmctx.c
> @@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
>             (unsigned long long) p.apic_assist);
>  }
> 
> +static void dump_vmce_vcpu(void)
> +{
> +    HVM_SAVE_TYPE(VMCE_VCPU) p;
> +    READ(p);
> +    printf("    VMCE_VCPU: nr_banks %u\n", p.nr_banks);
> +}
> +
>  int main(int argc, char **argv)
>  {
>      int entry, domid;
> @@ -449,6 +456,7 @@ int main(int argc, char **argv)
>          case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
>          case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
>          case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
> +        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
>          case HVM_SAVE_CODE(END): break;
>          default:
>              printf(" ** Don't understand type %u: skipping\n",
> --- a/xen/arch/x86/cpu/mcheck/mce.h
> +++ b/xen/arch/x86/cpu/mcheck/mce.h
> @@ -3,6 +3,7 @@
>  #define _MCE_H
> 
>  #include <xen/init.h>
> +#include <xen/sched.h>
>  #include <xen/smp.h>
>  #include <asm/types.h>
>  #include <asm/traps.h>
> @@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
>  u64 mce_cap_init(void);
>  extern int firstbank;
> 
> -int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
> -int intel_mce_wrmsr(uint32_t msr, uint64_t val);
> +int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
> +int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);
> 
>  struct mcinfo_extended *intel_get_extended_msrs(
>      struct mcinfo_global *mig, struct mc_info *mi);
> @@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban
> 
>  extern int vmce_init(struct cpuinfo_x86 *c);
> 
> -static inline int mce_vendor_bank_msr(uint32_t msr)
> +static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
>  {
>      if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> -         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
> +         msr >= MSR_IA32_MC0_CTL2 &&
> +         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
>            return 1;
>      return 0;
>  }
> 
> -static inline int mce_bank_msr(uint32_t msr)
> +static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
>  {
> -    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
> -        mce_vendor_bank_msr(msr) )
> +    if ( (msr >= MSR_IA32_MC0_CTL &&
> +          msr < MSR_IA32_MCx_CTL(v->arch.nr_vmce_banks)) ||
> +         mce_vendor_bank_msr(v, msr) )
>          return 1;
>      return 0;
>  }
> --- a/xen/arch/x86/cpu/mcheck/mce_intel.c
> +++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
> @@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
>  }
> 
>  /* intel specific MCA MSR */
> -int intel_mce_wrmsr(uint32_t msr, uint64_t val)
> +int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
>  {
>      int ret = 0;
> 
> -    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
> +    if ( msr >= MSR_IA32_MC0_CTL2 &&
> +         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
>      {
>          mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
>                   "Guest should not write this MSR!\n");
> @@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
>      return ret;
>  }
> 
> -int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
> +int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
>  {
>      int ret = 0;
> 
> -    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
> +    if ( msr >= MSR_IA32_MC0_CTL2 &&
> +         msr < MSR_IA32_MCx_CTL2(v->arch.nr_vmce_banks) )
>      {
>          mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
>                   "Guest should not read this MSR!\n");
> --- a/xen/arch/x86/cpu/mcheck/vmce.c
> +++ b/xen/arch/x86/cpu/mcheck/vmce.c
> @@ -10,6 +10,7 @@
>  #include <xen/delay.h>
>  #include <xen/smp.h>
>  #include <xen/mm.h>
> +#include <xen/hvm/save.h>
>  #include <asm/processor.h>
>  #include <public/sysctl.h>
>  #include <asm/system.h>
> @@ -61,21 +62,26 @@ void vmce_destroy_msr(struct domain *d)
>      dom_vmce(d) = NULL;
>  }
> 
> -static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
> +void vmce_init_vcpu(struct vcpu *v)
>  {
> -    int bank, ret = 1;
> -    struct domain_mca_msrs *vmce = dom_vmce(d);
> +    v->arch.nr_vmce_banks = nr_mce_banks;
> +}
> +
> +static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
> +{
> +    int ret = 1;
> +    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
> +    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
>      struct bank_entry *entry;
> 
> -    bank = (msr - MSR_IA32_MC0_CTL) / 4;
> -    if ( bank >= nr_mce_banks )
> -        return -1;
> +    *val = 0;
> 
>      switch ( msr & (MSR_IA32_MC0_CTL | 3) )
>      {
>      case MSR_IA32_MC0_CTL:
> -        *val = vmce->mci_ctl[bank] &
> -            (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
> +        if ( bank < nr_mce_banks )
> +            *val = vmce->mci_ctl[bank] &
> +                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
>          mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
>                     bank, *val);
>          break;
> @@ -126,7 +132,7 @@ static int bank_mce_rdmsr(struct domain
>          switch ( boot_cpu_data.x86_vendor )
>          {
>          case X86_VENDOR_INTEL:
> -            ret = intel_mce_rdmsr(msr, val);
> +            ret = intel_mce_rdmsr(v, msr, val);
>              break;
>          default:
>              ret = 0;
> @@ -145,13 +151,13 @@ static int bank_mce_rdmsr(struct domain
>   */
>  int vmce_rdmsr(uint32_t msr, uint64_t *val)
>  {
> -    struct domain *d = current->domain;
> -    struct domain_mca_msrs *vmce = dom_vmce(d);
> +    const struct vcpu *cur = current;
> +    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
>      int ret = 1;
> 
>      *val = 0;
> 
> -    spin_lock(&dom_vmce(d)->lock);
> +    spin_lock(&vmce->lock);
> 
>      switch ( msr )
>      {
> @@ -173,28 +179,26 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
>                     *val);
>          break;
>      default:
> -        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
> +        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
>          break;
>      }
> 
> -    spin_unlock(&dom_vmce(d)->lock);
> +    spin_unlock(&vmce->lock);
>      return ret;
>  }
> 
> -static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
> +static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
>  {
> -    int bank, ret = 1;
> -    struct domain_mca_msrs *vmce = dom_vmce(d);
> +    int ret = 1;
> +    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
> +    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
>      struct bank_entry *entry = NULL;
> 
> -    bank = (msr - MSR_IA32_MC0_CTL) / 4;
> -    if ( bank >= nr_mce_banks )
> -        return -EINVAL;
> -
>      switch ( msr & (MSR_IA32_MC0_CTL | 3) )
>      {
>      case MSR_IA32_MC0_CTL:
> -        vmce->mci_ctl[bank] = val;
> +        if ( bank < nr_mce_banks )
> +            vmce->mci_ctl[bank] = val;
>          break;
>      case MSR_IA32_MC0_STATUS:
>          /* Give the first entry of the list, it corresponds to current
> @@ -202,9 +206,9 @@ static int bank_mce_wrmsr(struct domain
>           * the guest, this node will be deleted.
>           * Only error bank is written. Non-error banks simply return.
>           */
> -        if ( !list_empty(&dom_vmce(d)->impact_header) )
> +        if ( !list_empty(&vmce->impact_header) )
>          {
> -            entry = list_entry(dom_vmce(d)->impact_header.next,
> +            entry = list_entry(vmce->impact_header.next,
>                                 struct bank_entry, list);
>              if ( entry->bank == bank )
>                  entry->mci_status = val;
> @@ -228,7 +232,7 @@ static int bank_mce_wrmsr(struct domain
>          switch ( boot_cpu_data.x86_vendor )
>          {
>          case X86_VENDOR_INTEL:
> -            ret = intel_mce_wrmsr(msr, val);
> +            ret = intel_mce_wrmsr(v, msr, val);
>              break;
>          default:
>              ret = 0;
> @@ -247,9 +251,9 @@ static int bank_mce_wrmsr(struct domain
>   */
>  int vmce_wrmsr(u32 msr, u64 val)
>  {
> -    struct domain *d = current->domain;
> +    struct vcpu *cur = current;
>      struct bank_entry *entry = NULL;
> -    struct domain_mca_msrs *vmce = dom_vmce(d);
> +    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
>      int ret = 1;
> 
>      if ( !g_mcg_cap )
> @@ -266,7 +270,7 @@ int vmce_wrmsr(u32 msr, u64 val)
>          vmce->mcg_status = val;
>          mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
>          /* For HVM guest, this is the point for deleting vMCE injection node */
> -        if ( d->is_hvm && (vmce->nr_injection > 0) )
> +        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
>          {
>              vmce->nr_injection--; /* Should be 0 */
>              if ( !list_empty(&vmce->impact_header) )
> @@ -293,7 +297,7 @@ int vmce_wrmsr(u32 msr, u64 val)
>          ret = -1;
>          break;
>      default:
> -        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
> +        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
>          break;
>      }
> 
> @@ -301,6 +305,47 @@ int vmce_wrmsr(u32 msr, u64 val)
>      return ret;
>  }
> 
> +static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
> +{
> +    struct vcpu *v;
> +    int err = 0;
> +
> +    for_each_vcpu( d, v ) {
> +        struct hvm_vmce_vcpu ctxt = {
> +            .nr_banks = v->arch.nr_vmce_banks
> +        };
> +
> +        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
> +        if ( err )
> +            break;
> +    }
> +
> +    return err;
> +}
> +
> +static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
> +{
> +    unsigned int vcpuid = hvm_load_instance(h);
> +    struct vcpu *v;
> +    struct hvm_vmce_vcpu ctxt;
> +    int err;
> +
> +    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
> +    {
> +        gdprintk(XENLOG_ERR, "HVM restore: domain has no vcpu %u\n", vcpuid);
> +        return -EINVAL;
> +    }
> +
> +    err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
> +    if ( !err )
> +        v->arch.nr_vmce_banks = ctxt.nr_banks;
> +
> +    return err;
> +}
> +
> +HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
> +                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
> +
>  int inject_vmce(struct domain *d)
>  {
>      int cpu = smp_processor_id();
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -468,6 +468,8 @@ int vcpu_initialise(struct vcpu *v)
> 
>      v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
> 
> +    vmce_init_vcpu(v);
> +
>      rc = is_pv_32on64_vcpu(v) ? setup_compat_l4(v) : 0;
>   done:
>      if ( rc )
> --- a/xen/arch/x86/domctl.c
> +++ b/xen/arch/x86/domctl.c
> @@ -1027,11 +1027,12 @@ long arch_do_domctl(
>                  evc->syscall32_callback_eip    = 0;
>                  evc->syscall32_disables_events = 0;
>              }
> +            evc->nr_mce_banks = v->arch.nr_vmce_banks;
>          }
>          else
>          {
>              ret = -EINVAL;
> -            if ( evc->size != sizeof(*evc) )
> +            if ( evc->size < offsetof(typeof(*evc), nr_mce_banks) )
>                  goto ext_vcpucontext_out;
>  #ifdef __x86_64__
>              if ( !is_hvm_domain(d) )
> @@ -1059,6 +1060,16 @@ long arch_do_domctl(
>                   (evc->syscall32_callback_cs & ~3) ||
>                   evc->syscall32_callback_eip )
>                  goto ext_vcpucontext_out;
> +
> +            if ( evc->size >= offsetof(typeof(*evc), mbz[ARRAY_SIZE(evc->mbz)]) )
> +            {
> +                unsigned int i;
> +
> +                for ( i = 0; i < ARRAY_SIZE(evc->mbz); ++i )
> +                    if ( evc->mbz[i] )
> +                        goto ext_vcpucontext_out;
> +                v->arch.nr_vmce_banks = evc->nr_mce_banks;
> +            }
>          }
> 
>          ret = 0;
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -488,6 +488,8 @@ struct arch_vcpu
>      /* This variable determines whether nonlazy extended state has been used,
>       * and thus should be saved/restored. */
>      bool_t nonlazy_xstate_used;
> +
> +    uint8_t nr_vmce_banks;
> 
>      struct paging_vcpu paging;
> 
> --- a/xen/include/asm-x86/mce.h
> +++ b/xen/include/asm-x86/mce.h
> @@ -28,6 +28,7 @@ struct domain_mca_msrs
>  /* Guest vMCE MSRs virtualization */
>  extern int vmce_init_msr(struct domain *d);
>  extern void vmce_destroy_msr(struct domain *d);
> +extern void vmce_init_vcpu(struct vcpu *);
>  extern int vmce_wrmsr(uint32_t msr, uint64_t val);
>  extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
> 
> --- a/xen/include/public/arch-x86/hvm/save.h
> +++ b/xen/include/public/arch-x86/hvm/save.h
> @@ -585,9 +585,15 @@ struct hvm_viridian_vcpu_context {
> 
>  DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
> 
> +struct hvm_vmce_vcpu {
> +    uint8_t nr_banks;
> +};
> +
> +DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
> +
>  /*
>   * Largest type-code in use
>   */
> -#define HVM_SAVE_CODE_MAX 17
> +#define HVM_SAVE_CODE_MAX 18
> 
>  #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
>      uint32_t         vcpu;
>      /*
>       * SET: Size of struct (IN)
> -     * GET: Size of struct (OUT)
> +     * GET: Size of struct (OUT, up to 128 bytes)
>       */
>      uint32_t         size;
>  #if defined(__i386__) || defined(__x86_64__)
> @@ -571,6 +571,10 @@ struct xen_domctl_ext_vcpucontext {
>      uint16_t         sysenter_callback_cs;
>      uint8_t          syscall32_disables_events;
>      uint8_t          sysenter_disables_events;
> +    uint8_t          nr_mce_banks;
> +    /* This array can be split and used for future extension. */
> +    /* It is there just to grow the structure beyond 4.1's size. */
> +    uint8_t          mbz[9];
>  #endif
>  };
>  typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;

Jan,

How about using a static bank size, say 256, the maximum bank number the architecture (the IA32_MCG_CAP MSR) can express?

Thanks,
Jinsong


* Re: vMCE vs migration
  2012-02-03  7:18         ` Liu, Jinsong
@ 2012-02-03  8:08           ` Jan Beulich
  2012-02-03 12:34             ` Liu, Jinsong
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-02-03  8:08 UTC (permalink / raw)
  To: Jinsong Liu
  Cc: Olaf Hering, xen-devel, Ian Campbell, George.Dunlap,
	Yunhong Jiang, Haitao Shan, George Dunlap, Donald D Dugger

>>> On 03.02.12 at 08:18, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> How about use static bank size, say 256, which architecture (IA32_MCG_CAP 
> MSR) defined max bank number?

The maximum, if any, is 32 (beyond which collision with other defined
MSRs would arise). But with the possibility of having a discontiguous
or relocated MSR space, I don't think any static maximum would be a
good idea.

Jan


* Re: vMCE vs migration
  2012-02-03  8:08           ` Jan Beulich
@ 2012-02-03 12:34             ` Liu, Jinsong
  2012-02-03 14:04               ` Jan Beulich
  0 siblings, 1 reply; 31+ messages in thread
From: Liu, Jinsong @ 2012-02-03 12:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Olaf Hering, xen-devel, Ian Campbell, George.Dunlap, Jiang,
	Yunhong, Shan, Haitao, George Dunlap, Dugger, Donald D

Jan Beulich wrote:
>>>> On 03.02.12 at 08:18, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>> How about use static bank size, say 256, which architecture
>> (IA32_MCG_CAP MSR) defined max bank number?
> 
> The maximum, if any, is 32 (beyond which collision with other defined
> MSRs would arise). But with the possibility of having a discontiguous
> or relocated MSR space, I don't think any static maximum would be a
> good idea.
> 
> Jan

The advantage of a static max bank count is simplicity; are there any disadvantages to a static solution?
Or, could we use the bank_entry list in vmce->impact_header for mci_ctl, like
mci_status/mci_addr/mci_misc do?
I just think the patch may be too complicated for a minor issue.

Thanks,
Jinsong


* Re: vMCE vs migration
  2012-02-03 12:34             ` Liu, Jinsong
@ 2012-02-03 14:04               ` Jan Beulich
  2012-02-04 12:35                 ` George Dunlap
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-02-03 14:04 UTC (permalink / raw)
  To: Jinsong Liu
  Cc: Olaf Hering, xen-devel, Ian Campbell, George.Dunlap,
	Yunhong Jiang, Haitao Shan, George Dunlap, Donald D Dugger

>>> On 03.02.12 at 13:34, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
> Jan Beulich wrote:
>>>>> On 03.02.12 at 08:18, "Liu, Jinsong" <jinsong.liu@intel.com> wrote:
>>> How about use static bank size, say 256, which architecture
>>> (IA32_MCG_CAP MSR) defined max bank number?
>> 
>> The maximum, if any, is 32 (beyond which collision with other defined
>> MSRs would arise). But with the possibility of having a discontiguous
>> or relocated MSR space, I don't think any static maximum would be a
>> good idea.
>> 
>> Jan
> 
> The advantages of static max bank is simple, any disadvantages of static 
> solution?

Lack of forward compatibility. It may even be that the use of uint8_t
wasn't really a good choice of mine.

> or, can we use bank_entry listed in vmce->impact_header for mci_ctl, like what 
> mci_status/mci_addr/mci_misc do?

I don't think so - the control register (obviously) must be accessible
independent of any ongoing (latched) MCE.

> I just think the patch may be too complicated for a minor issue.

Is a guest crash minor? After all, there is no guarantee that accesses
to these MSRs are exception-handler protected in a way similar to what
current Linux does.

Jan


* Re: vMCE vs migration
  2012-02-03 14:04               ` Jan Beulich
@ 2012-02-04 12:35                 ` George Dunlap
  0 siblings, 0 replies; 31+ messages in thread
From: George Dunlap @ 2012-02-04 12:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Jinsong Liu, Olaf Hering, xen-devel, Ian Campbell, Yunhong Jiang,
	Haitao Shan, George Dunlap, Donald D Dugger

On Fri, Feb 3, 2012 at 2:04 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> I just think the patch may be too complicated for a minor issue.
>
> Is a guest crash minor? After all there is no guarantee that the
> accesses to these MSRs are exception handler protected in a

In fact, I can tell you for a fact that newer versions of Windows do
*not* handle the exception properly, and will crash if an MCE MSR access
that worked before now generates an exception (for example, after
live-migrating to a host with fewer MCE banks).

 -George


* Re: vMCE vs migration
  2012-01-30 13:47       ` Jan Beulich
  2012-01-31 11:27         ` George Dunlap
  2012-02-03  7:18         ` Liu, Jinsong
@ 2012-02-09 18:02         ` Olaf Hering
  2012-02-10 10:03           ` Jan Beulich
  2 siblings, 1 reply; 31+ messages in thread
From: Olaf Hering @ 2012-02-09 18:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Mon, Jan 30, Jan Beulich wrote:

> Below/attached a draft patch (compile tested only), handling save/
> restore of the bank count, but not allowing for a config setting to
> specify its initial value (yet).

Does it take more than just applying this patch on both the source and
destination host and migrating a HVM guest? I see no difference; the MCE
warning is still there.

Olaf


* Re: vMCE vs migration
  2012-02-09 18:02         ` Olaf Hering
@ 2012-02-10 10:03           ` Jan Beulich
  2012-02-10 16:53             ` Olaf Hering
  2012-02-10 21:28             ` Olaf Hering
  0 siblings, 2 replies; 31+ messages in thread
From: Jan Beulich @ 2012-02-10 10:03 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

>>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
> On Mon, Jan 30, Jan Beulich wrote:
> 
>> Below/attached a draft patch (compile tested only), handling save/
>> restore of the bank count, but not allowing for a config setting to
>> specify its initial value (yet).
> 
> Does it take more than just applying this patch for src+dst host and
> migrate a hvm guest? I see no difference, the mce warning is still
> there.

No, it shouldn't require anything else. Could you add a printk() each
to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored (and
at once check that they actually get executed)? I was under the
impression that adding save records for HVM is a simple drop-in
exercise these days...

Jan


* Re: vMCE vs migration
  2012-02-10 10:03           ` Jan Beulich
@ 2012-02-10 16:53             ` Olaf Hering
  2012-02-10 17:00               ` Jan Beulich
  2012-02-10 21:28             ` Olaf Hering
  1 sibling, 1 reply; 31+ messages in thread
From: Olaf Hering @ 2012-02-10 16:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Fri, Feb 10, Jan Beulich wrote:

> No, it shouldn't require anything else. Could you add a printk() each
> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> (and at once checking that they actually get executed? I was under
> the impression that adding save records for HVM is a simple drop-in
> exercise these days...

The functions are called on both sides with no errors.


Regarding these messages, are they guest addresses?

(XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4


Olaf


* Re: vMCE vs migration
  2012-02-10 16:53             ` Olaf Hering
@ 2012-02-10 17:00               ` Jan Beulich
  2012-02-10 17:05                 ` Olaf Hering
  2012-02-13  8:30                 ` Olaf Hering
  0 siblings, 2 replies; 31+ messages in thread
From: Jan Beulich @ 2012-02-10 17:00 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

>>> On 10.02.12 at 17:53, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
> 
>> No, it shouldn't require anything else. Could you add a printk() each
>> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
>> (and at once checking that they actually get executed)? I was under
>> the impression that adding save records for HVM is a simple drop-in
>> exercise these days...
> 
> The functions are called on both sides with no errors.

And the bank count also is correctly reflecting the original, save-side
value on the restore side?

> Regarding these messages, are they guest addresses?
> 
> (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4

No, these are hypervisor ones.

Jan


* Re: vMCE vs migration
  2012-02-10 17:00               ` Jan Beulich
@ 2012-02-10 17:05                 ` Olaf Hering
  2012-02-13  8:30                 ` Olaf Hering
  1 sibling, 0 replies; 31+ messages in thread
From: Olaf Hering @ 2012-02-10 17:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Fri, Feb 10, Jan Beulich wrote:

> >>> On 10.02.12 at 17:53, Olaf Hering <olaf@aepfle.de> wrote:
> > On Fri, Feb 10, Jan Beulich wrote:
> > 
> >> No, it shouldn't require anything else. Could you add a printk() each
> >> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
>> (and at once checking that they actually get executed)? I was under
> >> the impression that adding save records for HVM is a simple drop-in
> >> exercise these days...
> > 
> > The functions are called on both sides with no errors.
> 
> And the bank count also is correctly reflecting the original, save-side
> value on the restore side?

Oh, the count is zero on both sides?
I will double check my dbg patch.

Olaf


* Re: vMCE vs migration
  2012-02-10 10:03           ` Jan Beulich
  2012-02-10 16:53             ` Olaf Hering
@ 2012-02-10 21:28             ` Olaf Hering
  2012-02-13  9:30               ` Jan Beulich
  2012-02-13 10:36               ` Jan Beulich
  1 sibling, 2 replies; 31+ messages in thread
From: Olaf Hering @ 2012-02-10 21:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Fri, Feb 10, Jan Beulich wrote:

> >>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
> > On Mon, Jan 30, Jan Beulich wrote:
> > 
> >> Below/attached a draft patch (compile tested only), handling save/
> >> restore of the bank count, but not allowing for a config setting to
> >> specify its initial value (yet).
> > 
> > Does it take more than just applying this patch for src+dst host and
> > migrate a hvm guest? I see no difference, the mce warning is still
> > there.
> 
> No, it shouldn't require anything else. Could you add a printk() each
> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
> (and at once checking that they actually get executed)? I was under
> the impression that adding save records for HVM is a simple drop-in
> exercise these days...

These functions are called for dom0, but not for domU. And as a result
arch.nr_vmce_banks remains zero. I assume the guest needs to be
initialized in some way as well, and that does not happen?

Olaf


* Re: vMCE vs migration
  2012-02-10 17:00               ` Jan Beulich
  2012-02-10 17:05                 ` Olaf Hering
@ 2012-02-13  8:30                 ` Olaf Hering
  2012-02-13 10:43                   ` Jan Beulich
  1 sibling, 1 reply; 31+ messages in thread
From: Olaf Hering @ 2012-02-13  8:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Fri, Feb 10, Jan Beulich wrote:

> > Regarding these messages, are they guest addresses?
> > 
> > (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4
> 
> No, these are hypervisor ones.

These are the functions they belong to:

(XEN) traps.c:3131: GPF (0000): ffff82c4801d61ec -> ffff82c48022aa2e
(XEN) do_general_protection: ffff82c4801d61ec vmx_msr_read_intercept+0x2d8/0x358
(XEN) do_general_protection: ffff82c48022aa2e vmac+0x6c8/0x99a

Olaf


* Re: vMCE vs migration
  2012-02-10 21:28             ` Olaf Hering
@ 2012-02-13  9:30               ` Jan Beulich
  2012-02-13 10:36               ` Jan Beulich
  1 sibling, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2012-02-13  9:30 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

>>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
> 
>> >>> On 09.02.12 at 19:02, Olaf Hering <olaf@aepfle.de> wrote:
>> > On Mon, Jan 30, Jan Beulich wrote:
>> > 
>> >> Below/attached a draft patch (compile tested only), handling save/
>> >> restore of the bank count, but not allowing for a config setting to
>> >> specify its initial value (yet).
>> > 
>> > Does it take more than just applying this patch for src+dst host and
>> > migrate a hvm guest? I see no difference, the mce warning is still
>> > there.
>> 
>> No, it shouldn't require anything else. Could you add a printk() each
>> to vmce_{save,load}_vcpu_ctxt() printing what gets saved/restored
>> (and at once checking that they actually get executed)? I was under
>> the impression that adding save records for HVM is a simple drop-in
>> exercise these days...
> 
> These functions are called for dom0, but not for domU. And as a result
> arch.nr_vmce_banks remains zero. I assume the guest needs to be
> initialized in some way as well, and that does not happen?

These functions should be called with Dom0 being the current domain,
but the struct domain * argument should certainly be that of the
DomU being saved/restored.

Guest initialization happens in vmce_init_vcpu(), called from
vcpu_initialise() (irrespective of the kind of domain, i.e. equally for
PV and HVM).

I spotted another problem with the patch, though: MCG_CAP reads
aren't reflecting the possibly non-host bank count. I'm in the process
of addressing this, but the whole MCG_* handling is bogus in being
per-domain instead of per-vCPU (and at least MCG_CAP is lacking
save/restore too).

Jan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: vMCE vs migration
  2012-02-10 21:28             ` Olaf Hering
  2012-02-13  9:30               ` Jan Beulich
@ 2012-02-13 10:36               ` Jan Beulich
  2012-02-13 14:20                 ` Olaf Hering
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-02-13 10:36 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 15928 bytes --]

>>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> These functions are called for dom0, but not for domU. And as a result
> arch.nr_vmce_banks remains zero. I assume the guest needs to be
> initialized in some way as well, and that does not happen?

Below/attached a fixed version of the patch.

Jan

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in
 
         domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
         domctl.domain = dom;
+        memset(&domctl.u, 0, sizeof(domctl.u));
         domctl.u.ext_vcpucontext.vcpu = i;
         if ( xc_domctl(xch, &domctl) < 0 )
         {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);           
 }
 
+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: caps %" PRIx64 "\n", p.caps);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default:
             printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H
 
 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern int firstbank;
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);
 
 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban
 
 extern int vmce_init(struct cpuinfo_x86 *c);
 
-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
           return 1;
     return 0;
 }
 
-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-        mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.mcg_cap & MCG_CAP_COUNT)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }
 
 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                  "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                  "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -42,7 +43,6 @@ int vmce_init_msr(struct domain *d)
            sizeof(dom_vmce(d)->mci_ctl));
 
     dom_vmce(d)->mcg_status = 0x0;
-    dom_vmce(d)->mcg_cap = g_mcg_cap;
     dom_vmce(d)->mcg_ctl = ~(uint64_t)0x0;
     dom_vmce(d)->nr_injection = 0;
 
@@ -61,21 +61,41 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }
 
-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.mcg_cap = g_mcg_cap;
+}
+
+int vmce_restore_vcpu(struct vcpu *v, uint64_t caps)
+{
+    if ( caps & ~g_mcg_cap & ~MCG_CAP_COUNT & ~MCG_CTL_P )
+    {
+        dprintk(XENLOG_G_ERR, "%s restore: unsupported MCA capabilities"
+                " %#" PRIx64 " for d%d:v%u (supported: %#Lx)\n",
+                is_hvm_vcpu(v) ? "HVM" : "PV", caps, v->domain->domain_id,
+                v->vcpu_id, g_mcg_cap & ~MCG_CAP_COUNT);
+        return -EPERM;
+    }
+
+    v->arch.mcg_cap = caps;
+    return 0;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;
 
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-            (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
+                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +146,7 @@ static int bank_mce_rdmsr(struct domain 
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +165,13 @@ static int bank_mce_rdmsr(struct domain 
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     *val = 0;
 
-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);
 
     switch ( msr )
     {
@@ -162,39 +182,38 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                        "MCE: rdmsr MCG_STATUS 0x%"PRIx64"\n", *val);
         break;
     case MSR_IA32_MCG_CAP:
-        *val = vmce->mcg_cap;
+        *val = cur->arch.mcg_cap;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CAP 0x%"PRIx64"\n",
                    *val);
         break;
     case MSR_IA32_MCG_CTL:
         /* Always 0 if no CTL support */
-        *val = vmce->mcg_ctl & h_mcg_ctl;
+        if ( cur->arch.mcg_cap & MCG_CTL_P )
+            *val = vmce->mcg_ctl & h_mcg_ctl;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CTL 0x%"PRIx64"\n",
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }
 
-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }
 
-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +221,9 @@ static int bank_mce_wrmsr(struct domain 
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +247,7 @@ static int bank_mce_wrmsr(struct domain 
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +266,9 @@ static int bank_mce_wrmsr(struct domain 
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     if ( !g_mcg_cap )
@@ -266,7 +285,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +312,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }
 
@@ -301,6 +320,46 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }
 
+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .caps = v->arch.mcg_cap
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no vcpu%u\n",
+                d->domain_id, vcpuid);
+        err = -EINVAL;
+    }
+    else
+        err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+
+    return err ?: vmce_restore_vcpu(v, ctxt.caps);
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -422,6 +422,8 @@ int vcpu_initialise(struct vcpu *v)
     if ( (rc = vcpu_init_fpu(v)) != 0 )
         return rc;
 
+    vmce_init_vcpu(v);
+
     if ( is_hvm_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip    = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->mcg_cap = v->arch.mcg_cap;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), mcg_cap) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,10 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
+                              sizeof(evc->mcg_cap) )
+                ret = vmce_restore_vcpu(v, evc->mcg_cap);
         }
 
         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint64_t mcg_cap;
     
     struct paging_vcpu paging;
 
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -16,7 +16,6 @@ struct bank_entry {
 struct domain_mca_msrs
 {
     /* Guest should not change below values after DOM boot up */
-    uint64_t mcg_cap;
     uint64_t mcg_ctl;
     uint64_t mcg_status;
     uint64_t *mci_ctl;
@@ -28,6 +27,8 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
+extern int vmce_restore_vcpu(struct vcpu *, uint64_t caps);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
 
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -575,9 +575,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint64_t caps;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /* 
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t         vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t         size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,7 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t         sysenter_callback_cs;
     uint8_t          syscall32_disables_events;
     uint8_t          sysenter_disables_events;
+    uint64_aligned_t mcg_cap;
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;


[-- Attachment #2: x86-vmce-sr.patch --]
[-- Type: text/plain, Size: 15603 bytes --]

+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -422,6 +422,8 @@ int vcpu_initialise(struct vcpu *v)
     if ( (rc = vcpu_init_fpu(v)) != 0 )
         return rc;
 
+    vmce_init_vcpu(v);
+
     if ( is_hvm_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip    = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->mcg_cap = v->arch.mcg_cap;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), mcg_cap) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,10 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
+                              sizeof(evc->mcg_cap) )
+                ret = vmce_restore_vcpu(v, evc->mcg_cap);
         }
 
         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint64_t mcg_cap;
     
     struct paging_vcpu paging;
 
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -16,7 +16,6 @@ struct bank_entry {
 struct domain_mca_msrs
 {
     /* Guest should not change below values after DOM boot up */
-    uint64_t mcg_cap;
     uint64_t mcg_ctl;
     uint64_t mcg_status;
     uint64_t *mci_ctl;
@@ -28,6 +27,8 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
+extern int vmce_restore_vcpu(struct vcpu *, uint64_t caps);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
 
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -575,9 +575,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint64_t caps;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /* 
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t         vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t         size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,7 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t         sysenter_callback_cs;
     uint8_t          syscall32_disables_events;
     uint8_t          sysenter_disables_events;
+    uint64_aligned_t mcg_cap;
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: vMCE vs migration
  2012-02-13  8:30                 ` Olaf Hering
@ 2012-02-13 10:43                   ` Jan Beulich
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2012-02-13 10:43 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

>>> On 13.02.12 at 09:30, Olaf Hering <olaf@aepfle.de> wrote:
> On Fri, Feb 10, Jan Beulich wrote:
> 
>> > Regarding these messages, are they guest addresses?
>> > 
>> > (XEN) traps.c:3131: GPF (0000): ffff82c4801d4c6c -> ffff82c480226eb4
>> 
>> No, these are hypervisor ones.
> 
> These are the functions they belong to:
> 
> (XEN) traps.c:3131: GPF (0000): ffff82c4801d61ec -> ffff82c48022aa2e
> (XEN) do_general_protection: ffff82c4801d61ec vmx_msr_read_intercept+0x2d8/0x358
> (XEN) do_general_protection: ffff82c48022aa2e vmac+0x6c8/0x99a

This appears to be the rdmsr_safe() at the bottom of the default case
in vmx_msr_read_intercept(), and I would expect those to go away
once the vMCE save/restore patch is working properly.
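For context, the fixup path quoted from do_general_protection() works roughly as sketched below: rdmsr_safe() records the address of its RDMSR instruction in an exception table alongside a fixup address, and on #GP the handler resumes at the fixup so the caller sees an error return instead of a crash. This is a simplified model, not Xen's real extable code; the two addresses are just the pair from the log:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per potentially-faulting instruction. */
struct exception_table_entry {
    uint64_t insn;   /* address of the instruction that may fault */
    uint64_t fixup;  /* where to resume after the fault */
};

static const struct exception_table_entry extable[] = {
    /* the vmx_msr_read_intercept / vmac pair from the log above */
    { 0xffff82c4801d61ecULL, 0xffff82c48022aa2eULL },
};

/* What do_general_protection() conceptually does with regs->eip before
 * overwriting it with the fixup address. */
static uint64_t search_exception_table(uint64_t eip)
{
    for (size_t i = 0; i < sizeof(extable) / sizeof(extable[0]); i++)
        if (extable[i].insn == eip)
            return extable[i].fixup;
    return 0;  /* no fixup registered: the fault is fatal */
}
```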

Jan

* Re: vMCE vs migration
  2012-02-13 10:36               ` Jan Beulich
@ 2012-02-13 14:20                 ` Olaf Hering
  2012-02-14 14:31                   ` Jan Beulich
  2012-02-14 14:43                   ` Jan Beulich
  0 siblings, 2 replies; 31+ messages in thread
From: Olaf Hering @ 2012-02-13 14:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Mon, Feb 13, Jan Beulich wrote:

> >>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
> > These functions are called for dom0, but not for domU. And as a result
> > arch.nr_vmce_banks remains zero. I assume the guest needs to be
> > initialized in some way as well, and that does not happen?
> 
> Below/attached a fixed version of the patch.

I get some mismatch after migration, both hosts run the same xen binary.
The small debug patch I use is attached.


Also: The tools do not catch the restore error so that the guest continues to
run on the source host.

Olaf

nicolai login: (XEN) vmce_init_vcpu 0 o 0 n 806
(XEN) vmce_init_vcpu 1 o 0 n 806
(XEN) vmce_init_vcpu 2 o 0 n 806
(XEN) vmce_init_vcpu 3 o 0 n 806
(XEN) save.c:62:d0 HVM restore (1): VM saved on one CPU (0x206c2) and restored on another (0x10676).
(XEN) save.c:234:d0 HVM restore: CPU 0
(XEN) save.c:234:d0 HVM restore: CPU 1
(XEN) save.c:234:d0 HVM restore: CPU 2
(XEN) save.c:234:d0 HVM restore: CPU 3
(XEN) save.c:234:d0 HVM restore: PIC 0
(XEN) save.c:234:d0 HVM restore: PIC 1
(XEN) save.c:234:d0 HVM restore: IOAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 1
(XEN) save.c:234:d0 HVM restore: LAPIC 2
(XEN) save.c:234:d0 HVM restore: LAPIC 3
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
(XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
(XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
(XEN) save.c:234:d0 HVM restore: PCI_LINK 0
(XEN) save.c:234:d0 HVM restore: PIT 0
(XEN) save.c:234:d0 HVM restore: RTC 0
(XEN) save.c:234:d0 HVM restore: HPET 0
(XEN) save.c:234:d0 HVM restore: PMTIMER 0
(XEN) save.c:234:d0 HVM restore: MTRR 0
(XEN) save.c:234:d0 HVM restore: MTRR 1
(XEN) save.c:234:d0 HVM restore: MTRR 2
(XEN) save.c:234:d0 HVM restore: MTRR 3
(XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
(XEN) save.c:291:d0 HVM restore mismatch: expected type 18 length 8, saw type 18 length 1
(XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff82c4802c7d28 ffffffff -1 o 806 n ea
(XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
(XEN) vmce_init_vcpu 0 o 0 n 806
(XEN) vmce_init_vcpu 1 o 0 n 806
(XEN) vmce_init_vcpu 2 o 0 n 806
(XEN) vmce_init_vcpu 3 o 0 n 806
(XEN) save.c:62:d0 HVM restore (2): VM saved on one CPU (0x206c2) and restored on another (0x10676).
(XEN) save.c:234:d0 HVM restore: CPU 0
(XEN) save.c:234:d0 HVM restore: CPU 1
(XEN) save.c:234:d0 HVM restore: CPU 2
(XEN) save.c:234:d0 HVM restore: CPU 3
(XEN) save.c:234:d0 HVM restore: PIC 0
(XEN) save.c:234:d0 HVM restore: PIC 1
(XEN) save.c:234:d0 HVM restore: IOAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 0
(XEN) save.c:234:d0 HVM restore: LAPIC 1
(XEN) save.c:234:d0 HVM restore: LAPIC 2
(XEN) save.c:234:d0 HVM restore: LAPIC 3
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
(XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
(XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
(XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
(XEN) save.c:234:d0 HVM restore: PCI_LINK 0
(XEN) save.c:234:d0 HVM restore: PIT 0
(XEN) save.c:234:d0 HVM restore: RTC 0
(XEN) save.c:234:d0 HVM restore: HPET 0
(XEN) save.c:234:d0 HVM restore: PMTIMER 0
(XEN) save.c:234:d0 HVM restore: MTRR 0
(XEN) save.c:234:d0 HVM restore: MTRR 1
(XEN) save.c:234:d0 HVM restore: MTRR 2
(XEN) save.c:234:d0 HVM restore: MTRR 3
(XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
(XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
(XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 (supported: 0x800)
(XEN) save.c:239:d0 HVM restore: failed to load entry 18/0



diff -r cbb1cce5fac0 xen/arch/x86/cpu/mcheck/mce.c
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -743,6 +743,7 @@ int mca_cap_init(void)
     {
         int i;
 
+        printk("%s: nr_mce_banks %x\n", __func__, nr_mce_banks);
         mca_allbanks = mcabanks_alloc();
         for ( i = 0; i < nr_mce_banks; i++)
             mcabanks_set(i, mca_allbanks);
diff -r cbb1cce5fac0 xen/arch/x86/cpu/mcheck/vmce.c
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -63,6 +63,7 @@ void vmce_destroy_msr(struct domain *d)
 
 void vmce_init_vcpu(struct vcpu *v)
 {
+    printk("%s %u o %lx n %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, g_mcg_cap);
     v->arch.mcg_cap = g_mcg_cap;
 }
 
@@ -331,6 +332,7 @@ static int vmce_save_vcpu_ctxt(struct do
         };
 
         err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        gdprintk(XENLOG_ERR, "%s %p %u %x %d o %lx\n", __func__, h, v->vcpu_id, err, err, ctxt.caps);
         if ( err )
             break;
     }
@@ -352,8 +354,11 @@ static int vmce_load_vcpu_ctxt(struct do
         err = -EINVAL;
     }
     else
+    {
         err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
 
+    gdprintk(XENLOG_ERR, "%s %p %x %d o %lx n %lx\n", __func__, h, err, err, v->arch.mcg_cap, ctxt.caps);
+    }
     return err ?: vmce_restore_vcpu(v, ctxt.caps);
 }
 
diff -r cbb1cce5fac0 xen/arch/x86/domctl.c
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,6 +1027,7 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip    = 0;
                 evc->syscall32_disables_events = 0;
             }
+            gdprintk(XENLOG_ERR, "%s %u n %lx o %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, evc->mcg_cap);
             evc->mcg_cap = v->arch.mcg_cap;
         }
         else
@@ -1061,6 +1062,7 @@ long arch_do_domctl(
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
 
+            gdprintk(XENLOG_ERR, "%s %u n %lx o %lx\n", __func__, v->vcpu_id, v->arch.mcg_cap, evc->mcg_cap);
             if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
                               sizeof(evc->mcg_cap) )
                 ret = vmce_restore_vcpu(v, evc->mcg_cap);
diff -r cbb1cce5fac0 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3129,6 +3129,8 @@ void do_general_protection(struct cpu_us
     {
         dprintk(XENLOG_INFO, "GPF (%04x): %p -> %p\n",
                 regs->error_code, _p(regs->eip), _p(fixup));
+ printk ("%s: %p ", __func__, _p(regs->eip)); print_symbol("%s\n", regs->eip);
+ printk ("%s: %p ", __func__, _p(fixup)); print_symbol("%s\n", fixup);
         regs->eip = fixup;
         return;
     }

* Re: vMCE vs migration
  2012-02-13 14:20                 ` Olaf Hering
@ 2012-02-14 14:31                   ` Jan Beulich
  2012-02-14 15:21                     ` Olaf Hering
  2012-02-14 14:43                   ` Jan Beulich
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-02-14 14:31 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

>>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> On Mon, Feb 13, Jan Beulich wrote:
> 
>> >>> On 10.02.12 at 22:28, Olaf Hering <olaf@aepfle.de> wrote:
>> > These functions are called for dom0, but not for domU. And as a result
>> > arch.nr_vmce_banks remains zero. I assume the guest needs to be
>> > initialized in some way as well, and that does not happen?
>> 
>> Below/attached a fixed version of the patch.
> 
> I get some mismatch after migration, both hosts run the same xen binary.

Are you sure about that? I'm asking because ...

> The small debug patch I use is attached.
> 
> 
> Also: The tools do not catch the restore error so that the guest continues
> to run on the source host.

> 
> Olaf
> 
> nicolai login: (XEN) vmce_init_vcpu 0 o 0 n 806
> (XEN) vmce_init_vcpu 1 o 0 n 806
> (XEN) vmce_init_vcpu 2 o 0 n 806
> (XEN) vmce_init_vcpu 3 o 0 n 806
> (XEN) save.c:62:d0 HVM restore (1): VM saved on one CPU (0x206c2) and restored on another (0x10676).
> (XEN) save.c:234:d0 HVM restore: CPU 0
> (XEN) save.c:234:d0 HVM restore: CPU 1
> (XEN) save.c:234:d0 HVM restore: CPU 2
> (XEN) save.c:234:d0 HVM restore: CPU 3
> (XEN) save.c:234:d0 HVM restore: PIC 0
> (XEN) save.c:234:d0 HVM restore: PIC 1
> (XEN) save.c:234:d0 HVM restore: IOAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 1
> (XEN) save.c:234:d0 HVM restore: LAPIC 2
> (XEN) save.c:234:d0 HVM restore: LAPIC 3
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
> (XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
> (XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
> (XEN) save.c:234:d0 HVM restore: PCI_LINK 0
> (XEN) save.c:234:d0 HVM restore: PIT 0
> (XEN) save.c:234:d0 HVM restore: RTC 0
> (XEN) save.c:234:d0 HVM restore: HPET 0
> (XEN) save.c:234:d0 HVM restore: PMTIMER 0
> (XEN) save.c:234:d0 HVM restore: MTRR 0
> (XEN) save.c:234:d0 HVM restore: MTRR 1
> (XEN) save.c:234:d0 HVM restore: MTRR 2
> (XEN) save.c:234:d0 HVM restore: MTRR 3
> (XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
> (XEN) save.c:291:d0 HVM restore mismatch: expected type 18 length 8, saw type 18 length 1

... this suggests the source host was still running with the old binary
(where the save record was indeed just 1 byte in size)...
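The "expected type 18 length 8, saw type 18 length 1" mismatch comes down to the descriptor check each save record goes through on restore. A reduced model of that check (illustrative names, not the real hvm_load_entry() internals) would look like:

```c
#include <assert.h>
#include <stdint.h>

/* Each HVM save record is preceded by a descriptor like this. */
struct hvm_save_descriptor {
    uint16_t typecode;  /* 18 for VMCE_VCPU */
    uint16_t instance;  /* vcpu id */
    uint32_t length;    /* payload length that follows */
};

/* The payload the new binary expects: an 8-byte caps field. */
struct hvm_vmce_vcpu {
    uint64_t caps;
};

/* The loader rejects a record whose type or length does not match what
 * this binary was compiled against -- hence the failure when the source
 * host still wrote the old 1-byte record. */
static int check_entry(const struct hvm_save_descriptor *desc,
                       uint16_t want_type, uint32_t want_len)
{
    if (desc->typecode != want_type || desc->length != want_len)
        return -1;  /* "HVM restore mismatch" */
    return 0;
}
```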

> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff82c4802c7d28 ffffffff -1 o 806 n ea
> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
> (XEN) vmce_init_vcpu 0 o 0 n 806
> (XEN) vmce_init_vcpu 1 o 0 n 806
> (XEN) vmce_init_vcpu 2 o 0 n 806
> (XEN) vmce_init_vcpu 3 o 0 n 806
> (XEN) save.c:62:d0 HVM restore (2): VM saved on one CPU (0x206c2) and restored on another (0x10676).
> (XEN) save.c:234:d0 HVM restore: CPU 0
> (XEN) save.c:234:d0 HVM restore: CPU 1
> (XEN) save.c:234:d0 HVM restore: CPU 2
> (XEN) save.c:234:d0 HVM restore: CPU 3
> (XEN) save.c:234:d0 HVM restore: PIC 0
> (XEN) save.c:234:d0 HVM restore: PIC 1
> (XEN) save.c:234:d0 HVM restore: IOAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 0
> (XEN) save.c:234:d0 HVM restore: LAPIC 1
> (XEN) save.c:234:d0 HVM restore: LAPIC 2
> (XEN) save.c:234:d0 HVM restore: LAPIC 3
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 0
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 1
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 2
> (XEN) save.c:234:d0 HVM restore: LAPIC_REGS 3
> (XEN) save.c:234:d0 HVM restore: PCI_IRQ 0
> (XEN) save.c:234:d0 HVM restore: ISA_IRQ 0
> (XEN) save.c:234:d0 HVM restore: PCI_LINK 0
> (XEN) save.c:234:d0 HVM restore: PIT 0
> (XEN) save.c:234:d0 HVM restore: RTC 0
> (XEN) save.c:234:d0 HVM restore: HPET 0
> (XEN) save.c:234:d0 HVM restore: PMTIMER 0
> (XEN) save.c:234:d0 HVM restore: MTRR 0
> (XEN) save.c:234:d0 HVM restore: MTRR 1
> (XEN) save.c:234:d0 HVM restore: MTRR 2
> (XEN) save.c:234:d0 HVM restore: MTRR 3
> (XEN) save.c:234:d0 HVM restore: VMCE_VCPU 0
> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 (supported: 0x800)

... whereas this one suggests that the size check passed (same binary
on both ends) but the feature sets of the hosts are different (sorry,
I can't suggest a way around this, the hosts must be sufficiently
matching in MCA features - those differences that are tolerable are
already being handled). I'm confused as to what was running on the
source host.

Let me check what bit 12 is: The latest SDM considers this reserved,
so the only way I can see how to deal with this is to switch the
setting up of g_mcg_cap from a black-listing approach to a white-listing
one.
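The bit arithmetic behind the rejection can be reproduced directly from the vmce_restore_vcpu() check in the patch. With the values in Olaf's log -- saved caps 0x1809 (9 banks, TES_P, plus reserved bit 12) against a destination g_mcg_cap of 0x806 (6 banks, TES_P) -- it is the reserved bit 12 that trips the check (a standalone sketch of the same expression, not the patched function itself):

```c
#include <assert.h>
#include <stdint.h>

/* Bitfield of MSR_IA32_MCG_CAP, as in x86_mca.h. */
#define MCG_CAP_COUNT 0x00000000000000ffULL  /* bank count field */
#define MCG_CTL_P     (1ULL << 8)

/* Bank-count and CTL_P differences are tolerated across migration;
 * any other capability the destination host lacks is fatal. */
static uint64_t unsupported_caps(uint64_t caps, uint64_t g_mcg_cap)
{
    return caps & ~g_mcg_cap & ~MCG_CAP_COUNT & ~MCG_CTL_P;
}
```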

> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

Bottom line - it appears to work as intended considering the second
attempt.

Jan

* Re: vMCE vs migration
  2012-02-13 14:20                 ` Olaf Hering
  2012-02-14 14:31                   ` Jan Beulich
@ 2012-02-14 14:43                   ` Jan Beulich
  2012-02-14 17:17                     ` Olaf Hering
  1 sibling, 1 reply; 31+ messages in thread
From: Jan Beulich @ 2012-02-14 14:43 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 511 bytes --]

>>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 (supported: 0x800)
> (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0

Attached an updated patch that sets up g_mcg_cap using a white-listing
bit mask, as mentioned in the previous response.

Note that this takes -unstable c/s 24781:6ae5506e49ab as prerequisite.

Jan


[-- Attachment #2: x86-vmce-sr.patch --]
[-- Type: text/plain, Size: 16833 bytes --]

This requires setting up g_mcg_cap using a white-listing bit mask.

Should we also save/restore MCG_CTL and MCi_CTL?

--- a/tools/libxc/xc_domain_save.c
+++ b/tools/libxc/xc_domain_save.c
@@ -1879,6 +1879,7 @@ int xc_domain_save(xc_interface *xch, in
 
         domctl.cmd = XEN_DOMCTL_get_ext_vcpucontext;
         domctl.domain = dom;
+        memset(&domctl.u, 0, sizeof(domctl.u));
         domctl.u.ext_vcpucontext.vcpu = i;
         if ( xc_domctl(xch, &domctl) < 0 )
         {
--- a/tools/misc/xen-hvmctx.c
+++ b/tools/misc/xen-hvmctx.c
@@ -383,6 +383,13 @@ static void dump_viridian_vcpu(void)
            (unsigned long long) p.apic_assist);           
 }
 
+static void dump_vmce_vcpu(void)
+{
+    HVM_SAVE_TYPE(VMCE_VCPU) p;
+    READ(p);
+    printf("    VMCE_VCPU: caps %" PRIx64 "\n", p.caps);
+}
+
 int main(int argc, char **argv)
 {
     int entry, domid;
@@ -449,6 +456,7 @@ int main(int argc, char **argv)
         case HVM_SAVE_CODE(MTRR): dump_mtrr(); break;
         case HVM_SAVE_CODE(VIRIDIAN_DOMAIN): dump_viridian_domain(); break;
         case HVM_SAVE_CODE(VIRIDIAN_VCPU): dump_viridian_vcpu(); break;
+        case HVM_SAVE_CODE(VMCE_VCPU): dump_vmce_vcpu(); break;
         case HVM_SAVE_CODE(END): break;
         default:
             printf(" ** Don't understand type %u: skipping\n",
--- a/xen/arch/x86/cpu/mcheck/mce.h
+++ b/xen/arch/x86/cpu/mcheck/mce.h
@@ -3,6 +3,7 @@
 #define _MCE_H
 
 #include <xen/init.h>
+#include <xen/sched.h>
 #include <xen/smp.h>
 #include <asm/types.h>
 #include <asm/traps.h>
@@ -54,8 +55,8 @@ int unmmap_broken_page(struct domain *d,
 u64 mce_cap_init(void);
 extern unsigned int firstbank;
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val);
-int intel_mce_wrmsr(uint32_t msr, uint64_t val);
+int intel_mce_rdmsr(const struct vcpu *, uint32_t msr, uint64_t *val);
+int intel_mce_wrmsr(struct vcpu *, uint32_t msr, uint64_t val);
 
 struct mcinfo_extended *intel_get_extended_msrs(
     struct mcinfo_global *mig, struct mc_info *mi);
@@ -171,18 +172,20 @@ int vmce_domain_inject(struct mcinfo_ban
 
 extern int vmce_init(struct cpuinfo_x86 *c);
 
-static inline int mce_vendor_bank_msr(uint32_t msr)
+static inline int mce_vendor_bank_msr(const struct vcpu *v, uint32_t msr)
 {
     if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-         msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks) )
+         msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
           return 1;
     return 0;
 }
 
-static inline int mce_bank_msr(uint32_t msr)
+static inline int mce_bank_msr(const struct vcpu *v, uint32_t msr)
 {
-    if ( (msr >= MSR_IA32_MC0_CTL && msr < MSR_IA32_MCx_CTL(nr_mce_banks)) ||
-        mce_vendor_bank_msr(msr) )
+    if ( (msr >= MSR_IA32_MC0_CTL &&
+          msr < MSR_IA32_MCx_CTL(v->arch.mcg_cap & MCG_CAP_COUNT)) ||
+         mce_vendor_bank_msr(v, msr) )
         return 1;
     return 0;
 }
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c
@@ -1421,11 +1421,12 @@ enum mcheck_type intel_mcheck_init(struc
 }
 
 /* intel specific MCA MSR */
-int intel_mce_wrmsr(uint32_t msr, uint64_t val)
+int intel_mce_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                  "Guest should not write this MSR!\n");
@@ -1435,11 +1436,12 @@ int intel_mce_wrmsr(uint32_t msr, uint64
     return ret;
 }
 
-int intel_mce_rdmsr(uint32_t msr, uint64_t *val)
+int intel_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
     int ret = 0;
 
-    if (msr >= MSR_IA32_MC0_CTL2 && msr < (MSR_IA32_MC0_CTL2 + nr_mce_banks))
+    if ( msr >= MSR_IA32_MC0_CTL2 &&
+         msr < MSR_IA32_MCx_CTL2(v->arch.mcg_cap & MCG_CAP_COUNT) )
     {
         mce_printk(MCE_QUIET, "We have disabled CMCI capability, "
                  "Guest should not read this MSR!\n");
--- a/xen/arch/x86/cpu/mcheck/vmce.c
+++ b/xen/arch/x86/cpu/mcheck/vmce.c
@@ -10,6 +10,7 @@
 #include <xen/delay.h>
 #include <xen/smp.h>
 #include <xen/mm.h>
+#include <xen/hvm/save.h>
 #include <asm/processor.h>
 #include <public/sysctl.h>
 #include <asm/system.h>
@@ -42,7 +43,6 @@ int vmce_init_msr(struct domain *d)
            nr_mce_banks * sizeof(*dom_vmce(d)->mci_ctl));
 
     dom_vmce(d)->mcg_status = 0x0;
-    dom_vmce(d)->mcg_cap = g_mcg_cap;
     dom_vmce(d)->mcg_ctl = ~(uint64_t)0x0;
     dom_vmce(d)->nr_injection = 0;
 
@@ -61,21 +61,41 @@ void vmce_destroy_msr(struct domain *d)
     dom_vmce(d) = NULL;
 }
 
-static int bank_mce_rdmsr(struct domain *d, uint32_t msr, uint64_t *val)
+void vmce_init_vcpu(struct vcpu *v)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    v->arch.mcg_cap = g_mcg_cap;
+}
+
+int vmce_restore_vcpu(struct vcpu *v, uint64_t caps)
+{
+    if ( caps & ~g_mcg_cap & ~MCG_CAP_COUNT & ~MCG_CTL_P )
+    {
+        dprintk(XENLOG_G_ERR, "%s restore: unsupported MCA capabilities"
+                " %#" PRIx64 " for d%d:v%u (supported: %#Lx)\n",
+                is_hvm_vcpu(v) ? "HVM" : "PV", caps, v->domain->domain_id,
+                v->vcpu_id, g_mcg_cap & ~MCG_CAP_COUNT);
+        return -EPERM;
+    }
+
+    v->arch.mcg_cap = caps;
+    return 0;
+}
+
+static int bank_mce_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
+{
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -1;
+    *val = 0;
 
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        *val = vmce->mci_ctl[bank] &
-            (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
+        if ( bank < nr_mce_banks )
+            *val = vmce->mci_ctl[bank] &
+                   (h_mci_ctrl ? h_mci_ctrl[bank] : ~0UL);
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MC%u_CTL 0x%"PRIx64"\n",
                    bank, *val);
         break;
@@ -126,7 +146,7 @@ static int bank_mce_rdmsr(struct domain 
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_rdmsr(msr, val);
+            ret = intel_mce_rdmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -145,13 +165,13 @@ static int bank_mce_rdmsr(struct domain 
  */
 int vmce_rdmsr(uint32_t msr, uint64_t *val)
 {
-    struct domain *d = current->domain;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    const struct vcpu *cur = current;
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     *val = 0;
 
-    spin_lock(&dom_vmce(d)->lock);
+    spin_lock(&vmce->lock);
 
     switch ( msr )
     {
@@ -162,39 +182,38 @@ int vmce_rdmsr(uint32_t msr, uint64_t *v
                        "MCE: rdmsr MCG_STATUS 0x%"PRIx64"\n", *val);
         break;
     case MSR_IA32_MCG_CAP:
-        *val = vmce->mcg_cap;
+        *val = cur->arch.mcg_cap;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CAP 0x%"PRIx64"\n",
                    *val);
         break;
     case MSR_IA32_MCG_CTL:
         /* Always 0 if no CTL support */
-        *val = vmce->mcg_ctl & h_mcg_ctl;
+        if ( cur->arch.mcg_cap & MCG_CTL_P )
+            *val = vmce->mcg_ctl & h_mcg_ctl;
         mce_printk(MCE_VERBOSE, "MCE: rdmsr MCG_CTL 0x%"PRIx64"\n",
                    *val);
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_rdmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_rdmsr(cur, msr, val) : 0;
         break;
     }
 
-    spin_unlock(&dom_vmce(d)->lock);
+    spin_unlock(&vmce->lock);
     return ret;
 }
 
-static int bank_mce_wrmsr(struct domain *d, u32 msr, u64 val)
+static int bank_mce_wrmsr(struct vcpu *v, u32 msr, u64 val)
 {
-    int bank, ret = 1;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    int ret = 1;
+    unsigned int bank = (msr - MSR_IA32_MC0_CTL) / 4;
+    struct domain_mca_msrs *vmce = dom_vmce(v->domain);
     struct bank_entry *entry = NULL;
 
-    bank = (msr - MSR_IA32_MC0_CTL) / 4;
-    if ( bank >= nr_mce_banks )
-        return -EINVAL;
-
     switch ( msr & (MSR_IA32_MC0_CTL | 3) )
     {
     case MSR_IA32_MC0_CTL:
-        vmce->mci_ctl[bank] = val;
+        if ( bank < nr_mce_banks )
+            vmce->mci_ctl[bank] = val;
         break;
     case MSR_IA32_MC0_STATUS:
         /* Give the first entry of the list, it corresponds to current
@@ -202,9 +221,9 @@ static int bank_mce_wrmsr(struct domain 
          * the guest, this node will be deleted.
          * Only error bank is written. Non-error banks simply return.
          */
-        if ( !list_empty(&dom_vmce(d)->impact_header) )
+        if ( !list_empty(&vmce->impact_header) )
         {
-            entry = list_entry(dom_vmce(d)->impact_header.next,
+            entry = list_entry(vmce->impact_header.next,
                                struct bank_entry, list);
             if ( entry->bank == bank )
                 entry->mci_status = val;
@@ -228,7 +247,7 @@ static int bank_mce_wrmsr(struct domain 
         switch ( boot_cpu_data.x86_vendor )
         {
         case X86_VENDOR_INTEL:
-            ret = intel_mce_wrmsr(msr, val);
+            ret = intel_mce_wrmsr(v, msr, val);
             break;
         default:
             ret = 0;
@@ -247,9 +266,9 @@ static int bank_mce_wrmsr(struct domain 
  */
 int vmce_wrmsr(u32 msr, u64 val)
 {
-    struct domain *d = current->domain;
+    struct vcpu *cur = current;
     struct bank_entry *entry = NULL;
-    struct domain_mca_msrs *vmce = dom_vmce(d);
+    struct domain_mca_msrs *vmce = dom_vmce(cur->domain);
     int ret = 1;
 
     if ( !g_mcg_cap )
@@ -266,7 +285,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         vmce->mcg_status = val;
         mce_printk(MCE_VERBOSE, "MCE: wrmsr MCG_STATUS %"PRIx64"\n", val);
         /* For HVM guest, this is the point for deleting vMCE injection node */
-        if ( d->is_hvm && (vmce->nr_injection > 0) )
+        if ( is_hvm_vcpu(cur) && (vmce->nr_injection > 0) )
         {
             vmce->nr_injection--; /* Should be 0 */
             if ( !list_empty(&vmce->impact_header) )
@@ -293,7 +312,7 @@ int vmce_wrmsr(u32 msr, u64 val)
         ret = -1;
         break;
     default:
-        ret = mce_bank_msr(msr) ? bank_mce_wrmsr(d, msr, val) : 0;
+        ret = mce_bank_msr(cur, msr) ? bank_mce_wrmsr(cur, msr, val) : 0;
         break;
     }
 
@@ -301,6 +320,46 @@ int vmce_wrmsr(u32 msr, u64 val)
     return ret;
 }
 
+static int vmce_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    struct vcpu *v;
+    int err = 0;
+
+    for_each_vcpu( d, v ) {
+        struct hvm_vmce_vcpu ctxt = {
+            .caps = v->arch.mcg_cap
+        };
+
+        err = hvm_save_entry(VMCE_VCPU, v->vcpu_id, h, &ctxt);
+        if ( err )
+            break;
+    }
+
+    return err;
+}
+
+static int vmce_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
+{
+    unsigned int vcpuid = hvm_load_instance(h);
+    struct vcpu *v;
+    struct hvm_vmce_vcpu ctxt;
+    int err;
+
+    if ( vcpuid >= d->max_vcpus || (v = d->vcpu[vcpuid]) == NULL )
+    {
+        dprintk(XENLOG_G_ERR, "HVM restore: dom%d has no vcpu%u\n",
+                d->domain_id, vcpuid);
+        err = -EINVAL;
+    }
+    else
+        err = hvm_load_entry(VMCE_VCPU, h, &ctxt);
+
+    return err ?: vmce_restore_vcpu(v, ctxt.caps);
+}
+
+HVM_REGISTER_SAVE_RESTORE(VMCE_VCPU, vmce_save_vcpu_ctxt,
+                          vmce_load_vcpu_ctxt, 1, HVMSR_PER_VCPU);
+
 int inject_vmce(struct domain *d)
 {
     int cpu = smp_processor_id();
@@ -457,7 +516,7 @@ int vmce_init(struct cpuinfo_x86 *c)
 
     rdmsrl(MSR_IA32_MCG_CAP, value);
     /* For Guest vMCE usage */
-    g_mcg_cap = value & ~MCG_CMCI_P;
+    g_mcg_cap = value & (MCG_CAP_COUNT | MCG_CTL_P | MCG_TES_P | MCG_SER_P);
     if (value & MCG_CTL_P)
         rdmsrl(MSR_IA32_MCG_CTL, h_mcg_ctl);
 
--- a/xen/arch/x86/cpu/mcheck/x86_mca.h
+++ b/xen/arch/x86/cpu/mcheck/x86_mca.h
@@ -30,12 +30,13 @@
 
 
 /* Bitfield of the MSR_IA32_MCG_CAP register */
-#define MCG_SER_P               (1UL<<24)
 #define MCG_CAP_COUNT           0x00000000000000ffULL
-#define MCG_CTL_P               0x0000000000000100ULL
-#define MCG_EXT_P		(1UL<<9)
-#define MCG_EXT_CNT		(16)
-#define MCG_CMCI_P		(1UL<<10)
+#define MCG_CTL_P               (1ULL<<8)
+#define MCG_EXT_P               (1ULL<<9)
+#define MCG_CMCI_P              (1ULL<<10)
+#define MCG_TES_P               (1ULL<<11)
+#define MCG_EXT_CNT             16
+#define MCG_SER_P               (1ULL<<24)
 /* Other bits are reserved */
 
 /* Bitfield of the MSR_IA32_MCG_STATUS register */
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -422,6 +422,8 @@ int vcpu_initialise(struct vcpu *v)
     if ( (rc = vcpu_init_fpu(v)) != 0 )
         return rc;
 
+    vmce_init_vcpu(v);
+
     if ( is_hvm_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1027,11 +1027,12 @@ long arch_do_domctl(
                 evc->syscall32_callback_eip    = 0;
                 evc->syscall32_disables_events = 0;
             }
+            evc->mcg_cap = v->arch.mcg_cap;
         }
         else
         {
             ret = -EINVAL;
-            if ( evc->size != sizeof(*evc) )
+            if ( evc->size < offsetof(typeof(*evc), mcg_cap) )
                 goto ext_vcpucontext_out;
 #ifdef __x86_64__
             if ( !is_hvm_domain(d) )
@@ -1059,6 +1060,10 @@ long arch_do_domctl(
                  (evc->syscall32_callback_cs & ~3) ||
                  evc->syscall32_callback_eip )
                 goto ext_vcpucontext_out;
+
+            if ( evc->size >= offsetof(typeof(*evc), mcg_cap) +
+                              sizeof(evc->mcg_cap) )
+                ret = vmce_restore_vcpu(v, evc->mcg_cap);
         }
 
         ret = 0;
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -488,6 +488,8 @@ struct arch_vcpu
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
+
+    uint64_t mcg_cap;
     
     struct paging_vcpu paging;
 
--- a/xen/include/asm-x86/mce.h
+++ b/xen/include/asm-x86/mce.h
@@ -16,7 +16,6 @@ struct bank_entry {
 struct domain_mca_msrs
 {
     /* Guest should not change below values after DOM boot up */
-    uint64_t mcg_cap;
     uint64_t mcg_ctl;
     uint64_t mcg_status;
     uint64_t *mci_ctl;
@@ -28,6 +27,8 @@ struct domain_mca_msrs
 /* Guest vMCE MSRs virtualization */
 extern int vmce_init_msr(struct domain *d);
 extern void vmce_destroy_msr(struct domain *d);
+extern void vmce_init_vcpu(struct vcpu *);
+extern int vmce_restore_vcpu(struct vcpu *, uint64_t caps);
 extern int vmce_wrmsr(uint32_t msr, uint64_t val);
 extern int vmce_rdmsr(uint32_t msr, uint64_t *val);
 
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -575,9 +575,15 @@ struct hvm_viridian_vcpu_context {
 
 DECLARE_HVM_SAVE_TYPE(VIRIDIAN_VCPU, 17, struct hvm_viridian_vcpu_context);
 
+struct hvm_vmce_vcpu {
+    uint64_t caps;
+};
+
+DECLARE_HVM_SAVE_TYPE(VMCE_VCPU, 18, struct hvm_vmce_vcpu);
+
 /* 
  * Largest type-code in use
  */
-#define HVM_SAVE_CODE_MAX 17
+#define HVM_SAVE_CODE_MAX 18
 
 #endif /* __XEN_PUBLIC_HVM_SAVE_X86_H__ */
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -559,7 +559,7 @@ struct xen_domctl_ext_vcpucontext {
     uint32_t         vcpu;
     /*
      * SET: Size of struct (IN)
-     * GET: Size of struct (OUT)
+     * GET: Size of struct (OUT, up to 128 bytes)
      */
     uint32_t         size;
 #if defined(__i386__) || defined(__x86_64__)
@@ -571,6 +571,7 @@ struct xen_domctl_ext_vcpucontext {
     uint16_t         sysenter_callback_cs;
     uint8_t          syscall32_disables_events;
     uint8_t          sysenter_disables_events;
+    uint64_aligned_t mcg_cap;
 #endif
 };
 typedef struct xen_domctl_ext_vcpucontext xen_domctl_ext_vcpucontext_t;
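As a hedged sketch (not the Xen source), the patch's two cooperating pieces can be pictured like this: the white-listing of host MCG_CAP bits from the vmce_init() hunk, plus a restore-time check in the spirit of vmce_restore_vcpu(). The MCG_* values match the x86_mca.h hunk above; vmce_check_caps() and guest_mcg_cap() are hypothetical stand-ins, not the actual function bodies.

```c
#include <stdint.h>

#define MCG_CAP_COUNT  0x00000000000000ffULL  /* number of reporting banks */
#define MCG_CTL_P      (1ULL << 8)
#define MCG_CMCI_P     (1ULL << 10)
#define MCG_TES_P      (1ULL << 11)
#define MCG_SER_P      (1ULL << 24)

/* Bits a guest may be shown, per the vmce_init() hunk above. */
#define GUEST_MCG_MASK (MCG_CAP_COUNT | MCG_CTL_P | MCG_TES_P | MCG_SER_P)

/* White-list the host's MCG_CAP down to what guests get to see. */
uint64_t guest_mcg_cap(uint64_t host_mcg_cap)
{
    return host_mcg_cap & GUEST_MCG_MASK;
}

/* Restore-time check: reject an incoming per-vcpu mcg_cap carrying
 * feature bits the destination host does not offer.  The bank count is
 * deliberately excluded from the comparison - it may legitimately
 * differ between hosts, which is the whole point of saving it. */
int vmce_check_caps(uint64_t incoming, uint64_t supported)
{
    return (incoming & ~supported & ~MCG_CAP_COUNT) ? -1 : 0;
}
```

With this shape, a guest image saved on a 9-bank host passes the check on a 6-bank destination as long as its feature bits are a subset of what the destination whitelists, which is consistent with the successful 9-to-6 bank migration reported later in the thread.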



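The domctl hunk in the patch grows xen_domctl_ext_vcpucontext and replaces the exact-size check with an offsetof()-based one, so old toolstacks that do not send mcg_cap keep working. A minimal sketch of that struct-extension pattern follows; the struct layout and handle_set() are hypothetical simplifications, not the real xen_domctl_ext_vcpucontext.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in: the new mcg_cap field was appended at the end,
 * so its offset marks the boundary between old and new layouts. */
struct ext_vcpucontext {
    uint32_t size;     /* IN: how much the caller actually filled in */
    uint32_t vcpu;
    uint64_t mcg_cap;  /* new field, absent from older callers */
};

/* Accept both old and new callers: anything at least as large as the
 * pre-mcg_cap layout is valid; the new field is only consumed when the
 * caller's declared size says it is present. */
int handle_set(const struct ext_vcpucontext *evc, uint64_t *out_caps)
{
    if ( evc->size < offsetof(struct ext_vcpucontext, mcg_cap) )
        return -1;                                  /* truncated request */
    if ( evc->size >= offsetof(struct ext_vcpucontext, mcg_cap)
                      + sizeof(evc->mcg_cap) )
        *out_caps = evc->mcg_cap;                   /* new-style caller */
    return 0;
}
```

An old-style caller passing size == offsetof(..., mcg_cap) is accepted without touching the new field, which is exactly why the patch relaxes the `evc->size != sizeof(*evc)` test.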

* Re: vMCE vs migration
  2012-02-14 14:31                   ` Jan Beulich
@ 2012-02-14 15:21                     ` Olaf Hering
  0 siblings, 0 replies; 31+ messages in thread
From: Olaf Hering @ 2012-02-14 15:21 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Tue, Feb 14, Jan Beulich wrote:

> >>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> >> Below/attached a fixed version of the patch.
> > 
> > I get some mismatch after migration, both hosts run the same xen binary.
> 
> Are you sure about that? I'm asking because ...

Yes.
But I will move to another source host, since the current one now has
only a 10MBit/sec connection for some reason, and I will double-check
that both indeed run the same code.

Olaf


* Re: vMCE vs migration
  2012-02-14 14:43                   ` Jan Beulich
@ 2012-02-14 17:17                     ` Olaf Hering
  0 siblings, 0 replies; 31+ messages in thread
From: Olaf Hering @ 2012-02-14 17:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

On Tue, Feb 14, Jan Beulich wrote:

> >>> On 13.02.12 at 15:20, Olaf Hering <olaf@aepfle.de> wrote:
> > (XEN) vmce.c:360:d0 vmce_load_vcpu_ctxt ffff83082e377d28 0 0 o 806 n 1809
> > (XEN) vmce.c:77: HVM restore: unsupported MCA capabilities 0x1809 for d2:v0 
> > (supported: 0x800)
> > (XEN) save.c:239:d0 HVM restore: failed to load entry 18/0
> 
> Attached an updated patch that sets up g_mcg_cap using a white-listing
> bit mask, as mentioned in the previous response.
> 
> Note that this takes -unstable c/s 24781:6ae5506e49ab as prerequisite.

Thanks, this patch works ok for me, migrating from a host with 9 MCE
banks to a host with 6 MCE banks.

Olaf


* Re: vMCE vs migration
@ 2012-02-13  9:35 Jan Beulich
  0 siblings, 0 replies; 31+ messages in thread
From: Jan Beulich @ 2012-02-13  9:35 UTC (permalink / raw)
  To: Olaf Hering; +Cc: George.Dunlap, xen-devel, George Dunlap, Ian Campbell

> Guest initialization happens in vmce_init_vcpu(), called from
> vcpu_initialise() (irrespective of the kind of domain, i.e. equally for
> PV and HVM).

And wrong I was - HVM guests get filtered much earlier in that function.
I'll get back with an updated patch hopefully soon.

Jan
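For reference, the per-vcpu initialization path discussed here is simple in shape: each new vcpu is seeded with the host-derived, white-listed capability view. A hypothetical sketch (the struct and globals below are simplifications, not the Xen definitions):

```c
#include <stdint.h>

static uint64_t g_mcg_cap;   /* set up once in vmce_init(), white-listed */

struct vcpu {
    uint64_t mcg_cap;        /* per-vcpu view, as introduced by the patch */
};

/* Called from vcpu_initialise() for PV and HVM domains alike. */
void vmce_init_vcpu(struct vcpu *v)
{
    v->mcg_cap = g_mcg_cap;  /* every new vcpu starts with the host view */
}
```

A later restore may then overwrite v->mcg_cap with the saved value, subject to the capability check discussed earlier in the thread.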


Thread overview: 31+ messages
2012-01-23 11:08 vMCE vs migration Jan Beulich
2012-01-24 10:29 ` George Dunlap
2012-01-24 11:08   ` Jan Beulich
2012-01-26 16:54     ` George Dunlap
2012-01-30  7:52       ` Jan Beulich
2012-01-30 13:47       ` Jan Beulich
2012-01-31 11:27         ` George Dunlap
2012-01-31 11:28           ` George Dunlap
2012-01-31 13:17           ` Jan Beulich
2012-01-31 14:34             ` George Dunlap
2012-02-03  7:18         ` Liu, Jinsong
2012-02-03  8:08           ` Jan Beulich
2012-02-03 12:34             ` Liu, Jinsong
2012-02-03 14:04               ` Jan Beulich
2012-02-04 12:35                 ` George Dunlap
2012-02-09 18:02         ` Olaf Hering
2012-02-10 10:03           ` Jan Beulich
2012-02-10 16:53             ` Olaf Hering
2012-02-10 17:00               ` Jan Beulich
2012-02-10 17:05                 ` Olaf Hering
2012-02-13  8:30                 ` Olaf Hering
2012-02-13 10:43                   ` Jan Beulich
2012-02-10 21:28             ` Olaf Hering
2012-02-13  9:30               ` Jan Beulich
2012-02-13 10:36               ` Jan Beulich
2012-02-13 14:20                 ` Olaf Hering
2012-02-14 14:31                   ` Jan Beulich
2012-02-14 15:21                     ` Olaf Hering
2012-02-14 14:43                   ` Jan Beulich
2012-02-14 17:17                     ` Olaf Hering
2012-02-13  9:35 Jan Beulich
