* [RFC] RAS(Part II)--MCA enabling in XEN
@ 2009-02-16  5:35 Ke, Liping
From: Ke, Liping @ 2009-02-16  5:35 UTC (permalink / raw)
  To: Keir Fraser, Christoph Egger, Frank.Vanderlinden, Gavin Maltby, Jia
  Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 2891 bytes --]

Hi, all
These patches enable MCA in XEN. They are sent as an RFC first, to collect feedback for refinement if
needed before the final patch. We also attach a text document describing the design for your reference.
 
Some implementation notes:
1) When an error happens and it is fatal (pcc = 1) or cannot be recovered (pcc = 0, but no good recovery method
    exists), we reset the machine immediately to avoid losing logs in DOM0. Most MCA MSRs are sticky, so after
    reboot the MCA polling mechanism sends a vIRQ to DOM0 for logging.
2) When MCE# happens, all CPUs enter MCA context. The first CPU to read and clear an error MSR bank becomes the
    owner of that MCE#. Locks/synchronization are used to determine the owner and to select the most severe error.
3) For convenience, we select the most offending CPU to do most of the processing and recovery work.
4) When MCE# happens, we do three jobs:
    a. Send vIRQ to DOM0 for logging
    b. Send vMCE# to the impacted guest (currently injected only when the impacted guest is DOM0)
    c. Guest vMCE MSR virtualization
5) Some further improvements/additions might be made if needed:
    a) Impacted DOM judgement algorithm.
    b) vMCE# injection is currently controlled by centralized data (vmce_data), and the injection algorithm is a
        bit complex. We could change to an algorithm based on per-domain data if you prefer.
        Notes for understanding:
        1) If several banks impact one domain and those banks belong to the same pCPU, the vMCE# is injected only once.
        2) If more than one bank impacts one domain and the error banks belong to different pCPUs, it is injected
            nr_num(pCPU) times.
        3) We use centralized data [two arrays, impact_domid and impact_cpus, in vmce_data] to represent the injection
            algorithm. The two array items (idx, impact_domid) and (idx, impact_cpus) combine into one item
            (idx, impact_domid, impact_cpus), which records the impacted domain id and the map of error pCPUs
            (the CPUs on which UC errors impacting this domain were found). From this we can decide how to inject
            the vMCE (domid, impact_times[nr_pCPUs]).
        4) Although the data structure is ready, we currently inject vMCE# only to DOM0.
    c) Connection with recovery actions (cpu/memory online/offline)
    d) More refinement and testing for HVM might be done when needed.
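
The injection counting in notes 1) to 3) under 5b above can be sketched roughly as follows. This is only an
illustration, not the code in the patch: the helper names (vmce_sketch_init, record_impact, injections_for),
the 64-bit CPU mask and the value of MAX_IMPACT_DOMAIN are all assumptions made for the example.

```c
#include <stdint.h>

#define MAX_IMPACT_DOMAIN 4   /* assumed value, for illustration only */
#define NO_DOMAIN (-1)

/* Simplified stand-in for vmce_data: one (domid, pCPU map) pair per slot. */
struct vmce_sketch {
    int      impact_domid[MAX_IMPACT_DOMAIN];
    uint64_t impact_cpus[MAX_IMPACT_DOMAIN];  /* bit n set: pCPU n saw an error */
};

/* Clear all slots, as done when a new MCE# arrives. */
static void vmce_sketch_init(struct vmce_sketch *v)
{
    for (int i = 0; i < MAX_IMPACT_DOMAIN; i++) {
        v->impact_domid[i] = NO_DOMAIN;
        v->impact_cpus[i] = 0;
    }
}

/* Record that an error bank on pCPU `cpu` impacts domain `domid`.
 * Returns 0 on success, -1 on bad arguments or when the table is full. */
static int record_impact(struct vmce_sketch *v, int domid, unsigned int cpu)
{
    if (domid < 0 || cpu >= 64)
        return -1;
    for (int i = 0; i < MAX_IMPACT_DOMAIN; i++) {
        if (v->impact_domid[i] == NO_DOMAIN)
            v->impact_domid[i] = domid;       /* first record for this domain */
        if (v->impact_domid[i] == domid) {
            /* Same pCPU again: the bit is already set, so no extra injection. */
            v->impact_cpus[i] |= (uint64_t)1 << cpu;
            return 0;
        }
    }
    return -1;
}

/* Number of vMCE# injections owed to `domid`: one per distinct pCPU. */
static int injections_for(const struct vmce_sketch *v, int domid)
{
    for (int i = 0; i < MAX_IMPACT_DOMAIN; i++) {
        if (v->impact_domid[i] == domid) {
            int n = 0;
            for (uint64_t m = v->impact_cpus[i]; m != 0; m &= m - 1)
                n++;                          /* count set bits */
            return n;
        }
    }
    return 0;
}
```

With this sketch, two error banks on pCPU 2 that both impact DOM0 give one injection, and a further bank on
pCPU 5 raises the count to two.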
 
Patch Description:
1. basic_mca_support: Enable MCA support in XEN. 
2. vmsr_virtualization: Guest MCE# MSR read/write virtualization support in XEN.
3. mce_dom0: DOM0 side, cooperating with XEN. Adds vIRQ and vMCE# handlers, translates the XEN log for DOM0, and
    re-uses the Linux kernel MCE handler and MCELOG mechanisms. This is mainly a demonstration patch.
 
About Test:
We ran some internal tests and the results were fine.
 
Any feedback is welcome and thanks a lot for your help! :-)
Regards,
Criping

[-- Attachment #2: MCA_desc.txt --]
[-- Type: text/plain, Size: 3269 bytes --]

This document briefly describes the series of patches for MCA enabling in XEN.

With hardware MCA support now available in newer x86 platforms, and with increasing
software demand, we are enhancing MCA support in XEN upstream. The corrected
error handling (CMCI) part is already upstream. This document focuses on uncorrected error handling.
The current XEN upstream MCA support was checked in by Christoph and already brought
many improvements. Most of our MCA work is based on it.

Our Target:
1) Narrow MCE# impact. Try to keep the system/guests running as much as possible.
2) Log information in DOM0 as much as possible.

Differences from the current implementation:
When enabling MCA on Intel platforms, we also made some changes:
1) Xen handles the MCA itself, i.e. Xen decides the impacted components, takes recovery action,
   injects virtual MCA to the guest, etc. In particular, Xen injects vMCE# directly to the impacted DOM,
   avoiding the current notification path from DOM0 to DOMU.
2) Xen provides MCA MSR virtualization so that the guest's native #MC handler can run without changes.
   With this method, we benefit from guest #MC handler enhancements and do not need to maintain a PV MCA
   handler. See http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00643.html for more
   information on how to support guest MCA.
3) We add an MCE owner judgement algorithm to the XEN MCE# handler, since some MCA banks are shared among CPUs.
4) We adopt two-round bank scanning: on fatal/non-recoverable errors the system is reset without clearing
   the MCA MSRs, so that the MCA information is logged after reboot.
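
To illustrate item 4), here is a rough sketch of what the first scanning round decides. It is not the patch
code: classify_bank, first_round_must_reset and enum scan_result are hypothetical names, and bank status
values are passed in as an array instead of being read with rdmsrl. The flag bit positions follow the
IA32_MCi_STATUS layout in the Intel SDM.

```c
#include <stdint.h>

/* IA32_MCi_STATUS flag bits (Intel SDM layout). */
#define MCi_STATUS_VAL (1ULL << 63)  /* register holds a valid error */
#define MCi_STATUS_UC  (1ULL << 61)  /* error was not corrected */
#define MCi_STATUS_EN  (1ULL << 60)  /* error reporting was enabled */
#define MCi_STATUS_PCC (1ULL << 57)  /* processor context corrupt */

enum scan_result { SCAN_IGNORE, SCAN_RECOVERABLE, SCAN_FATAL };

/* Classify one bank status value the way the MCE# handler filters banks. */
static enum scan_result classify_bank(uint64_t status)
{
    if (!(status & MCi_STATUS_VAL))
        return SCAN_IGNORE;       /* nothing logged in this bank */
    if (!(status & MCi_STATUS_UC))
        return SCAN_IGNORE;       /* corrected error: left to CMCI/polling */
    if (!(status & MCi_STATUS_EN))
        return SCAN_IGNORE;       /* not signalled by MCE#: left to polling */
    if (status & MCi_STATUS_PCC)
        return SCAN_FATAL;
    return SCAN_RECOVERABLE;
}

/* Round one: report whether a reset is needed. The status values are only
 * read, never cleared, so sticky banks can still be logged after reboot. */
static int first_round_must_reset(const uint64_t *bank_status, int nr_banks)
{
    for (int i = 0; i < nr_banks; i++)
        if (classify_bank(bank_status[i]) == SCAN_FATAL)
            return 1;
    return 0;
}
```

Only when round one finds no fatal bank does the second round read, record and clear the banks.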

Some detailed notes for MCA handling:
1)  When MCE# happens and the error is fatal (pcc=1) or can't be recovered, we reset the machine to avoid
     losing LOGS in DOM0 as much as possible. Since the MSR banks are sticky, the MCA polling mechanism
     takes care of the LOG after reboot. This is why we adopt two-round bank scanning.
2)  When MCE# happens, all CPUs enter MCA context. The first CPU to read and clear an error MSR bank becomes the
     owner of that MCE#. Locks/synchronization are used to determine the owner and to select the most severe error.
3)  XEN provides MCA MSR virtualization to guests so that the guest's native handler can be reused.
    Currently, we only virtualize MSR read/write.
4)  When MCE# happens, we do the following jobs:
    a. Send vIRQ to DOM0 for logging. Log as completely as possible.
    b. Send vMCE# to the impacted guest (currently we inject only if the impacted guest is Dom0)
    c. Guest vMCE MSR virtualization
    d. Recovery action in XEN (offline the offending page).
MCE# processing sequence flow, for your reference:
1)  MCE# happens and invokes the XEN MCE# handler.
2)  The XEN MCE# handler judges the severity and the impacted domain, and decides whether to reset the
    whole system or to attempt recovery.
3)  If the error can be recovered, send vIRQ to DOM0 for logging, send vMCE# to the impacted guest
    (currently we only inject to DOM0), and continue with the recovery action.
4)  The guest MCA handler is invoked after receiving the injected vMCE#. Guest reads/writes of the MCA MSR
    banks are trapped by the HV (vMCE# MSR virtualization).
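
As an illustration of step 4), a trapped guest read of a bank MSR could be served from a per-domain snapshot
roughly like this. It is a sketch only: struct vbank_entry, decode_bank_msr, vmce_rdmsr and SKETCH_NR_BANKS
are made up for the example, and the CTL registers are deliberately not modelled. The only architectural
facts relied on are that the bank MSRs start at 0x400 (IA32_MC0_CTL) and that each bank occupies four
consecutive MSRs (CTL, STATUS, ADDR, MISC).

```c
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_MC0_CTL 0x400u  /* bank 0 base; each bank has 4 MSRs */
#define SKETCH_NR_BANKS  6u      /* assumed bank count, illustration only */

enum mc_reg { MC_REG_CTL = 0, MC_REG_STATUS, MC_REG_ADDR, MC_REG_MISC };

/* Simplified error snapshot queued for a guest. */
struct vbank_entry {
    uint64_t mci_status[SKETCH_NR_BANKS];
    uint64_t mci_addr[SKETCH_NR_BANKS];
    uint64_t mci_misc[SKETCH_NR_BANKS];
};

/* Split an MSR index into (bank, register). Returns -1 if the index is
 * outside the modelled bank range. */
static int decode_bank_msr(uint32_t msr, unsigned int *bank, enum mc_reg *reg)
{
    if (msr < MSR_IA32_MC0_CTL ||
        msr >= MSR_IA32_MC0_CTL + 4 * SKETCH_NR_BANKS)
        return -1;
    *bank = (msr - MSR_IA32_MC0_CTL) / 4;
    *reg = (enum mc_reg)((msr - MSR_IA32_MC0_CTL) % 4);
    return 0;
}

/* Serve a trapped guest read. `e` is the snapshot being delivered, or NULL
 * when no vMCE# is pending, in which case the guest reads zero. */
static int vmce_rdmsr(const struct vbank_entry *e, uint32_t msr, uint64_t *val)
{
    unsigned int bank;
    enum mc_reg reg;

    if (decode_bank_msr(msr, &bank, &reg) != 0 || reg == MC_REG_CTL)
        return -1;             /* not handled by this sketch */
    if (e == NULL) {
        *val = 0;
        return 0;
    }
    switch (reg) {
    case MC_REG_STATUS: *val = e->mci_status[bank]; break;
    case MC_REG_ADDR:   *val = e->mci_addr[bank];   break;
    default:            *val = e->mci_misc[bank];   break;
    }
    return 0;
}
```

For example, a guest read of MSR 0x409 decodes to the STATUS register of bank 2 and returns the value saved
for that bank.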



[-- Attachment #3: basic_mca_support.patch --]
[-- Type: application/octet-stream, Size: 33217 bytes --]

diff -r 2fe33f3403f5 xen/arch/x86/cpu/mcheck/mce_intel.c
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c	Mon Feb 16 19:02:16 2009 +0800
@@ -4,9 +4,11 @@
 #include <xen/event.h>
 #include <xen/kernel.h>
 #include <xen/smp.h>
+#include <xen/delay.h>
 #include <asm/processor.h> 
 #include <asm/system.h>
 #include <asm/msr.h>
+#include <xen/softirq.h>
 #include "mce.h"
 #include "x86_mca.h"
 
@@ -162,7 +164,7 @@
     struct mc_info *mi = NULL;
     int exceptions = (read_cr4() & X86_CR4_MCE);
     int i, nr_unit = 0, uc = 0, pcc = 0;
-    uint64_t status, addr;
+    uint64_t status;
     struct mcinfo_global mcg;
     struct mcinfo_extended mce;
     unsigned int cpu;
@@ -226,8 +228,8 @@
         if (status & MCi_STATUS_MISCV)
             rdmsrl(MSR_IA32_MC0_MISC + 4 * i, mcb.mc_misc);
         if (status & MCi_STATUS_ADDRV) {
-            rdmsrl(MSR_IA32_MC0_ADDR + 4 * i, addr);
-            d = maddr_get_owner(addr);
+            rdmsrl(MSR_IA32_MC0_ADDR + 4 * i, mcb.mc_addr);
+            d = maddr_get_owner(mcb.mc_addr);
             if ( d && (calltype == MC_FLAG_CMCI || calltype == MC_FLAG_POLLED) )
                 mcb.mc_domid = d->domain_id;
         }
@@ -252,7 +254,7 @@
         mcg.mc_flags |= MC_FLAG_UNCORRECTABLE;
     else if (uc)
         mcg.mc_flags |= MC_FLAG_RECOVERABLE;
-    else /* correctable */
+    else if (nr_unit) /* correctable */
         mcg.mc_flags |= MC_FLAG_CORRECTABLE;
 
     if (nr_unit && nr_intel_ext_msrs && 
@@ -266,73 +268,561 @@
     return mi;
 }
 
+/* Below are for MCE handling */
+
+/* Log worst error severity and offending CPU.
+ * Pick this CPU for further processing in softirq */
+static int severity_cpu = -1;
+static int worst = 0;
+
+/* Lock of enter point@second round scanning in MCE# handler */
+static cpumask_t scanned_cpus;
+/* Lock for enter point@Critical Section in MCE# handler */
+static bool_t mce_enter_lock = 0;
+/* Record how many CPUs impacted in this MCE# */
+static cpumask_t impact_map;
+
+/* Lock of softirq rendezvous entering point */
+static cpumask_t mced_cpus;
+/*Lock of softirq rendezvous leaving point */
+static cpumask_t finished_cpus;
+/* Lock for picking one processing CPU */
+static bool_t mce_process_lock = 0;
+
+/* Spinlock for vMCE# MSR virtualization data */
+static DEFINE_SPINLOCK(mce_locks);
+/* Param for vMCE# injection */
+DEFINE_PER_CPU(struct softirq_trap, mce_softirq_trap);
+
+
+/* Local buffer for holding MCE# data temporarily, sharing between mce
+ * handler and softirq handler. Local buffer will be finally copied to
+ * global buffer for DOM0 LOG and per_dom related data for guest vMCE#
+ * MSR virtualization.
+ * Note: When local buffer is still in processing in softirq, another
+ * MCA comes, simply panic.
+ * TODO: We might improve this further with a lockless ring if
+ * necessary
+ */
+struct mc_local_t
+{
+    bool_t in_use;
+    struct mc_info mc[NR_CPUS];
+};
+static struct mc_local_t mc_local;
+
+/* For vMCE injection reference. It holds impacted domains and
+ * injection times for each impacted domain.
+ */
+struct intel_vmce_inject vmce_data;
+
+/* When a new MCE# comes, XEN handler will clear the old vMCE
+ * injection reference data. */
+static void init_vmce_data(void) {
+
+    for (int i = 0; i < MAX_IMPACT_DOMAIN; i++) {
+        vmce_data.impact_domid[i] = -1;
+        cpus_clear(vmce_data.impact_cpus[i]);
+    }
+}
+
+/* This node list records errors impacting a domain. when one
+ * MCE# happens, one error bank impact a domain. This error node
+ * will be inserted to the tail of the per_dom data for vMCE# MSR
+ * virtualization. When one vMCE# injection is finished, the corresponding
+ * node will be deleted. This node list is for GUEST vMCE# MSRS 
+ * virtualization.
+ */
+static struct bank_entry* alloc_bank_entry(void) {
+    struct bank_entry *entry;
+
+    entry = xmalloc(struct bank_entry);
+    if (!entry) {
+        printk(KERN_ERR "MCE: malloc bank_entry failed\n");
+        return NULL;
+    }
+    memset(entry, 0x0, sizeof(*entry));
+    INIT_LIST_HEAD(&entry->list);
+    entry->cpu = -1;
+    return entry;
+}
+
+/* Fill error bank info to #vMCE injection ref data and GUEST vMCE#
+ * MSR virtualization data
+*/
+static int fill_vmsr_data(int cpu, struct mcinfo_bank *mc_bank, 
+        uint64_t gstatus) {
+    int32_t idx, flag_new = 0;
+    struct domain *d;
+    struct bank_entry *entry;
+
+    /* This error bank impacts some DOMs, we need to fill domain related
+     * data for vMCE MSRs virtualization and vMCE# injection */
+    if (mc_bank->mc_domid != (uint16_t)~0) {
+        d = get_domain_by_id(mc_bank->mc_domid);
+
+        /* Not impact a valid domain, skip this error of the bank */
+        if (!d) {
+            printk(KERN_DEBUG "MCE: Not found valid impacted DOM\n");
+            return 0;
+        }
+
+        for (idx = 0; idx < MAX_IMPACT_DOMAIN; idx++) {
+            if (vmce_data.impact_domid[idx] == mc_bank->mc_domid) {
+                /* Note: only when the error on DIFF pCPUs,
+                 * will it be injected nr_pCPUs times. Several errors
+                 * offending one CPU which impact one domain will be
+                 * put into the one node in the impact_header list.
+                 * Correspondingly, this error is injected only once.
+                 */
+
+                if (cpu_isset(cpu, vmce_data.impact_cpus[idx])) {
+                    /* Same CPU diff bank, no need to alloc new node */
+                    printk(KERN_DEBUG "MCE: No Need to alloc node!\n");
+                    if (!list_empty(&d->arch.vmca_msrs.impact_header)) {
+                        entry = list_entry(
+                            d->arch.vmca_msrs.impact_header.prev, 
+                            struct bank_entry, list);
+                    }
+                    else {
+                        printk(KERN_ERR "MCE: impact list should"
+                            " not be empty !\n");
+                        return -1;
+                    }
+                 }
+               /* Diff CPU, need to alloc new node */
+                else {
+                    printk(KERN_DEBUG "MCE: alloc Node for DOM%d\n",
+                        mc_bank->mc_domid);
+                    entry = alloc_bank_entry();
+                    flag_new = 1;
+                }
+                cpu_set(cpu, vmce_data.impact_cpus[idx]);
+                break;
+            }
+            /* First node of the impact DOM */
+            else if (vmce_data.impact_domid[idx] == -1) {
+                printk(KERN_DEBUG "MCE: fill new record"
+                    "(IDX%d, DOM%d, CPU%d)\n", 
+                    idx, mc_bank->mc_domid, cpu);
+                vmce_data.impact_domid[idx] = 
+                                mc_bank->mc_domid;
+                printk(KERN_DEBUG "MCE: alloc Node for DOM%d\n",
+                    mc_bank->mc_domid);
+                entry = alloc_bank_entry();
+                flag_new = 1;
+                cpu_set(cpu, vmce_data.impact_cpus[idx]);
+                /* Fill MSR global status */
+                d->arch.vmca_msrs.mcg_status = gstatus;
+                break;
+            }
+        }
+        if (idx >= MAX_IMPACT_DOMAIN) {
+            printk(KERN_ERR "MCE: Errors impacts too many domains\n");
+            return -1;
+        }
+        entry->mci_status[mc_bank->mc_bank] = mc_bank->mc_status;
+        entry->mci_addr[mc_bank->mc_bank] = mc_bank->mc_addr;
+        entry->mci_misc[mc_bank->mc_bank] = mc_bank->mc_misc;
+
+        /* Something Wrong */
+        if (entry->cpu != -1 && entry->cpu != cpu) {
+            printk(KERN_ERR "MCE: vMSR Virtualization "
+                    "Data Filling Error\n");
+            return -1;
+        }
+        entry->cpu = cpu;
+
+        /* This is a new Node, insert to the tail of the per_dom data */
+        if (flag_new) {
+            printk(KERN_DEBUG "MCE: add new node for DOM%d\n", 
+                mc_bank->mc_domid);
+            list_add_tail(&entry->list, &d->arch.vmca_msrs.impact_header);
+        }
+
+        printk(KERN_DEBUG "MCE: Found error @[CPU%d BANK%d "
+                "status %lx addr %lx domid %d]\n ", entry->cpu, mc_bank->mc_bank,
+                mc_bank->mc_status, mc_bank->mc_addr, mc_bank->mc_domid);
+    }
+    return 0;
+}
+
+/* Filling vmce_data for:
+ * 1) Log down (array_idx, domain_id, impact_cpu_map) map for vMCE injection.
+ *    cpu_weight(impact_cpu_map) decides how many injections to the impacted
+ *    DOM are needed.
+ * 2) Copy MCE# info to global buffer, for DOM0 logging.
+ * 3) Copy MCE# info to impacted DOM, for vMCE# MSRs virtualization
+ */
+static int mce_actions(void) {
+    int32_t cpu, ret;
+    struct mc_info *local_mi, *global_mi;
+    struct mcinfo_common *mic = NULL;
+    struct mcinfo_global *mc_global;
+    struct mcinfo_bank *mc_bank;
+
+    /* Spinlock is used for exclusive read/write of vMSR virtualization
+     * (per_dom vMCE# data)
+     */
+    spin_lock(&mce_locks);
+
+    /* local buffer is shared between MCE handler and softirq.
+     * If softirq is filling this buffer while another MCE# comes,
+     * simply panic
+     */
+    test_and_set_bool(mc_local.in_use);
+
+    init_vmce_data();
+
+    for_each_cpu_mask(cpu, impact_map) {
+
+        local_mi = &mc_local.mc[cpu];
+        x86_mcinfo_lookup(mic, local_mi, MC_TYPE_GLOBAL);
+        if (mic == NULL) {
+            printk(KERN_ERR "MCE: get local buffer entry failed\n ");
+            ret = -1;
+		    goto end;
+        }
+
+        /* Copy local data to Global buffer for DOM0 LOG */
+        mc_global = (struct mcinfo_global *)mic;
+        global_mi = x86_mcinfo_getptr();
+        if (!global_mi) {
+            printk(KERN_ERR "MCE: Get global buffer entry failed\n");
+            ret = -1;
+            goto end;
+        }
+        x86_mcinfo_clear(global_mi);
+        x86_mcinfo_add(global_mi, mc_global);
+
+        /* Processing bank information */
+        x86_mcinfo_lookup(mic, local_mi, MC_TYPE_BANK);
+
+        for ( ; mic && mic->size; mic = x86_mcinfo_next(mic) ) {
+            if (mic->type != MC_TYPE_BANK) {
+                continue;
+            }
+            mc_bank = (struct mcinfo_bank*)mic;
+            /* Copy bank info to global buffer */
+            x86_mcinfo_add(global_mi, mc_bank);
+
+            /* Fill vMCE# injection and vMCE# MSR virtualization related data */
+            if (fill_vmsr_data(cpu, mc_bank, mc_global->mc_gstatus) == -1) {
+                ret = -1;
+                goto end;
+            }
+
+            /* TODO: Add recovery actions here, such as page-offline, etc */
+
+        }
+    } /* end of impact_map loop */
+
+    /* Successfully filled all local/global buffer */
+    ret = 0;
+
+end:
+    test_and_clear_bool(mc_local.in_use);
+    spin_unlock(&mce_locks);
+    return ret;
+}
+
+/* Softirq Handler for this MCE# processing */
+static void mce_softirq(void)
+{
+    int cpu = smp_processor_id(), idx;
+    cpumask_t affinity;
+    struct softirq_trap *st = NULL;
+
+    /* Wait until all cpus entered softirq */
+    while ( cpus_weight(mced_cpus) != num_online_cpus() ) {
+        cpu_relax();
+    }
+    /* Not Found worst error on severity_cpu, it's weird */
+    if (severity_cpu == -1) {
+        printk(KERN_WARNING "MCE: not found severity_cpu!\n");
+        mc_panic("MCE: not found severity_cpu!");
+        return;
+    }
+    /* We choose severity_cpu for further processing */
+    if (severity_cpu == cpu) {
+
+        /* Step1: Fill DOM0 LOG buffer, vMCE injection buffer and
+         * vMCE MSRs virtualization buffer
+         */
+        if (mce_actions())
+            mc_panic("MCE recovery actions or Filling vMCE MSRS "
+			    "virtualization data failed!\n");
+
+        /* Step2: Send Log to DOM0 through vIRQ */
+        if (dom0 && guest_enabled_event(dom0->vcpu[0], VIRQ_MCA)) {
+            printk(KERN_DEBUG "MCE: send MCE# to DOM0 through virq\n");
+            send_guest_global_virq(dom0, VIRQ_MCA);
+        }
+
+        /* Step3: Inject vMCE to impacted DOM. Currently we cares DOM0 only */
+        for (idx = 0; idx < MAX_IMPACT_DOMAIN; idx++) {
+
+         /* Found errors impacting DOM0, bind this DOM0.vCPU0 to this pCPU */
+            if ( vmce_data.impact_domid[idx] == 0 )
+            {
+                st = &per_cpu(mce_softirq_trap, cpu);
+                st->domain = dom0;
+                st->vcpu = dom0->vcpu[0];
+                st->processor = st->vcpu->processor;
+                break;
+            }
+        }
+        if (idx < MAX_IMPACT_DOMAIN &&
+            guest_has_trap_callback
+                (st->domain, st->vcpu->vcpu_id, TRAP_machine_check)) {
+            /* inject vMCE into DOM0 cpu_weight(impact_map) times */
+            if (st && st->vcpu && !test_and_set_bool(st->vcpu->mce_pending)) {
+
+                st->vcpu->cpu_affinity_tmp = st->vcpu->cpu_affinity;
+                if (cpu != st->processor
+                    || (st->processor != st->vcpu->processor)){
+                    /* We're on the different physical cpu. Make
+                     * sure to wakeup the vcpu on the specified
+                     * processor */
+                     cpus_clear(affinity);
+                     cpu_set(cpu, affinity);
+                     printk(KERN_DEBUG "MCE: CPU%d set affinity\n", cpu);
+                     vcpu_set_affinity(st->vcpu, &affinity);
+                     /* Affinity is restored in the iRET hypercall */
+                }
+               vcpu_kick(st->vcpu);
+            }
+        }
+
+
+        /* Clean Data */
+        test_and_clear_bool(mce_process_lock);
+        cpus_clear(impact_map);
+        cpus_clear(scanned_cpus);
+        worst = 0;
+        cpus_clear(mced_cpus);
+        memset(&mc_local, 0x0, sizeof(mc_local));
+    }
+
+    cpu_set(cpu, finished_cpus);
+    wmb();
+   /* Leave until all cpus finished recovery actions in softirq */
+    while ( cpus_weight(finished_cpus) != num_online_cpus() ) {
+        cpu_relax();
+    }
+
+    cpus_clear(finished_cpus);
+    severity_cpu = -1;
+    printk(KERN_DEBUG "CPU%d exit softirq \n", cpu);
+}
+
+/* Machine Check owner judge algorithm:
+ * When error happens, all cpus serially read its msr banks.
+ * The first CPU who fetches the error bank's info will clear
+ * this bank. Later readers can't get any info again.
+ * The first CPU is the actual mce_owner
+ *
+ * For Fatal (pcc=1) error, it might cause machine crash
+ * before we're able to log. For avoiding log missing, we adopt two
+ * round scanning:
+ * Round1: simply scan. If found pcc = 1 or ripv = 0, simply reset.
+ * All MCE banks are sticky, when boot up, MCE polling mechanism
+ * will help to collect and log those MCE errors.
+ * Round2: Do all MCE processing logic as normal.
+ */
+
+/* Simple Scan. Panic when found non-recovery errors. Doing this for
+ * avoiding LOG missing
+ */
+static void severity_scan(void)
+{
+    uint64_t status;
+    int32_t i;
+
+    /* TODO: For PCC = 0, we need further judgement. If the error can't be
+     * recovered, we need to RESET to avoid losing the DOM0 LOG
+     */
+    for ( i = 0; i < nr_mce_banks; i++) {
+        rdmsrl(MSR_IA32_MC0_STATUS + 4 * i , status);
+        if ( !(status & MCi_STATUS_VAL) )
+            continue;
+        /* MCE handler only handles UC error */
+        if ( !(status & MCi_STATUS_UC) )
+            continue;
+        if ( !(status & MCi_STATUS_EN) )
+            continue;
+        if (status & MCi_STATUS_PCC)
+            mc_panic("pcc = 1, cpu unable to continue\n");
+    }
+
+    /* TODO: Further judgement here, maybe we need MCACOD assistance */
+    /* EIPV and RIPV is not a reliable way to judge the error severity */
+
+}
 static fastcall void intel_machine_check(struct cpu_user_regs * regs, long error_code)
 {
-    /* MACHINE CHECK Error handler will be sent in another patch,
-     * simply copy old solutions here. This code will be replaced
-     * by upcoming machine check patches
-     */
+    unsigned int cpu = smp_processor_id();
+    struct mc_info *mi;
+    struct mcinfo_global mcg;
+    struct mcinfo_extended mce;
+    uint64_t status;
+    int32_t uc = 0, pcc = 0, nr_unit = 0, severity = 0, i;
+    struct domain *d;
 
-    int recover=1;
-    u32 alow, ahigh, high, low;
-    u32 mcgstl, mcgsth;
-    int i;
-   
-    rdmsr(MSR_IA32_MCG_STATUS, mcgstl, mcgsth);
-    if (mcgstl & (1<<0))       /* Recoverable ? */
-        recover=0;
-    
-    printk(KERN_EMERG "CPU %d: Machine Check Exception: %08x%08x\n",
-           smp_processor_id(), mcgsth, mcgstl);
-    
-    for (i=0; i<nr_mce_banks; i++) {
-        rdmsr (MSR_IA32_MC0_STATUS+i*4,low, high);
-        if (high & (1<<31)) {
-            if (high & (1<<29))
-                recover |= 1;
-            if (high & (1<<25))
-                recover |= 2;
-            printk (KERN_EMERG "Bank %d: %08x%08x", i, high, low);
-            high &= ~(1<<31);
-            if (high & (1<<27)) {
-                rdmsr (MSR_IA32_MC0_MISC+i*4, alow, ahigh);
-                printk ("[%08x%08x]", ahigh, alow);
-            }
-            if (high & (1<<26)) {
-                rdmsr (MSR_IA32_MC0_ADDR+i*4, alow, ahigh);
-                printk (" at %08x%08x", ahigh, alow);
-            }
-            printk ("\n");
+    /* First round scanning */
+    severity_scan();
+    cpu_set(cpu, scanned_cpus);
+    while (cpus_weight(scanned_cpus) < num_online_cpus())
+        cpu_relax();
+
+    wmb();
+    /* All CPUs Finished first round scanning */
+    if (mc_local.in_use != 0) {
+        mc_panic("MCE: Local buffer is being processed, can't handle new MCE!\n");
+        return;
+    }
+
+     /* Fill local data, let softirq to processing the local data */
+    mi = &mc_local.mc[cpu];
+    if (!mi) {
+        printk(KERN_ERR "MCE: Get mc_info entry failed\n");
+        mc_panic("MCE: Failed to get local buffer entry\n");
+        return;
+    }
+
+    x86_mcinfo_clear(mi);
+    memset(&mcg, 0, sizeof(mcg));
+    mcg.common.type = MC_TYPE_GLOBAL;
+    mcg.common.size = sizeof(mcg);
+
+   /* domid should be per_bank data */
+    mcg.mc_domid = -1;
+    mcg.mc_vcpuid = -1;
+    mcg.mc_flags = MC_FLAG_MCE;
+    mcg.mc_socketid = phys_proc_id[cpu];
+    mcg.mc_coreid = cpu_core_id[cpu];
+    mcg.mc_apicid = cpu_physical_id(cpu);
+    mcg.mc_core_threadid =
+        mcg.mc_apicid & ( 1 << (cpu_data[cpu].x86_num_siblings - 1));
+    rdmsrl(MSR_IA32_MCG_STATUS, mcg.mc_gstatus);
+
+    /* Enter Critical Section */
+    while (test_and_set_bool(mce_enter_lock)) {
+        udelay (1);
+    }
+
+    for ( i = 0; i < nr_mce_banks; i++) {
+        struct mcinfo_bank mcb;
+
+        memset(&mcb, 0, sizeof(mcb));
+        rdmsrl(MSR_IA32_MC0_STATUS + 4 * i , status);
+        if ( !(status & MCi_STATUS_VAL) )
+            continue;
+        /*MCE handler only deals with UC error*/
+        if ( !(status & MCi_STATUS_UC) )
+            continue;
+        uc = 1;
+        add_taint(TAINT_MACHINE_CHECK);
+        /* The bank found UC, but machine check event is
+         * not enabled. Skip and let polling deal with it.
+        */
+        if ( !(status & MCi_STATUS_EN) )
+            continue;
+        if (status & MCi_STATUS_PCC)
+            pcc = 1;
+        memset(&mcb, 0, sizeof(mcb));
+        mcb.common.type = MC_TYPE_BANK;
+        mcb.common.size = sizeof(mcb);
+        mcb.mc_bank = i;
+        mcb.mc_status = status;
+
+        if (status & MCi_STATUS_MISCV)
+            rdmsrl(MSR_IA32_MC0_MISC + 4 * i, mcb.mc_misc);
+        if (status & MCi_STATUS_ADDRV) {
+            rdmsrl(MSR_IA32_MC0_ADDR + 4 * i, mcb.mc_addr);
+
+            /* TODO: This is not the correct way. We temporarily keep it.
+             * We'll improve it later
+             */
+            d = maddr_get_owner(mcb.mc_addr);
+            if (d)
+                mcb.mc_domid = d->domain_id;
         }
+        rdtscll(mcb.mc_tsc);
+        x86_mcinfo_add(mi, &mcb);
+        nr_unit++;
+        /* Clear state for this bank, this CPU will be this MCE error owner */
+        wrmsrl(MSR_IA32_MC0_STATUS + 4 * i, 0);
+        printk(KERN_DEBUG "MCE: bank%i CPU%d status[%"PRIx64"]\n", 
+                i, cpu, status);
+        printk(KERN_DEBUG "MCE: SOCKET%d, CORE%d, APICID[%d], "
+                "thread[%d]\n", mcg.mc_socketid, 
+                mcg.mc_coreid, mcg.mc_apicid, mcg.mc_core_threadid);
     }
-    
-    if (recover & 2)
-        mc_panic ("CPU context corrupt");
-    if (recover & 1)
-        mc_panic ("Unable to continue");
-    
-    printk(KERN_EMERG "Attempting to continue.\n");
-    /* 
-     * Do not clear the MSR_IA32_MCi_STATUS if the error is not 
-     * recoverable/continuable.This will allow BIOS to look at the MSRs
-     * for errors if the OS could not log the error.
-     */
-    for (i=0; i<nr_mce_banks; i++) {
-        u32 msr;
-        msr = MSR_IA32_MC0_STATUS+i*4;
-        rdmsr (msr, low, high);
-        if (high&(1<<31)) {
-            /* Clear it */
-            wrmsr(msr, 0UL, 0UL);
-            /* Serialize */
-            wmb();
-            add_taint(TAINT_MACHINE_CHECK);
+
+    if (nr_unit && nr_intel_ext_msrs && 
+                    (mcg.mc_gstatus & MCG_STATUS_EIPV)) {
+        printk(KERN_DEBUG "MCE: found extension MCE MSRs\n");
+        intel_get_extended_msrs(&mce);
+        x86_mcinfo_add(mi, &mce);
+    }
+    if (!nr_unit) {
+        /* Not offending CPU, goto softirq directly */
+        cpu_set(cpu, mced_cpus);
+        test_and_clear_bool(mce_enter_lock);
+        raise_softirq(MACHINE_CHECK_SOFTIRQ);
+        return;
+    }
+
+    if (pcc) {
+        printk(KERN_WARNING "PCC=1 should have caused reset\n");
+        mcg.mc_flags |= MC_FLAG_UNCORRECTABLE;
+        severity = 3;
+    }
+    else if (uc) {
+        mcg.mc_flags |= MC_FLAG_RECOVERABLE;
+        severity = 2;
+    }
+    else {
+        printk(KERN_WARNING "We should skip Correctable Error\n");
+        severity = 1; 
+    }
+    /* This is the offending cpu! */
+    cpu_set(cpu, impact_map);
+
+    x86_mcinfo_add(mi, &mcg);
+    if ( severity > worst) {
+        worst = severity;
+        /* This CPU found more severe error! */
+        severity_cpu = cpu;
+    }
+    cpu_set(cpu, mced_cpus);
+    test_and_clear_bool(mce_enter_lock);
+    wmb();
+
+    /* Wait for all cpus Leave Critical */
+    while (cpus_weight(mced_cpus) < num_online_cpus())
+        cpu_relax();
+    /* Print MCE error */
+    x86_mcinfo_dump(mi);
+
+    /* Pick one CPU to clear MCIP */
+    if (!test_and_set_bool(mce_process_lock)) {
+        wrmsrl(MSR_IA32_MCG_STATUS, mcg.mc_gstatus & ~MCG_STATUS_MCIP);
+
+        if (worst >= 3) {
+            printk(KERN_WARNING "worst=3 should have caused RESET\n");
+            mc_panic("worst=3 should have caused RESET");
         }
+        else {
+            printk(KERN_DEBUG "MCE: trying to recover\n");
+        }
+
     }
-    mcgstl &= ~(1<<2);
-    wrmsr (MSR_IA32_MCG_STATUS,mcgstl, mcgsth);
+    raise_softirq(MACHINE_CHECK_SOFTIRQ);
 }
 
+
 static DEFINE_SPINLOCK(cmci_discover_lock);
 static DEFINE_PER_CPU(cpu_banks_t, no_cmci_banks);
 
@@ -488,8 +978,10 @@
     mi = machine_check_poll(MC_FLAG_CMCI);
     if (mi) {
         x86_mcinfo_dump(mi);
-        if (dom0 && guest_enabled_event(dom0->vcpu[0], VIRQ_MCA))
+        if (dom0 && guest_enabled_event(dom0->vcpu[0], VIRQ_MCA)) {
+            printk(KERN_DEBUG "MCE: send CMCI info to DOM0 through virq\n");
             send_guest_global_virq(dom0, VIRQ_MCA);
+        }
     }
     irq_exit();
 }
@@ -501,13 +993,18 @@
     intel_init_thermal(c);
 #endif
     intel_init_cmci(c);
+    init_vmce_data();
 }
 
+uint64_t g_mcg_cap;
 static void mce_cap_init(struct cpuinfo_x86 *c)
 {
     u32 l, h;
 
     rdmsr (MSR_IA32_MCG_CAP, l, h);
+    /* For Guest vMCE usage */
+    g_mcg_cap = ((u64)h << 32 | l) & (~MCG_CMCI_P);
+
     if ((l & MCG_CMCI_P) && cpu_has_apic)
         cmci_support = 1;
 
@@ -576,6 +1073,7 @@
     mce_init();
     mce_intel_feature_init(c);
     mce_set_owner();
+    open_softirq(MACHINE_CHECK_SOFTIRQ, mce_softirq);
 }
 
 /*
diff -r 2fe33f3403f5 xen/arch/x86/cpu/mcheck/x86_mca.h
--- a/xen/arch/x86/cpu/mcheck/x86_mca.h	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/arch/x86/cpu/mcheck/x86_mca.h	Mon Feb 16 19:02:16 2009 +0800
@@ -79,7 +79,6 @@
 #define CMCI_THRESHOLD			0x2
 
 
-#define MAX_NR_BANKS 128
 
 typedef DECLARE_BITMAP(cpu_banks_t, MAX_NR_BANKS);
 DECLARE_PER_CPU(cpu_banks_t, mce_banks_owned);
diff -r 2fe33f3403f5 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/arch/x86/domain.c	Mon Feb 16 19:02:16 2009 +0800
@@ -366,6 +366,7 @@
         hvm_vcpu_destroy(v);
 }
 
+extern uint64_t g_mcg_cap;
 int arch_domain_create(struct domain *d, unsigned int domcr_flags)
 {
 #ifdef __x86_64__
@@ -446,6 +447,15 @@
 
         if ( (rc = iommu_domain_init(d)) != 0 )
             goto fail;
+
+        /* For Guest vMCE MSRs virtualization */
+        d->arch.vmca_msrs.mcg_status = 0x0;
+        d->arch.vmca_msrs.mcg_cap = g_mcg_cap;
+        d->arch.vmca_msrs.mcg_ctl = (uint64_t)~0x0;
+        memset(d->arch.vmca_msrs.mci_ctl, 0x1,
+            sizeof(d->arch.vmca_msrs.mci_ctl));
+        INIT_LIST_HEAD(&d->arch.vmca_msrs.impact_header);
+
     }
 
     if ( is_hvm_domain(d) )
diff -r 2fe33f3403f5 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/arch/x86/traps.c	Mon Feb 16 19:02:16 2009 +0800
@@ -728,8 +728,6 @@
         if ( !opt_allow_hugepage )
             __clear_bit(X86_FEATURE_PSE, &d);
         __clear_bit(X86_FEATURE_PGE, &d);
-        __clear_bit(X86_FEATURE_MCE, &d);
-        __clear_bit(X86_FEATURE_MCA, &d);
         __clear_bit(X86_FEATURE_PSE36, &d);
     }
     switch ( (uint32_t)regs->eax )
diff -r 2fe33f3403f5 xen/arch/x86/x86_64/traps.c
--- a/xen/arch/x86/x86_64/traps.c	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/arch/x86/x86_64/traps.c	Mon Feb 16 19:02:16 2009 +0800
@@ -14,6 +14,8 @@
 #include <xen/nmi.h>
 #include <asm/current.h>
 #include <asm/flushtlb.h>
+#include <asm/traps.h>
+#include <asm/event.h>
 #include <asm/msr.h>
 #include <asm/page.h>
 #include <asm/shared.h>
@@ -260,11 +262,16 @@
 #endif
 }
 
+extern struct intel_vmce_inject vmce_data;
+DECLARE_PER_CPU(struct softirq_trap, mce_softirq_trap);
 unsigned long do_iret(void)
 {
     struct cpu_user_regs *regs = guest_cpu_user_regs();
     struct iret_context iret_saved;
     struct vcpu *v = current;
+    struct domain *d = v->domain;
+    struct bank_entry *entry;
+    int idx, cpu = smp_processor_id(), impact_cpu;
 
     if ( unlikely(copy_from_user(&iret_saved, (void *)regs->rsp,
                                  sizeof(iret_saved))) )
@@ -304,6 +311,64 @@
        && !cpus_equal(v->cpu_affinity_tmp, v->cpu_affinity))
         vcpu_set_affinity(v, &v->cpu_affinity_tmp);
 
+   /*Currently, only inject vMCE to DOM0.*/
+
+    if (v->trap_priority >= VCPU_TRAP_NMI) {
+        struct softirq_trap *st = &per_cpu(mce_softirq_trap, cpu);
+        for (idx = 0; idx < MAX_IMPACT_DOMAIN; idx++) {
+            if (vmce_data.impact_domid[idx] == 0) {
+                impact_cpu = first_cpu(vmce_data.impact_cpus[idx]);
+                if (impact_cpu < NR_CPUS) {
+                    cpu_clear(impact_cpu, vmce_data.impact_cpus[idx]);
+                    if (!list_empty(&d->arch.vmca_msrs.impact_header)) {
+                        entry = list_entry(d->arch.vmca_msrs.impact_header.next,
+                            struct bank_entry, list);
+                        printk(KERN_DEBUG "MCE: Delete last injection Node\n");
+                        list_del(&entry->list);
+                    }
+                    else {
+                        printk(KERN_DEBUG "MCE: Not found last injection "
+                        "Node, something Wrong!\n");
+                    }
+                }
+               if (cpus_weight(vmce_data.impact_cpus[idx]) <=0) {
+                   printk(KERN_DEBUG "MCE: All vMCEs are injected to DOM0\n");
+                   goto end;
+               }
+            }
+            break;
+        }
+
+        /* Inject another vMCE into DOM0. The first injection is done in
+         * the MCE# softirq handler; the remaining ones are injected
+         * serially from here.
+         */
+        if (idx < MAX_IMPACT_DOMAIN && st->domain && st->vcpu &&
+            guest_has_trap_callback(st->domain,
+                st->vcpu->vcpu_id, TRAP_machine_check)) {
+            if (!test_and_set_bool(st->vcpu->mce_pending)) {
+                st->vcpu->cpu_affinity_tmp = st->vcpu->cpu_affinity;
+                if (cpu != st->processor 
+                    || (st->processor != st->vcpu->processor)){
+                    cpumask_t affinity;
+                    /* We're on the different physical cpu. Make
+                     * sure to wakeup the vcpu on the specified
+                     * processor */
+                     cpus_clear(affinity);
+                     cpu_set(cpu, affinity);
+                     printk(KERN_DEBUG "MCE: CPU%d set affinity\n", cpu);
+                     vcpu_set_affinity(st->vcpu, &affinity);
+                     /* Affinity is restored in the iRET hypercall */
+                }
+                /*We need to use vMCE data when doing vMCE injection!
+                 * It will be cleared after the last injection is finished
+                */
+                vcpu_kick(st->vcpu);
+            }
+        }
+    } /* end of outer-if */
+
+end:
     /* Restore previous trap priority */
     v->trap_priority = v->old_trap_priority;
 
diff -r 2fe33f3403f5 xen/include/asm-x86/domain.h
--- a/xen/include/asm-x86/domain.h	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/include/asm-x86/domain.h	Mon Feb 16 19:02:16 2009 +0800
@@ -204,6 +204,29 @@
 
 struct p2m_domain;
 
+/* Define for GUEST MCA handling */
+#define MAX_NR_BANKS 128
+
+/* This entry is for recording bank nodes for the impacted domain,
+ * put into impact_header list. */
+struct bank_entry {
+    struct list_head list;
+    int32_t cpu;
+    uint64_t mci_status[MAX_NR_BANKS];
+    uint64_t mci_addr[MAX_NR_BANKS];
+    uint64_t mci_misc[MAX_NR_BANKS];
+};
+
+struct domain_mca_msrs
+{
+    /* Guest should not change below values after DOM boot up */
+    uint64_t mcg_cap;
+    uint64_t mcg_ctl;
+    uint64_t mcg_status;
+    uint64_t mci_ctl[MAX_NR_BANKS];
+    struct list_head impact_header;
+};
+
 struct arch_domain
 {
     l1_pgentry_t *mm_perdomain_pt;
@@ -268,6 +291,9 @@
     struct page_list_head relmem_list;
 
     cpuid_input_t cpuids[MAX_CPUID_INPUT];
+
+    /* For Guest vMCA handling */
+    struct domain_mca_msrs vmca_msrs;
 } __cacheline_aligned;
 
 #define has_arch_pdevs(d)    (!list_empty(&(d)->arch.pdev_list))
diff -r 2fe33f3403f5 xen/include/asm-x86/softirq.h
--- a/xen/include/asm-x86/softirq.h	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/include/asm-x86/softirq.h	Mon Feb 16 19:02:16 2009 +0800
@@ -5,6 +5,7 @@
 #define TIME_CALIBRATE_SOFTIRQ (NR_COMMON_SOFTIRQS + 1)
 #define VCPU_KICK_SOFTIRQ      (NR_COMMON_SOFTIRQS + 2)
 
-#define NR_ARCH_SOFTIRQS       3
+#define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
+#define NR_ARCH_SOFTIRQS       4
 
 #endif /* __ASM_SOFTIRQ_H__ */
diff -r 2fe33f3403f5 xen/include/asm-x86/traps.h
--- a/xen/include/asm-x86/traps.h	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/include/asm-x86/traps.h	Mon Feb 16 19:02:16 2009 +0800
@@ -20,6 +20,32 @@
 #ifndef ASM_TRAP_H
 #define ASM_TRAP_H
 
+/* No need to emulate CMCI related MSRs. */
+#define MAX_IMPACT_DOMAIN 5
+#define MAX_NR_BANKS 128
+
+/* Data structure for vMCE MSRs virtualization. When an MCA happens on
+ * physical CPUs, all machine MCA MSRs info will be copied to this
+ * data structure.
+ */
+
+
+struct intel_vmce_inject {
+
+    /* Map: (Index, impact_domid, impact_cpumap). The map records
+     * how many vMCE# we need inject to the impacted domain.
+     * If MCE# happened on more than one pCPUs (nr_CPUs), and impact
+     * the same domain, vMCEs will be injected to the impacted domain
+     * nr_CPUs times. If MCE# errors happened on the same CPU
+     * yet different banks, the vMCE will be injected only once
+    */
+
+    int32_t impact_domid[MAX_IMPACT_DOMAIN];
+    cpumask_t impact_cpus[MAX_IMPACT_DOMAIN];
+
+};
+
+
 struct softirq_trap {
 	struct domain *domain;  /* domain to inject trap */
 	struct vcpu *vcpu;	/* vcpu to inject trap */
diff -r 2fe33f3403f5 xen/include/public/arch-x86/xen-mca.h
--- a/xen/include/public/arch-x86/xen-mca.h	Fri Feb 13 18:00:22 2009 +0800
+++ b/xen/include/public/arch-x86/xen-mca.h	Mon Feb 16 19:02:16 2009 +0800
@@ -106,10 +106,11 @@
 
 #define MC_FLAG_CORRECTABLE     (1 << 0)
 #define MC_FLAG_UNCORRECTABLE   (1 << 1)
-#define MC_FLAG_RECOVERABLE	(1 << 2)
-#define MC_FLAG_POLLED		(1 << 3)
-#define MC_FLAG_RESET		(1 << 4)
-#define MC_FLAG_CMCI		(1 << 5)
+#define MC_FLAG_RECOVERABLE     (1 << 2)
+#define MC_FLAG_POLLED          (1 << 3)
+#define MC_FLAG_RESET           (1 << 4)
+#define MC_FLAG_CMCI            (1 << 5)
+#define MC_FLAG_MCE             (1 << 6)
 /* contains global x86 mc information */
 struct mcinfo_global {
     struct mcinfo_common common;
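
A side note on the mci_ctl initialisation in the xen/arch/x86/domain.c hunk of this patch: memset() fills memory one byte at a time, so the fill value determines every byte of each uint64_t element rather than the element as a whole. The standalone sketch below shows what a given fill byte leaves in a 64-bit element. NR_BANKS and fill_and_read() are illustrative names, not part of the patch.

```c
#include <stdint.h>
#include <string.h>

#define NR_BANKS 4  /* illustrative size, not the patch's MAX_NR_BANKS */

/* Fill an array of 64-bit control words byte-wise and return element 0. */
static uint64_t fill_and_read(int byte_value)
{
    uint64_t ctl[NR_BANKS];

    memset(ctl, byte_value, sizeof(ctl));
    return ctl[0];
}
```

A fill byte of 0x1 yields 0x0101010101010101 per element, while 0xff yields all ones. Only the latter matches the "all 0s or all 1s" values that the wrmsr emulation in the vmsr_virtualization patch accepts for MCi_CTL.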

[-- Attachment #4: vmsr_virtualization.patch --]
[-- Type: application/octet-stream, Size: 13340 bytes --]

diff -r 179b7b3d7f84 xen/arch/x86/cpu/mcheck/mce_intel.c
--- a/xen/arch/x86/cpu/mcheck/mce_intel.c	Mon Feb 16 19:04:25 2009 +0800
+++ b/xen/arch/x86/cpu/mcheck/mce_intel.c	Mon Feb 16 19:12:20 2009 +0800
@@ -1132,3 +1132,254 @@
     set_timer(&mce_timer, NOW() + MILLISECS(MCE_PERIOD));
 }
 
+/* Guest vMCE# MSRs virtualization ops (rdmsr/wrmsr) */
+int intel_mce_wrmsr(u32 msr, u32 lo, u32 hi)
+{
+    struct domain *d = current->domain;
+    struct bank_entry *entry = NULL;
+    uint64_t value = (u64)hi << 32 | lo;
+    int ret = 0;
+
+    spin_lock(&mce_locks);
+    switch(msr)
+    {
+        case MSR_IA32_MCG_CTL:
+            if (value != (u64)~0x0 && value != 0x0) {
+                printk(KERN_ERR "MCE: value written to MCG_CTL "
+                    "should be all 0s or 1s\n");
+                ret = -1;
+                break;
+            }
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: wrmsr not in DOM context, skip\n");
+                break;
+            }
+            d->arch.vmca_msrs.mcg_ctl = value;
+            break;
+        case MSR_IA32_MCG_STATUS:
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: wrmsr not in DOM context, skip\n");
+                break;
+            }
+            d->arch.vmca_msrs.mcg_status = value;
+            printk(KERN_DEBUG "MCE: wrmsr MCG_STATUS %lx\n", value);
+            break;
+        case MSR_IA32_MC0_CTL2:
+        case MSR_IA32_MC1_CTL2:
+        case MSR_IA32_MC2_CTL2:
+        case MSR_IA32_MC3_CTL2:
+        case MSR_IA32_MC4_CTL2:
+        case MSR_IA32_MC5_CTL2:
+        case MSR_IA32_MC6_CTL2:
+        case MSR_IA32_MC7_CTL2:
+        case MSR_IA32_MC8_CTL2:
+            printk(KERN_ERR "We have disabled CMCI capability, "
+                    "Guest should not write this MSR!\n");
+            break;
+        case MSR_IA32_MC0_CTL:
+        case MSR_IA32_MC1_CTL:
+        case MSR_IA32_MC2_CTL:
+        case MSR_IA32_MC3_CTL:
+        case MSR_IA32_MC4_CTL:
+        case MSR_IA32_MC5_CTL:
+        case MSR_IA32_MC6_CTL:
+        case MSR_IA32_MC7_CTL:
+        case MSR_IA32_MC8_CTL:
+            if (value != (u64)~0x0 && value != 0x0) {
+                printk(KERN_ERR "MCE: value written to MCi_CTL "
+                    "should be all 0s or 1s\n");
+                ret = -1;
+                break;
+            }
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: wrmsr not in DOM context, skip\n");
+                break;
+            }
+            d->arch.vmca_msrs.mci_ctl[(msr - MSR_IA32_MC0_CTL)/4] = value;
+            break;
+        case MSR_IA32_MC0_STATUS:
+        case MSR_IA32_MC1_STATUS:
+        case MSR_IA32_MC2_STATUS:
+        case MSR_IA32_MC3_STATUS:
+        case MSR_IA32_MC4_STATUS:
+        case MSR_IA32_MC5_STATUS:
+        case MSR_IA32_MC6_STATUS:
+        case MSR_IA32_MC7_STATUS:
+        case MSR_IA32_MC8_STATUS:
+            if (!d || is_idle_domain(d)) {
+                /* Just skip */
+                printk(KERN_ERR "mce wrmsr: not in domain context!\n");
+                break;
+            }
+            /* Use the first entry of the list; it corresponds to the
+             * current vMCE# injection. When the guest has finished
+             * processing the vMCE#, this node will be deleted.
+             */
+            if (!list_empty(&d->arch.vmca_msrs.impact_header)) {
+                entry = list_entry(d->arch.vmca_msrs.impact_header.next,
+                    struct bank_entry, list);
+                entry->mci_status[(msr - MSR_IA32_MC0_STATUS)/4] = value;
+                printk(KERN_DEBUG "MCE: wrmsr mci_status in vMCE# context\n");
+            }
+
+            printk(KERN_DEBUG "MCE: wrmsr mci_status val:%lx\n", value);
+            break;
+    }
+    spin_unlock(&mce_locks);
+    return ret;
+}
+
+int intel_mce_rdmsr(u32 msr, u32 *lo, u32 *hi)
+{
+    struct domain *d = current->domain;
+    int ret = 0;
+    struct bank_entry *entry = NULL;
+
+    spin_lock(&mce_locks);
+    switch(msr) 
+    {
+        case MSR_IA32_MCG_STATUS:
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: rdmsr not in domain context!\n");
+                *lo = *hi = 0x0;
+                break;
+            }
+            *lo = (u32)d->arch.vmca_msrs.mcg_status;
+            *hi = (u32)(d->arch.vmca_msrs.mcg_status >> 32);
+            printk(KERN_DEBUG "MCE: rd MCG_STATUS lo %x hi %x\n", *lo, *hi);
+            break;
+        case MSR_IA32_MCG_CAP:
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: rdmsr not in domain context!\n");
+                *lo = *hi = 0x0;
+                break;
+            }
+            *lo = (u32)d->arch.vmca_msrs.mcg_cap;
+            *hi = (u32)(d->arch.vmca_msrs.mcg_cap >> 32);
+            printk(KERN_DEBUG "MCE: rdmsr MCG_CAP lo %x hi %x\n", *lo, *hi);
+            break;
+        case MSR_IA32_MCG_CTL:
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: rdmsr not in domain context!\n");
+                *lo = *hi = 0x0;
+                break;
+            }
+            *lo = (u32)d->arch.vmca_msrs.mcg_ctl;
+            *hi = (u32)(d->arch.vmca_msrs.mcg_ctl >> 32);
+            printk(KERN_DEBUG "MCE: rdmsr MCG_CTL lo %x hi %x\n", *lo, *hi);
+            break;
+        case MSR_IA32_MC0_CTL2:
+        case MSR_IA32_MC1_CTL2:
+        case MSR_IA32_MC2_CTL2:
+        case MSR_IA32_MC3_CTL2:
+        case MSR_IA32_MC4_CTL2:
+        case MSR_IA32_MC5_CTL2:
+        case MSR_IA32_MC6_CTL2:
+        case MSR_IA32_MC7_CTL2:
+        case MSR_IA32_MC8_CTL2:
+            printk(KERN_WARNING "We have disabled CMCI capability, "
+                    "Guest should not read this MSR!\n");
+            *lo = *hi = 0x0;
+            break;
+        case MSR_IA32_MC0_CTL:
+        case MSR_IA32_MC1_CTL:
+        case MSR_IA32_MC2_CTL:
+        case MSR_IA32_MC3_CTL:
+        case MSR_IA32_MC4_CTL:
+        case MSR_IA32_MC5_CTL:
+        case MSR_IA32_MC6_CTL:
+        case MSR_IA32_MC7_CTL:
+        case MSR_IA32_MC8_CTL:
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: rdmsr not in domain context!\n");
+                *lo = *hi = 0x0;
+                break;
+            }
+            *lo = (u32)d->arch.vmca_msrs.mci_ctl[(msr - MSR_IA32_MC0_CTL)/4];
+            *hi =
+                (u32)(d->arch.vmca_msrs.mci_ctl[(msr - MSR_IA32_MC0_CTL)/4]
+                    >> 32);
+            printk(KERN_DEBUG "MCE: rdmsr MCi_CTL lo %x hi %x\n", *lo, *hi);
+            break;
+        case MSR_IA32_MC0_STATUS:
+        case MSR_IA32_MC1_STATUS:
+        case MSR_IA32_MC2_STATUS:
+        case MSR_IA32_MC3_STATUS:
+        case MSR_IA32_MC4_STATUS:
+        case MSR_IA32_MC5_STATUS:
+        case MSR_IA32_MC6_STATUS:
+        case MSR_IA32_MC7_STATUS:
+        case MSR_IA32_MC8_STATUS:
+            *lo = *hi = 0x0;
+            printk(KERN_DEBUG "MCE: rdmsr mci_status\n");
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "mce_rdmsr: not in domain context!\n");
+                break;
+            }
+            if (!list_empty(&d->arch.vmca_msrs.impact_header)) {
+                entry = list_entry(d->arch.vmca_msrs.impact_header.next,
+                    struct bank_entry, list);
+                *lo = entry->mci_status[(msr - MSR_IA32_MC0_STATUS)/4];
+                *hi = entry->mci_status[(msr - MSR_IA32_MC0_STATUS)/4] >> 32;
+
+                printk(KERN_DEBUG "MCE: rdmsr MCi_STATUS in vmCE# context "
+                    "lo %x hi %x\n", *lo, *hi);
+            }
+            break;
+        case MSR_IA32_MC0_ADDR:
+        case MSR_IA32_MC1_ADDR:
+        case MSR_IA32_MC2_ADDR:
+        case MSR_IA32_MC3_ADDR:
+        case MSR_IA32_MC4_ADDR:
+        case MSR_IA32_MC5_ADDR:
+        case MSR_IA32_MC6_ADDR:
+        case MSR_IA32_MC7_ADDR:
+        case MSR_IA32_MC8_ADDR:
+            *lo = *hi = 0x0;
+
+            printk(KERN_DEBUG "MCE: rdmsr mci_addr\n");
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "mce_rdmsr: not in domain context!\n");
+                break;
+            }
+            if (!list_empty(&d->arch.vmca_msrs.impact_header)) {
+                entry = list_entry(d->arch.vmca_msrs.impact_header.next,
+                    struct bank_entry, list);
+                *lo = entry->mci_addr[(msr - MSR_IA32_MC0_ADDR)/4];
+                *hi = entry->mci_addr[(msr - MSR_IA32_MC0_ADDR)/4] >> 32;
+                printk(KERN_DEBUG "MCE: rdmsr MCi_ADDR in vMCE# context "
+                    "lo %x hi %x\n", *lo, *hi);
+            }
+            break;
+        case MSR_IA32_MC0_MISC:
+        case MSR_IA32_MC1_MISC:
+        case MSR_IA32_MC2_MISC:
+        case MSR_IA32_MC3_MISC:
+        case MSR_IA32_MC4_MISC:
+        case MSR_IA32_MC5_MISC:
+        case MSR_IA32_MC6_MISC:
+        case MSR_IA32_MC7_MISC:
+        case MSR_IA32_MC8_MISC:
+            *lo = *hi = 0x0;
+            printk(KERN_DEBUG "MCE: rdmsr mci_misc\n");
+            if (!d || is_idle_domain(d)) {
+                printk(KERN_ERR "MCE: rdmsr not in domain context!\n");
+                break;
+            }
+            if (!list_empty(&d->arch.vmca_msrs.impact_header)) {
+                entry = list_entry(d->arch.vmca_msrs.impact_header.next,
+                    struct bank_entry, list);
+                *lo = entry->mci_misc[(msr - MSR_IA32_MC0_MISC)/4];
+                *hi = entry->mci_misc[(msr - MSR_IA32_MC0_MISC)/4] >> 32;
+
+                printk(KERN_DEBUG "MCE: rdmsr MCi_MISC in vMCE# context "
+                    " lo %x hi %x\n", *lo, *hi);
+            }
+            break;
+        default:
+            break;
+    }
+    spin_unlock(&mce_locks);
+    return ret;
+}
+
diff -r 179b7b3d7f84 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c	Mon Feb 16 19:04:25 2009 +0800
+++ b/xen/arch/x86/traps.c	Mon Feb 16 19:12:20 2009 +0800
@@ -1636,6 +1636,10 @@
             (d->domain_id == 0));
 }
 
+/*Intel vMCE MSRs virtualization*/
+extern int intel_mce_wrmsr(u32 msr, u32 lo,  u32 hi);
+extern int intel_mce_rdmsr(u32 msr, u32 *lo,  u32 *hi);
+
 static int emulate_privileged_op(struct cpu_user_regs *regs)
 {
     struct vcpu *v = current;
@@ -2196,6 +2200,15 @@
         default:
             if ( wrmsr_hypervisor_regs(regs->ecx, eax, edx) )
                 break;
+            if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+                if ( intel_mce_wrmsr(regs->ecx, eax, edx) != 0) {
+                    gdprintk(XENLOG_ERR, "MCE: vMCE MSRS(%lx) Write"
+                        " (%x:%x) Fails! ", regs->ecx, edx, eax);
+                    goto fail;
+                }
+                break;
+            }
+ 
             if ( (rdmsr_safe(regs->ecx, l, h) != 0) ||
                  (eax != l) || (edx != h) )
         invalid:
@@ -2279,6 +2292,12 @@
                         _p(regs->ecx));*/
             if ( rdmsr_safe(regs->ecx, regs->eax, regs->edx) )
                 goto fail;
+
+            if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+                if ( intel_mce_rdmsr(regs->ecx, &eax, &edx) != 0)
+                    printk(KERN_ERR "MCE: Not MCE MSRs %lx\n", regs->ecx);
+            }
+
             break;
         }
         break;
diff -r 179b7b3d7f84 xen/include/asm-x86/msr-index.h
--- a/xen/include/asm-x86/msr-index.h	Mon Feb 16 19:04:25 2009 +0800
+++ b/xen/include/asm-x86/msr-index.h	Mon Feb 16 19:12:20 2009 +0800
@@ -96,30 +96,54 @@
 #define CMCI_EN 			(1UL<<30)
 #define CMCI_THRESHOLD_MASK		0x7FFF
 
+#define MSR_IA32_MC1_CTL		0x00000404
+#define MSR_IA32_MC1_CTL2		0x00000281
 #define MSR_IA32_MC1_STATUS		0x00000405
 #define MSR_IA32_MC1_ADDR		0x00000406
 #define MSR_IA32_MC1_MISC		0x00000407
 
 #define MSR_IA32_MC2_CTL		0x00000408
+#define MSR_IA32_MC2_CTL2		0x00000282
 #define MSR_IA32_MC2_STATUS		0x00000409
 #define MSR_IA32_MC2_ADDR		0x0000040A
 #define MSR_IA32_MC2_MISC		0x0000040B
 
+#define MSR_IA32_MC3_CTL2		0x00000283
 #define MSR_IA32_MC3_CTL		0x0000040C
 #define MSR_IA32_MC3_STATUS		0x0000040D
 #define MSR_IA32_MC3_ADDR		0x0000040E
 #define MSR_IA32_MC3_MISC		0x0000040F
 
+#define MSR_IA32_MC4_CTL2		0x00000284
 #define MSR_IA32_MC4_CTL		0x00000410
 #define MSR_IA32_MC4_STATUS		0x00000411
 #define MSR_IA32_MC4_ADDR		0x00000412
 #define MSR_IA32_MC4_MISC		0x00000413
 
+#define MSR_IA32_MC5_CTL2		0x00000285
 #define MSR_IA32_MC5_CTL		0x00000414
 #define MSR_IA32_MC5_STATUS		0x00000415
 #define MSR_IA32_MC5_ADDR		0x00000416
 #define MSR_IA32_MC5_MISC		0x00000417
 
+#define MSR_IA32_MC6_CTL2		0x00000286
+#define MSR_IA32_MC6_CTL		0x00000418
+#define MSR_IA32_MC6_STATUS		0x00000419
+#define MSR_IA32_MC6_ADDR		0x0000041A
+#define MSR_IA32_MC6_MISC		0x0000041B
+
+#define MSR_IA32_MC7_CTL2		0x00000287
+#define MSR_IA32_MC7_CTL		0x0000041C
+#define MSR_IA32_MC7_STATUS		0x0000041D
+#define MSR_IA32_MC7_ADDR		0x0000041E
+#define MSR_IA32_MC7_MISC		0x0000041F
+
+#define MSR_IA32_MC8_CTL2		0x00000288
+#define MSR_IA32_MC8_CTL		0x00000420
+#define MSR_IA32_MC8_STATUS		0x00000421
+#define MSR_IA32_MC8_ADDR		0x00000422
+#define MSR_IA32_MC8_MISC		0x00000423
+
 #define MSR_P6_PERFCTR0			0x000000c1
 #define MSR_P6_PERFCTR1			0x000000c2
 #define MSR_P6_EVNTSEL0			0x00000186
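
The long case lists in intel_mce_rdmsr()/intel_mce_wrmsr() rely on the per-bank MSRs being laid out four apart starting at MC0_CTL (0x400), which is why the handlers index with (msr - MSR_IA32_MC0_CTL)/4. Below is a minimal sketch of that arithmetic. The helper names and the enum are hypothetical, and the 0x400 base is assumed from the defines in this header.

```c
#include <stdint.h>

#define MC0_CTL_BASE 0x400u  /* value of MSR_IA32_MC0_CTL */

/* Position of a register inside one bank's group of four MSRs. */
enum mc_reg { MC_REG_CTL = 0, MC_REG_STATUS = 1, MC_REG_ADDR = 2, MC_REG_MISC = 3 };

/* Bank number for a per-bank MCA MSR, or -1 if it lies outside nr_banks banks. */
static int mc_msr_bank(uint32_t msr, unsigned int nr_banks)
{
    if (msr < MC0_CTL_BASE || msr >= MC0_CTL_BASE + 4u * nr_banks)
        return -1;
    return (int)((msr - MC0_CTL_BASE) / 4u);
}

/* Which of CTL/STATUS/ADDR/MISC the MSR is; only valid if mc_msr_bank() >= 0. */
static enum mc_reg mc_msr_reg(uint32_t msr)
{
    return (enum mc_reg)((msr - MC0_CTL_BASE) % 4u);
}
```

For example, 0x405 decodes to bank 1, STATUS. The MCi_CTL2 registers (0x280 + i) sit in a separate range and are not covered by this arithmetic.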

[-- Attachment #5: mce_dom0.patch --]
[-- Type: application/octet-stream, Size: 7354 bytes --]

diff -r ca8ac5fc168c arch/x86_64/Kconfig
--- a/arch/x86_64/Kconfig	Fri Feb 13 18:08:59 2009 +0800
+++ b/arch/x86_64/Kconfig	Mon Feb 16 21:30:34 2009 +0800
@@ -472,7 +472,6 @@
 
 config X86_MCE
 	bool "Machine check support" if EMBEDDED
-	depends on !X86_64_XEN
 	default y
 	help
 	   Include a machine check error handler to report hardware errors.
@@ -483,7 +482,7 @@
 config X86_MCE_INTEL
 	bool "Intel MCE features"
 	depends on X86_MCE && X86_LOCAL_APIC
-	default y
+	default n
 	help
 	   Additional support for intel specific MCE features such as
 	   the thermal monitor.
@@ -491,7 +490,7 @@
 config X86_MCE_AMD
 	bool "AMD MCE features"
 	depends on X86_MCE && X86_LOCAL_APIC
-	default y
+	default n
 	help
 	   Additional support for AMD specific MCE features such as
 	   the DRAM Error Threshold.
diff -r ca8ac5fc168c arch/x86_64/kernel/Makefile
--- a/arch/x86_64/kernel/Makefile	Fri Feb 13 18:08:59 2009 +0800
+++ b/arch/x86_64/kernel/Makefile	Mon Feb 16 21:30:34 2009 +0800
@@ -13,6 +13,7 @@
 obj-$(CONFIG_STACKTRACE)	+= stacktrace.o
 obj-$(CONFIG_X86_MCE)         += mce.o
 obj-$(CONFIG_X86_MCE_INTEL)	+= mce_intel.o
+obj-$(CONFIG_X86_MCE_INTEL)	+= mce_dom0.o
 obj-$(CONFIG_X86_MCE_AMD)	+= mce_amd.o
 obj-$(CONFIG_MTRR)		+= ../../i386/kernel/cpu/mtrr/
 obj-$(CONFIG_ACPI)		+= acpi/
diff -r ca8ac5fc168c arch/x86_64/kernel/entry-xen.S
--- a/arch/x86_64/kernel/entry-xen.S	Fri Feb 13 18:08:59 2009 +0800
+++ b/arch/x86_64/kernel/entry-xen.S	Mon Feb 16 21:30:34 2009 +0800
@@ -1258,13 +1258,8 @@
 
 #ifdef CONFIG_X86_MCE
 	/* runs on exception stack */
-ENTRY(machine_check)
-	INTR_FRAME
-	pushq $0
-	CFI_ADJUST_CFA_OFFSET 8	
-	paranoidentry do_machine_check
-	jmp paranoid_exit1
-	CFI_ENDPROC
+KPROBE_ENTRY(machine_check)
+	zeroentry do_machine_check
 END(machine_check)
 #endif
 
diff -r ca8ac5fc168c arch/x86_64/kernel/mce.c
--- a/arch/x86_64/kernel/mce.c	Fri Feb 13 18:08:59 2009 +0800
+++ b/arch/x86_64/kernel/mce.c	Mon Feb 16 21:30:34 2009 +0800
@@ -165,7 +165,7 @@
  * The actual machine check handler
  */
 
-void do_machine_check(struct pt_regs * regs, long error_code)
+asmlinkage void do_machine_check(struct pt_regs * regs, long error_code)
 {
 	struct mce m, panicm;
 	int nowayout = (tolerant < 1); 
@@ -276,9 +276,16 @@
 
 /*
  * Periodic polling timer for "silent" machine check errors.
- */
+ * We will disable polling in DOM0 since all CMCI/Polling
+ * mechanism will be done in XEN for Intel CPUs
+*/
 
+#if defined (CONFIG_XEN) && defined(CONFIG_X86_MCE_INTEL)
+static int check_interval = 0; /* disable polling */
+#else
 static int check_interval = 5 * 60; /* 5 minutes */
+#endif
+
 static void mcheck_timer(void *data);
 static DECLARE_WORK(mcheck_work, mcheck_timer, NULL);
 
@@ -649,6 +656,7 @@
 };
 #endif
 
+extern void bind_virq_for_mce(void);
 static __init int mce_init_device(void)
 {
 	int err;
@@ -664,6 +672,13 @@
 
 	register_hotcpu_notifier(&mce_cpu_notifier);
 	misc_register(&mce_log_device);
+
+    /*Register vIRQ handler for MCE LOG processing*/
+    printk(KERN_DEBUG "MCE: bind virq for DOM0 Logging\n");
+#if defined (CONFIG_XEN) && defined(CONFIG_X86_MCE_INTEL)
+    bind_virq_for_mce();
+#endif
+
 	return err;
 }
 
diff -r ca8ac5fc168c arch/x86_64/kernel/mce_dom0.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/arch/x86_64/kernel/mce_dom0.c	Mon Feb 16 21:30:34 2009 +0800
@@ -0,0 +1,90 @@
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <xen/interface/xen.h>
+#include <xen/evtchn.h>
+#include <xen/interface/vcpu.h>
+#include <asm/hypercall.h>
+#include <asm/mce.h>
+
+/*dom0 mce virq handler, this is called from polling or cmci*/
+static int convert_log(struct mc_info *mi)
+{
+	struct mcinfo_common *mic = NULL;
+	struct mcinfo_global *mc_global;
+	struct mcinfo_bank *mc_bank;
+	struct mce m;
+
+	x86_mcinfo_lookup(mic, mi, MC_TYPE_GLOBAL);
+	if (mic == NULL)
+	{
+		printk(KERN_ERR "DOM0_MCE_LOG: global data is NULL\n");
+		return -1;
+	}
+
+	mc_global = (struct mcinfo_global*)mic;
+	m.mcgstatus = mc_global->mc_gstatus;
+	m.cpu = mc_global->mc_coreid;/*for test*/
+	x86_mcinfo_lookup(mic, mi, MC_TYPE_BANK);
+	do
+	{
+		if (mic == NULL || mic->size == 0)
+			break;
+		if (mic->type == MC_TYPE_BANK)
+		{
+			mc_bank = (struct mcinfo_bank*)mic;
+			m.misc = mc_bank->mc_misc;
+			m.status = mc_bank->mc_status;
+			m.addr = mc_bank->mc_addr;
+			m.tsc = mc_bank->mc_tsc;
+			m.res1 = mc_bank->mc_ctl2;
+			m.bank = mc_bank->mc_bank;
+			printk(KERN_DEBUG "[CPU%d, BANK%d, addr %llx, state %llx]\n", 
+                m.cpu, m.bank, m.addr, m.status);
+			/*log this record*/
+			mce_log(&m);
+		}
+		mic = x86_mcinfo_next(mic);
+	}while (1);
+
+	return 0;
+}
+
+static irqreturn_t mce_dom0_interrupt(int irq, void *dev_id,
+									struct pt_regs *regs)
+{
+	xen_mc_t mc_op;
+	int result = 0;
+
+	printk(KERN_DEBUG "MCE_DOM0_LOG: enter dom0 mce vIRQ\n");
+	mc_op.cmd = XEN_MC_fetch;
+	mc_op.interface_version = XEN_MCA_INTERFACE_VERSION;
+	mc_op.u.mc_fetch.flags = XEN_MC_CORRECTABLE;
+	mc_op.u.mc_fetch.fetch_idx = 0;
+	memset(&mc_op.u.mc_fetch.mc_info, 0, sizeof(mc_op.u.mc_fetch.mc_info));
+	result = HYPERVISOR_mca(&mc_op);
+	if (result)
+		printk(KERN_WARNING "MCE_DOM0_LOG: fetch mce global data failed\n");
+	else
+	{
+		result = convert_log(&mc_op.u.mc_fetch.mc_info);
+		if (result)
+			printk(KERN_WARNING "MCE_DOM0_LOG: convert log failed\n");
+	}
+	return IRQ_HANDLED;
+}
+
+
+void bind_virq_for_mce(void)
+{
+	int ret;
+
+	ret  = bind_virq_to_irqhandler(VIRQ_ARCH_0, 0, 
+		mce_dom0_interrupt, 0, "mce", NULL);
+
+	if ( ret<0 )
+	{
+		printk(KERN_ERR "MCE_DOM0_LOG: bind_virq for DOM0 failed\n");
+	}
+}
+
diff -r ca8ac5fc168c include/asm-x86_64/mach-xen/asm/hypercall.h
--- a/include/asm-x86_64/mach-xen/asm/hypercall.h	Fri Feb 13 18:08:59 2009 +0800
+++ b/include/asm-x86_64/mach-xen/asm/hypercall.h	Mon Feb 16 21:30:34 2009 +0800
@@ -215,7 +215,13 @@
 	platform_op->interface_version = XENPF_INTERFACE_VERSION;
 	return _hypercall1(int, platform_op, platform_op);
 }
-
+static inline int __must_check
+HYPERVISOR_mca(
+	struct xen_mc *mc_op)
+{
+	mc_op->interface_version = XEN_MCA_INTERFACE_VERSION;
+	return _hypercall1(int, mca, mc_op);
+}
 static inline int __must_check
 HYPERVISOR_set_debugreg(
 	unsigned int reg, unsigned long value)
diff -r ca8ac5fc168c include/asm-x86_64/mach-xen/irq_vectors.h
--- a/include/asm-x86_64/mach-xen/irq_vectors.h	Fri Feb 13 18:08:59 2009 +0800
+++ b/include/asm-x86_64/mach-xen/irq_vectors.h	Mon Feb 16 21:30:34 2009 +0800
@@ -57,6 +57,7 @@
 #define LOCAL_TIMER_VECTOR	0xef
 #endif
 
+#define THERMAL_APIC_VECTOR	0xfa
 #define SPURIOUS_APIC_VECTOR	0xff
 #define ERROR_APIC_VECTOR	0xfe
 
diff -r ca8ac5fc168c include/xen/interface/arch-x86/xen-mca.h
--- a/include/xen/interface/arch-x86/xen-mca.h	Fri Feb 13 18:08:59 2009 +0800
+++ b/include/xen/interface/arch-x86/xen-mca.h	Mon Feb 16 21:30:34 2009 +0800
@@ -56,7 +56,7 @@
 /* Hypercall */
 #define __HYPERVISOR_mca __HYPERVISOR_arch_0
 
-#define XEN_MCA_INTERFACE_VERSION 0x03000001
+#define XEN_MCA_INTERFACE_VERSION 0x03000002
 
 /* IN: Dom0 calls hypercall from MC event handler. */
 #define XEN_MC_CORRECTABLE  0x0
@@ -132,6 +132,8 @@
     uint64_t mc_addr;   /* bank address, only valid
                          * if addr bit is set in mc_status */
     uint64_t mc_misc;
+    uint64_t mc_ctl2;
+    uint64_t mc_tsc;
 };
 
 

[-- Attachment #6: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16  5:35 [RFC] RAS(Part II)--MCA enalbing in XEN Ke, Liping
@ 2009-02-16 13:34 ` Christoph Egger
  2009-02-16 14:18   ` Christoph Egger
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Egger @ 2009-02-16 13:34 UTC (permalink / raw)
  To: Ke, Liping
  Cc: xen-devel, Frank.Vanderlinden@Sun.COM, Jiang, Yunhong,
	Keir Fraser, Gavin Maltby

[-- Attachment #1: Type: text/plain, Size: 4768 bytes --]


It seems to me that the design has not been understood,
and the code is turning into more and more unmaintainable
bloat. I mean, the code is going to do far too much.

- The MCE routines in Xen are only for error data *collection*.
  Just pass it to Dom0 and that's it.
  Dom0 will do the error analysis and figure out what to do.
  It is Dom0 that issues a hypercall to do things like
  page offlining or CPU offlining or whatever is needed.
  Your code tries to move everything back from Dom0 into the
  hypervisor. I remember Keir having rejected my MCE patches
  because he feared this bloat.

- Dom0 VIRQ is for correctable errors only. Uncorrectable errors
  are delivered via MCE trap. Dom0 and DomU register a handler
  via the set_trap_table hypercall. An unregistered handler means
  the guest can't handle the error by itself. Dom0 is always notified;
  the guest is notified only if it has registered a handler.
  This separation is completely ignored, and the Dom0 VIRQ is misused
  for everything (hence the bunch of superfluous flags, see next point).

- MCA flags: what is the difference between correctable
  and recoverable? What are the differences between the uncorrectable,
  polled, reset, cmci and mce types?

- You use dynamic memory allocation (which takes spinlocks) in MCE code,
  and you roll your own MCE handling instead of using the generic API
  in mce.c. I suppose you don't understand it at all.

- I attach the design document again, since I have the impression that
  no one at Intel has read it, hence the misunderstandings.
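
The delivery rule from the second point above can be sketched roughly as follows. This is only an illustration of the intended separation, not the mce.c API. All type and function names are made up, and the "guest gets the trap only if it registered a handler" condition is an assumption read from that point.

```c
#include <stdbool.h>

/* Hypothetical severity classes; real code would derive these from MCi_STATUS. */
enum mc_severity { MC_CORRECTABLE, MC_UNCORRECTABLE };

struct delivery {
    bool dom0_virq;   /* VIRQ to Dom0 (correctable errors only) */
    bool dom0_trap;   /* MCE trap to Dom0 */
    bool guest_trap;  /* MCE trap to the impacted guest */
};

/* Decide who is notified and how, given severity and guest handler state. */
static struct delivery mc_route(enum mc_severity sev, bool guest_has_handler)
{
    struct delivery d = { false, false, false };

    if (sev == MC_CORRECTABLE) {
        d.dom0_virq = true;               /* logging path only */
    } else {
        d.dom0_trap = true;               /* Dom0 is always notified */
        d.guest_trap = guest_has_handler; /* guest only if it registered */
    }
    return d;
}
```

The point of keeping this in one small decision function is that the hypervisor side stays limited to collection and delivery, while analysis and recovery policy remain in Dom0.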

I think it is best to get Gavin's generic MCE improvements upstream first.


On Monday 16 February 2009 06:35:14 Ke, Liping wrote:
> Hi, all
> These patches are for MCA enabling in XEN. It is sent as RFC firstly to
> collect some feedbacks for refinement if needed before the final patch. We
> also attach one description txt documents for your reference.
>
> Some implementation notes:
> 1) When error happens, if the error is fatal (pcc = 1) or can't be
>    recovered (pcc = 0, yet no good recovery methods), for avoiding losing
>    logs in DOM0, we will reset machine immediately. Most of MCA MSRs are
>    sticky. After reboot, MCA polling mechanism will send vIRQ to DOM0 for
>    logging.
> 2) When MCE# happens, all CPUs enter MCA context. The first CPU who
>    read&clear the error MSR bank will be this MCE# owner. Necessary
>    locks/synchronization will help to judge the owner and select most
>    severe error.
> 3) For convenience, we will select the most offending CPU to do most
>    of processing&recovery job.
> 4) MCE# happens, we will do three jobs:
>     a. Send vIRQ to DOM0 for logging
>     b. Send vMCE# to Impacted Guest (Currently Only inject to impacted
>        DOM0)
>     c. Guest vMCE MSR virtualization
> 5) Some further improvement/adds might be done if needed:
>     a) Impacted DOM judgement algorithm.
>     b) Now vMCE# injection is controlled by centralized data(vmce_data).
>        The injection algorithm is a bit complex. We might change the
>        algorithm which's based on PER_DOM data if you preferred.
>        Notes for understanding:
>         1) If several banks impact one domain, yet those banks belong to
>            the same pCPU, it will be injected only once.
>         2) If more than one bank impact one domain, yet error banks belong
>            to different pCPU, it will be injected nr_num(pCPU) times.
>         3) We use centralized data [two arrays impact_domid, impact_cpus
>            map in vmce_data] to represent the injection algorithm.
>            Combined the two array item (idx, impact_domid) and (idx,
>            impact_cpus) into one item (idx, impact_domid, impact_cpus).
>            This item records the impact_domain id and the error pCPU map
>            (Finding UC errors on this CPU which impact this domain). Then,
>            we can judge how to inject the vMCE (domid,
>            impact_times[nr_pCPUs]).
>         4) Although data structure is ready, we only inject vMCE# to DOM0
>            currently.
>     c) Connection with recovery actions (cpu/memory online/offline)
>     d) More refines and tests for HVM might be done when needed.
>
> Patch Description:
> 1. basic_mca_support: Enable MCA support in XEN.
> 2. vmsr_virtualization: Guest MCE# MSR read/write virtualization support in
>    XEN.
> 3. mce_dom0: Cooperating with XEN, DOM0 add vIRQ and vMCE# handler.
>    Translate XEN log to DOM0, re-use Linux kernel and MCELOG mechanisms and
>    MCE handler. This is mainly a demonstration patch.
>
> About Test:
> We did some internal test and the result is just fine.
>
> Any feedback is welcome and thanks a lot for your help! :-)
> Regards,
> Criping
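
The injection-count rule in the quoted notes (one vMCE# per distinct pCPU
that reported an error bank impacting a domain, however many banks that pCPU
has) can be sketched roughly as below. This is only an illustration under
assumptions: struct bank_error and vmce_injection_count() are made-up names,
not taken from the patch, and the pCPU map is limited to 64 CPUs for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one uncorrected error bank and the domain it hits. */
struct bank_error {
    unsigned int pcpu;   /* physical CPU that owns the bank (< 64 here) */
    unsigned int domid;  /* impacted domain */
};

/*
 * Number of vMCE# injections for `domid`: one per distinct pCPU that
 * reported at least one error bank impacting that domain.
 */
static unsigned int vmce_injection_count(const struct bank_error *errs,
                                         size_t n, unsigned int domid)
{
    uint64_t impact_cpus = 0;  /* bit i set: pCPU i has an error for domid */
    unsigned int count = 0;

    for (size_t i = 0; i < n; i++)
        if (errs[i].domid == domid && errs[i].pcpu < 64)
            impact_cpus |= UINT64_C(1) << errs[i].pcpu;

    /* Count the set bits by clearing the lowest one each round. */
    for (; impact_cpus != 0; impact_cpus &= impact_cpus - 1)
        count++;
    return count;
}
```

Two banks on pCPU 0 plus one bank on pCPU 2, all impacting the same domain,
would give a count of 2 under this reading of the notes.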



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

[-- Attachment #2: xen_mca.pdf --]
[-- Type: application/pdf, Size: 325726 bytes --]

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16 13:34 ` Christoph Egger
@ 2009-02-16 14:18   ` Christoph Egger
  2009-02-16 15:03     ` Keir Fraser
  2009-02-16 15:05     ` Jiang, Yunhong
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Egger @ 2009-02-16 14:18 UTC (permalink / raw)
  To: xen-devel
  Cc: Frank.Vanderlinden@Sun.COM, Jiang, Yunhong, Keir Fraser,
	Gavin Maltby, Ke, Liping


I realize from this and earlier MCE patches from Intel
that Intel is trying to change the machine check design
on its own.

The basic ideas behind the current design:

1. Xen collects error telemetry
2. Xen delivers correctable errors to Dom0 via VIRQ
3. Xen delivers uncorrectable errors to Dom0 via trap handler
4. Xen delivers uncorrectable errors to DomU only if Dom0 tells Xen to do so
5. Xen carries out health measures, such as CPU or page offlining,
   as told by Dom0 via hypercalls
6. Dom0 performs error analysis, figures out what is going on, and
   issues hypercalls for the right health measure
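
As a rough illustration of points 2 to 4, the delivery decision could be
written down as follows. All names here (mc_severity, mc_delivery, mc_route,
route_error) are hypothetical and exist only for this sketch; they are not
Xen identifiers, and the real code paths are more involved.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; none of these come from Xen. */
enum mc_severity { MC_CORRECTABLE, MC_UNCORRECTABLE };
enum mc_delivery { DELIVER_NONE, DELIVER_VIRQ, DELIVER_TRAP };

struct mc_route {
    enum mc_delivery dom0;  /* how Dom0 learns about the error */
    enum mc_delivery domu;  /* how the affected DomU learns about it */
};

/*
 * Correctable errors reach Dom0 through the VIRQ, uncorrectable errors
 * through the MCE trap handler. A DomU gets the trap only when Dom0 asked
 * Xen to forward it and the DomU has registered a trap handler.
 */
static struct mc_route route_error(enum mc_severity sev,
                                   bool dom0_requests_domu_notify,
                                   bool domu_has_trap_handler)
{
    struct mc_route r = { DELIVER_NONE, DELIVER_NONE };

    if (sev == MC_CORRECTABLE) {
        r.dom0 = DELIVER_VIRQ;
        return r;
    }
    r.dom0 = DELIVER_TRAP;
    if (dom0_requests_domu_notify && domu_has_trap_handler)
        r.domu = DELIVER_TRAP;
    return r;
}
```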


The basic ideas behind Intel's new design (as far as I can tell from the
patches I have seen so far):

1. Xen collects error telemetry
2. Xen performs error analysis, figures out what is going on
3. Xen automatically carries out health measures
   like CPU and page offlining
4. Xen delivers error telemetry to Dom0 via VIRQ for error logging only,
   independent of the error type
5. Inject MCEs into the guest directly
6. Don't use the MCE trap handler at all


IMO, any design change should be discussed first and not made
silently, since this will confuse everyone: no one will know
what the right thing to do is in Xen and in Dom0, and this
in turn will lead to error-prone, unmaintainable code in both
Xen and Dom0.

Christoph






^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16 14:18   ` Christoph Egger
@ 2009-02-16 15:03     ` Keir Fraser
  2009-02-16 15:19       ` Jiang, Yunhong
  2009-02-16 17:58       ` Frank Van Der Linden
  2009-02-16 15:05     ` Jiang, Yunhong
  1 sibling, 2 replies; 45+ messages in thread
From: Keir Fraser @ 2009-02-16 15:03 UTC (permalink / raw)
  To: Christoph Egger, xen-devel
  Cc: Frank.Vanderlinden@Sun.COM, Jiang, Yunhong, Gavin Maltby, Ke, Liping

On 16/02/2009 14:18, "Christoph Egger" <Christoph.Egger@amd.com> wrote:

> IMO, any design change should be discussed first and not changed
> silently, since this will confuse everyone and noone will know
> what is the right thing to do in Xen and in Dom0 and this
> in turn will lead to error prone, unmaintainable code in both
> Xen and in Dom0

I certainly think we should have a shared approach for x86 machine-check
handling, rather than completely different architectures for AMD and Intel.
Fortunately Sun are an interested and active third party regarding this
feature. I'll be interested in their opinion.

 -- Keir

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16 14:18   ` Christoph Egger
  2009-02-16 15:03     ` Keir Fraser
@ 2009-02-16 15:05     ` Jiang, Yunhong
  1 sibling, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-16 15:05 UTC (permalink / raw)
  To: Christoph Egger, xen-devel
  Cc: Maltby, Frank.Vanderlinden@Sun.COM, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 10161 bytes --]

Aha, Christoph, sorry for surprising you, but I think we have already described our suggestion to you (please refer to http://markmail.org/message/vpcdojylxkrg6uz3). As I didn't get any response from your side, I supposed you were waiting for the patch to get a better idea; that's why Criping and I hurried to cook the patch and send it out as an RFC. RFC means it is meant for comments, as we know MCA handling is complex and needs community discussion (I have to say a patch is sometimes clearer than a design doc, although cooking a patch takes more effort).

Your description of our design is quite clear, which also means our RFC has achieved its purpose :-) One exception is item 6: the MCE trap handler on the HV side is still needed for PV domains just as it is now (the bounce buffer, the trap priority, etc.), but for the guest, yes, we try to re-use the guest's MCA handler.

As said already, MCE handling is complex, so can we discuss in detail how to handle the MCA and reach some consensus? We have CC'ed all the engineers we think may be interested in it.

I merge my comments on your other mail below:

>- The MCE routines in Xen are only for error data *collection*.
>  Just pass it to Dom0 and that's it.
>  Dom0 will do the error analysis and figure out what do to.
>  It is the Dom0 which will do a hypercall to do things like
>  page-offlining or cpu offlining or whatever is needed.
>  Your code tries to move everyting back from Dom0 into the
>  hypervisor. I remember Keir having rejected my MCE patches
>  because he feared this bloat.

Sorry, I didn't notice Keir's feedback on your original patch. I will google it, or it would be great if you could tell me when that happened.

>- MCA flags: what are the differences between correctable 
>  and recoverable ? what are the differences between uncorrectable,
>  polled, reset and cmci and mce types ?

As I understand it, a correctable error (sometimes called a corrected error) means the hardware has recovered from the error and software is not impacted (although some proactive action is preferred), while recoverable means the hardware did not recover from the error but software may be able to recover from it (it is something like a non-fatal error in the PCI-E spec, although not exactly the same, I think).
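
To make the distinction concrete, here is a minimal sketch of how a bank's
MCi_STATUS value maps onto these classes. The VAL, UC and PCC bit positions
are the architectural ones; the enum and the function name are assumptions
made up for this example and are not taken from the patch. Whether a
UC error with PCC = 0 is actually recovered still depends on software having
a recovery method for it, as the RFC notes.

```c
#include <stdint.h>

/* Architectural MCi_STATUS bits. */
#define MCi_STATUS_VAL (UINT64_C(1) << 63)  /* register holds a valid error */
#define MCi_STATUS_UC  (UINT64_C(1) << 61)  /* error was not corrected by HW */
#define MCi_STATUS_PCC (UINT64_C(1) << 57)  /* processor context corrupted */

/* Hypothetical classification used only in this sketch. */
enum mc_class { MC_NONE, MC_CORRECTED, MC_RECOVERABLE, MC_FATAL };

static enum mc_class classify_status(uint64_t status)
{
    if (!(status & MCi_STATUS_VAL))
        return MC_NONE;         /* nothing logged in this bank */
    if (!(status & MCi_STATUS_UC))
        return MC_CORRECTED;    /* hardware already fixed it */
    if (status & MCi_STATUS_PCC)
        return MC_FATAL;        /* context corrupted: cannot continue */
    return MC_RECOVERABLE;      /* uncorrected, software may recover */
}
```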

>
>- You use dynamic memory allocation (which uses spinlocks) in MCE code
>  and you roll your own mce handling instead of using the 
>generic API in mce.c

I think that is in softirq context, so spinlocks should be OK.
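
For any logging that does have to happen directly in #MC context, one common
way to sidestep the concern about the allocator's spinlocks is a preallocated
pool with lock-free slot reservation. The sketch below uses C11 atomics and
invented names (mc_record, mc_pool, mc_record_reserve); it is an assumption
for illustration only, not code from the patch or from mce.c, and it leaves
out draining and resetting the pool, which would run later outside #MC
context.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MC_POOL_SLOTS 16    /* hypothetical fixed capacity */

/* Hypothetical telemetry record filled in by the #MC handler. */
struct mc_record {
    uint64_t status;
    uint64_t addr;
    uint64_t misc;
};

static struct mc_record mc_pool[MC_POOL_SLOTS];
static atomic_uint mc_pool_next;   /* index of the next free slot */

/*
 * Reserve one record without taking any lock. Returns NULL once the pool
 * is exhausted, so the caller can drop or count the lost record instead
 * of blocking.
 */
static struct mc_record *mc_record_reserve(void)
{
    unsigned int idx = atomic_fetch_add(&mc_pool_next, 1);

    if (idx >= MC_POOL_SLOTS)
        return NULL;
    return &mc_pool[idx];
}
```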

>  I suppose, you don't understand it at all.
>
>- I attach the design document again, since I have the 
>impression, noone
>  at Intel read it, hence the misunderstandings.

I promise we read it carefully, otherwise my manager would surely challenge me before you did, and it is really well written.

>
>I think, it is best to get Gavin's generic mce improvements 
>upstream first.
Sure, Gavin's improvement is important. Again, this patch is just an RFC, and some components, like per-domain MCA injection, are still WIP since we want to get input first.

Thanks
Yunhong Jiang



^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16 15:03     ` Keir Fraser
@ 2009-02-16 15:19       ` Jiang, Yunhong
  2009-02-16 17:58       ` Frank Van Der Linden
  1 sibling, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-16 15:19 UTC (permalink / raw)
  To: Keir Fraser, Christoph Egger, xen-devel
  Cc: Frank.Vanderlinden@Sun.COM, kaz, Gavin Maltby, Ke, Liping

[-- Attachment #1: Type: text/plain, Size: 1204 bytes --]

 

>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com] 
>Sent: 2009年2月16日 23:03
>To: Christoph Egger; xen-devel@lists.xensource.com
>Cc: Ke, Liping; Frank.Vanderlinden@Sun.COM; Jiang, Yunhong; 
>Gavin Maltby
>Subject: Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
>
>On 16/02/2009 14:18, "Christoph Egger" <Christoph.Egger@amd.com> wrote:
>
>> IMO, any design change should be discussed first and not changed
>> silently, since this will confuse everyone and noone will know
>> what is the right thing to do in Xen and in Dom0 and this
>> in turn will lead to error prone, unmaintainable code in both
>> Xen and in Dom0
>
>I certainly think we should have a shared approach for x86 
>machine-check
>handling, rather than completely different architectures for 
>AMD and Intel.
>Fortunately Sun are an interested and active third party regarding this
>feature. I'll be interested in their opinion.

Yes, we don't want a difference here; we changed only mce-intel.c because this is just for discussion.
And I remember SUZUKI Kazuhiro is also interested in this topic (now CC'ed).

Thanks
-- Yunhong Jiang
 
>
> -- Keir
>
>
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16 15:03     ` Keir Fraser
  2009-02-16 15:19       ` Jiang, Yunhong
@ 2009-02-16 17:58       ` Frank Van Der Linden
  2009-02-17  5:50         ` Frank Van Der Linden
  2009-02-17  6:41         ` Jiang, Yunhong
  1 sibling, 2 replies; 45+ messages in thread
From: Frank Van Der Linden @ 2009-02-16 17:58 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Gavin Maltby, Christoph Egger, xen-devel, Jiang, Yunhong, Ke, Liping

Keir Fraser wrote:
> On 16/02/2009 14:18, "Christoph Egger" <Christoph.Egger@amd.com> wrote:
>
>   
>> IMO, any design change should be discussed first and not changed
>> silently, since this will confuse everyone and noone will know
>> what is the right thing to do in Xen and in Dom0 and this
>> in turn will lead to error prone, unmaintainable code in both
>> Xen and in Dom0
>>     
>
> I certainly think we should have a shared approach for x86 machine-check
> handling, rather than completely different architectures for AMD and Intel.
> Fortunately Sun are an interested and active third party regarding this
> feature. I'll be interested in their opinion.
>
>  -- Keir
>
>
>   
Today is a holiday here in the US, so I have only taken a superficial 
look at the patches.

However, my initial impression is that I share Christoph's concern. I 
like the original design, where the hypervisor deals with low-level 
information collection, passes it on to dom0, which then can make a 
high-level decision and instructs the hypervisor to take high-level 
action via a hypercall. The hypervisor does the actual MSR reads and 
writes, dom0 only acts on the values provided via hypercalls.

We added the physcpuinfo hypercall to stay in this framework: get 
physical information needed for analysis, but don't access any registers 
directly.

It seems that these new patches blur this distinction, especially the 
virtualized msr reads/writes. I am not sure what added value they have, 
except for being able to run an unmodified MCA handler. However, I think 
that any active MCA decision making should be centralized, and that 
centralized place would be dom0. Dom0 is already very much aware of the 
hypervisor, so I don't see the advantage of having an unmodified MCA 
handler there (our MCA handlers are virtually unmodified, it's just that 
the part where the telemetry is collected is inside Xen for the dom0 case).

I also agree that different behavior for AMD and Intel chips would not 
be good.

Perhaps the Intel folks can explain what the advantages of their 
approach are, and give some scenarios where their approach would be 
better? My first impression is that staying within the general framework 
as provided by Christoph's original work is the better option.

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16 17:58       ` Frank Van Der Linden
@ 2009-02-17  5:50         ` Frank Van Der Linden
  2009-02-17  6:44           ` Jiang, Yunhong
  2009-02-17  6:53           ` Jiang, Yunhong
  2009-02-17  6:41         ` Jiang, Yunhong
  1 sibling, 2 replies; 45+ messages in thread
From: Frank Van Der Linden @ 2009-02-17  5:50 UTC (permalink / raw)
  To: Keir Fraser
  Cc: Gavin Maltby, Christoph Egger, xen-devel, Ke, Liping, Jiang, Yunhong

I should probably clarify myself, since I may have created one wrong 
impression: I don't object to the parts of the Intel code where the 
hypervisor does more of the initial work (like is also done in the page 
offline code); it can be critical that this work is done quickly, and 
the hypervisor is the only place that has both the information and the 
means to do it.

So, doing some more work there in some cases is probably the best thing 
to do, even though there is natural resistance to adding more code to 
the hypervisor.

The main thing that I don't quite understand the benefits of is the vMCE 
code, which is why I asked if there are examples of where that approach 
would work better.

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-16 17:58       ` Frank Van Der Linden
  2009-02-17  5:50         ` Frank Van Der Linden
@ 2009-02-17  6:41         ` Jiang, Yunhong
  2009-02-18 18:05           ` Christoph Egger
  1 sibling, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-17  6:41 UTC (permalink / raw)
  To: Frank.Vanderlinden, Keir Fraser; +Cc: Maltby, Christoph Egger, xen-devel

I think the major differences are: a) how to handle the #MC, i.e. reset the system, decide the impacted components, take recovery actions like page offlining, etc.; b) how to handle errors that impact a guest. As for other items like log/telemetry, I think our implementation is not much different from the current implementation.

As for how to handle the #MC, we think keeping #MC handling in the hypervisor has the following benefits:
a) When an #MC happens, we need to take action to reduce the severity of the error as soon as possible. After all, #MC is different from a normal interrupt.
b) Even if Dom0 takes the central action, most of the work will still be invoking hypercalls to ask the Xen HV to act.
c) Currently every #MC first goes through Dom0 before being injected into a DomU, but we don't see much benefit in that path, since the HV knows the guest quite well.

The above is the main reason we keep #MC handling in the Xen HV.

As for how to handle errors that impact a guest, I tried to describe the options in http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00643.html. Basically we have 3 options (you can refer to the above URL for more information):
1) A PV #MC handler is implemented in the guest. This PV handler gets MCA information from the Xen HV through a hypercall; this is what is currently implemented.
2) Xen provides MCA MSR virtualization so that the guest's native #MC handler can run without changes.
3) A PV #MC handler is used in the guest as in option 1, but the interface between Xen and the guest is an abstract event, like offline the offending page, terminate the current execution context, etc.

We selected option 2 in our current implementation, with the following considerations:
1) With this method, we can re-use the native MCE handler, which may be tested more widely.
2) We can benefit from improvements to the native MCE handler.
3) It can support HVM guests better; in particular, this method can support HVM and PV guests at the same time.
4) We no longer need to maintain a PV handler for the various guest types.
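
To give an idea of what option 2 involves, here is a minimal sketch of the
read side of MCA MSR virtualization: the guest's native handler executes
rdmsr, and the hypervisor answers from per-domain virtual bank state instead
of the physical registers. The MSR numbers and the four-MSR-per-bank layout
are the architectural ones; struct vmce_state, VMCE_NR_BANKS and vmce_rdmsr()
are hypothetical names for this sketch and do not reflect the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MSR numbers. */
#define MSR_IA32_MCG_CAP    0x179
#define MSR_IA32_MCG_STATUS 0x17a
#define MSR_IA32_MC0_CTL    0x400  /* bank i starts at 0x400 + 4 * i */

#define VMCE_NR_BANKS 2            /* hypothetical virtual bank count */

/* Hypothetical per-domain virtual MCA state. */
struct vmce_bank { uint64_t ctl, status, addr, misc; };
struct vmce_state {
    uint64_t mcg_status;
    struct vmce_bank bank[VMCE_NR_BANKS];
};

/*
 * Emulate a guest rdmsr of an MCA register. Returns false when the MSR is
 * not one this sketch virtualizes, so the caller can fall back to its
 * default handling.
 */
static bool vmce_rdmsr(const struct vmce_state *v, uint32_t msr, uint64_t *val)
{
    if (msr == MSR_IA32_MCG_CAP) {
        *val = VMCE_NR_BANKS;      /* bank count lives in bits 7:0 */
        return true;
    }
    if (msr == MSR_IA32_MCG_STATUS) {
        *val = v->mcg_status;
        return true;
    }
    if (msr >= MSR_IA32_MC0_CTL &&
        msr < MSR_IA32_MC0_CTL + 4 * VMCE_NR_BANKS) {
        const struct vmce_bank *b = &v->bank[(msr - MSR_IA32_MC0_CTL) / 4];

        switch ((msr - MSR_IA32_MC0_CTL) % 4) {
        case 0:  *val = b->ctl;    break;
        case 1:  *val = b->status; break;
        case 2:  *val = b->addr;   break;
        default: *val = b->misc;   break;
        }
        return true;
    }
    return false;
}
```

When a vMCE# is injected, the hypervisor would first fill the virtual bank
with the error values relevant to that guest, so the unmodified handler sees
them on its normal rdmsr path.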

One disadvantage of this option is that the guest (dom0) loses the physical CPU information.

We think it would be much better if we could define a clear abstract interface between Xen and the guest, i.e. option 3, but even in that situation, the current implementation can be the method of last resort if the guest has no PV abstract event handler installed.

In particular we apply this method to Dom0, because once all #MC handling is placed in the Xen HV, dom0's MCE handler is the same as a normal guest's and we don't need to differentiate it anymore; you can see that the changes to dom0 for MCA are very small now. BTW, one assumption here is that all of dom0's log/telemetry goes through the VIRQ handler, while Dom0's #MC is just for its own recovery.

Of course, currently keeping the system running is far more important than the guest #MC, and we can simply kill the impacted guest. We implement the virtual MSR read/write mainly for Dom0 support (or maybe even dom0 can be killed for now, since it still can't do much recovery).

Thanks
Yunhong Jiang


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-17  5:50         ` Frank Van Der Linden
@ 2009-02-17  6:44           ` Jiang, Yunhong
  2009-02-17  6:53           ` Jiang, Yunhong
  1 sibling, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-17  6:44 UTC (permalink / raw)
  To: Frank.Vanderlinden, Keir Fraser
  Cc: Gavin Maltby, Christoph Egger, xen-devel, Ke, Liping

[-- Attachment #1: Type: text/plain, Size: 1355 bytes --]

 

>-----Original Message-----
>From: Frank.Vanderlinden@Sun.COM [mailto:Frank.Vanderlinden@Sun.COM] 
>Sent: February 17, 2009 13:50
>To: Keir Fraser
>Cc: Gavin Maltby; Christoph Egger; 
>xen-devel@lists.xensource.com; Jiang, Yunhong; Ke, Liping
>Subject: Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
>
>I should probably clarify myself, since I may have created one wrong 
>impression: I don't object to the parts of the Intel code where the 
>hypervisor does more of the initial work (like is also done in 
>the page 
>offline code); it can be critical that this work is done quickly, and 
>the hypervisor is the only place that has both the information and the 
>means to do it.

Yes, agree.

>
>So, doing some more work there in some cases is probably the 
>best thing 
>to do, even though there is natural resistance to adding more code to 
>the hypervisor.

We all agree to keep the HV code small, and we will try to reduce the LOC in the next round of patches.

>
>The main thing that I don't quite understand the benefits of 
>is the vMCE 
>code, which is why I asked if there are examples of where that 
>approach 
>would work better.

Please see the mail I just sent out; you can also refer to http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00643.html.

Thanks
Yunhong Jiang

>
>- Frank
>
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-17  5:50         ` Frank Van Der Linden
  2009-02-17  6:44           ` Jiang, Yunhong
@ 2009-02-17  6:53           ` Jiang, Yunhong
  1 sibling, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-17  6:53 UTC (permalink / raw)
  To: Jiang, Yunhong, Frank.Vanderlinden, Keir Fraser
  Cc: Gavin Maltby, Christoph Egger, xen-devel, Ke, Liping


>>So, doing some more work there in some cases is probably the 
>>best thing 
>>to do, even though there is natural resistance to adding more code to 
>>the hypervisor.
>
>We all agree to keep HV less code, and we will try to reduce 
>the LOC in next round patch.

BTW, some changes in the Xen HV are needed no matter whether we place #MC handling in Xen or in dom0, such as the ownership CPU check, selecting the most severe CPU, the post handler in softIRQ, etc., which are complex as well.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-17  6:41         ` Jiang, Yunhong
@ 2009-02-18 18:05           ` Christoph Egger
  2009-02-19  9:13             ` Jiang, Yunhong
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Egger @ 2009-02-18 18:05 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: Frank.Vanderlinden@Sun.COM, xen-devel, Keir Fraser, Gavin Maltby,
	Ke, Liping

On Tuesday 17 February 2009 07:41:29 Jiang, Yunhong wrote:
> I think the major difference including: a) How to handle the #MC, i.e.
> reset system, decide impacted components, take recover action like page
> offline etc. b) How to handle error impact guest. As to other item like
> log/telemetry, I think our implementation didn't have much different to
> current implementation.

The hardware doesn't know what recovery actions the software can take.
If page A is faulty, and software maintains a copy in page B, then
software can turn an uncorrectable error into a correctable one.
If the hardware is aware of that copy (memory mirroring done by the memory
controller), then the hardware itself turns the uncorrectable error
into a correctable one and reports a correctable error.

Therefore, I don't see why flags other than correctable and uncorrectable
are needed at all.


After some thinking about taking some quick actions, I can
agree to it if it meets the condition below. Be aware that error analysis
is highly CPU vendor and even CPU family/model specific. Doing a
complete analysis as Solaris does would blow Xen up a *lot*.

Therefore, a *cheap* error analysis must be enough to figure out
whether recovery actions like page offlining or CPU offlining
are *obviously* the only right thing to do.

If this is not the case, then let Dom0 decide what to do.

Christoph


-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-18 18:05           ` Christoph Egger
@ 2009-02-19  9:13             ` Jiang, Yunhong
  2009-02-19 16:25               ` Christoph Egger
  0 siblings, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-19  9:13 UTC (permalink / raw)
  To: Christoph Egger
  Cc: Frank.Vanderlinden@Sun.COM, xen-devel, Keir Fraser, Ke, Liping,
	Gavin Maltby

xen-devel-bounces@lists.xensource.com <> wrote:
> On Tuesday 17 February 2009 07:41:29 Jiang, Yunhong wrote:
>> I think the major difference including: a) How to handle the #MC, i.e.
>> reset system, decide impacted components, take recover action like page
>> offline etc. b) How to handle error impact guest. As to other item like
>> log/telemetry, I think our implementation didn't have much different to
>> current implementation.
> 
> The hardware doesn't know what recover actions the software can do.
> If page A is faulty, and software maintains a copy in page B, then
> software can turn an uncorrectable error into an correctable one.
> If the hardware is aware of that copy (memory mirroring done by memory
> controller), then the hardware itself turns the uncorrectable error
> into an correctable one and reports an correctable error.
> 
> Therefore, I don't see why other flags than correctable and uncorrectable
> are needed at all.

Christoph, thanks for your reply.

I think recoverable means the VMM/OS can take a recovery action such as page offline, while unrecoverable means the VMM/OS can't do anything and we have to reboot. The main reason we need these flags is that MCA handling requires several steps. For example, when multiple MCEs hit multiple CPUs, each CPU first checks its own severity, and then we need to find the CPU with the most severe error and take action. For example, CPU A may get unrecoverable while CPU B gets recoverable; they will compare the information and the results, and the final decision will be unrecoverable.
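
As a rough illustration of the two-step severity check described above (not code from the patches), the merge could look like the C sketch below. The enum values and the names mce_worst_cpu and mce_final_severity are hypothetical; the only assumption is that severities are ordered so that a larger value is worse.

```c
#include <stddef.h>

/* Hypothetical severity levels, ordered so that a larger value is worse. */
enum mce_severity {
    MCE_SEV_NONE = 0,     /* no error seen on this CPU */
    MCE_SEV_CORRECTED,    /* hardware already corrected the error */
    MCE_SEV_RECOVERABLE,  /* software can act, e.g. page offline */
    MCE_SEV_FATAL         /* nothing can be done, reset required */
};

/*
 * Step 2 of the scheme: after every CPU has recorded its own severity in
 * sev[], pick the CPU with the worst one. Returns -1 when nr_cpus is 0.
 * Ties go to the lowest-numbered CPU.
 */
int mce_worst_cpu(const enum mce_severity *sev, size_t nr_cpus)
{
    int worst = -1;

    for (size_t i = 0; i < nr_cpus; i++) {
        if (worst < 0 || sev[i] > sev[(size_t)worst])
            worst = (int)i;
    }
    return worst;
}

/* The system-wide result is simply the severity of the worst CPU. */
enum mce_severity mce_final_severity(const enum mce_severity *sev,
                                     size_t nr_cpus)
{
    int worst = mce_worst_cpu(sev, nr_cpus);

    return worst < 0 ? MCE_SEV_NONE : sev[worst];
}
```

With CPU A reporting fatal and CPU B reporting recoverable, the final result is fatal, matching the example in the paragraph above. Real code would also need the locking/synchronization between CPUs that the RFC mentions; that part is left out here.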

> 
> 
> After some thinking on taking some quick actions, I can
> agree on it if it meets the condition below. Be aware, error analyzes
> is highly CPU vendor and even CPU family/model specific. Doing a
> complete analyzes as Solaris does blows Xen up a *lot*.

I didn't check the Solaris code, so can Gavin or Frank give us more information? At least currently it will not be large AFAIK, and if we do need model-specific support (I don't know of such a requirement now, and I suppose it will not be common if it exists; please correct me if I'm wrong), dom0 can inform Xen of it.
 
> 
> Therefore, a *cheap* error analysis must be enough to figure out
> if recover actions like page-offlining or cpu offlining
> are *obviously* only the right thing to do.

Currently we only plan to support these two types. Do you have plans for other recovery actions? And would those actions be done better in Dom0 than in Xen?

Thanks
-- Yunhong Jiang

> 
> If this is not the case, then let Dom0 decide what to do.

> 
> Christoph
> 
> 
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-19  9:13             ` Jiang, Yunhong
@ 2009-02-19 16:25               ` Christoph Egger
  2009-02-20  2:53                 ` Jiang, Yunhong
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Egger @ 2009-02-19 16:25 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: Frank.Vanderlinden@Sun.COM, xen-devel, Keir Fraser, Ke, Liping,
	Gavin Maltby

On Thursday 19 February 2009 10:13:18 Jiang, Yunhong wrote:
> xen-devel-bounces@lists.xensource.com <> wrote:
> > On Tuesday 17 February 2009 07:41:29 Jiang, Yunhong wrote:
> >> I think the major difference including: a) How to handle the #MC, i.e.
> >> reset system, decide impacted components, take recover action like page
> >> offline etc. b) How to handle error impact guest. As to other item like
> >> log/telemetry, I think our implementation didn't have much different to
> >> current implementation.
> >
> > The hardware doesn't know what recover actions the software can do.
> > If page A is faulty, and software maintains a copy in page B, then
> > software can turn an uncorrectable error into an correctable one.
> > If the hardware is aware of that copy (memory mirroring done by memory
> > controller), then the hardware itself turns the uncorrectable error
> > into an correctable one and reports an correctable error.
> >
> > Therefore, I don't see why other flags than correctable and uncorrectable
> > are needed at all.
>
> Christoph, thanks for your reply.
>
> I think recoverable means VMM/OS can take recover action like page offline,
> while unrecoverable means VMM/OS can't do anything and we have to reboot.

Ok, here is a different interpretation of what is correctable and 
uncorrectable.
Uncorrectable in your interpretation means neither hardware nor software can
do anything.
Uncorrectable in my interpretation means the hardware can't correct it, but 
software may have more information and correct it.

> The main reason we need these flag is, several step is required for MCA
> handling, for example, when multiple MCE happen to multiple CPU, firstly
> each CPU check it's own severity, seconldy we need check the most severity
> CPU and take action. For example, CPU A may get unrecoverable  while CPU B 
> get recoverable, they will check the information and the result, and the
> final solution will be unrecoverable .

I brought up an example of a broken memory page for my argument;
you bring up a broken CPU for yours.

We need to find a common denominator to compare.

If a CPU is completely broken and you are on UP, then the game is over.
Not even a reboot can help.
On an SMP system, offline the CPU and inform Dom0.

> > After some thinking on taking some quick actions, I can
> > agree on it if it meets the condition below. Be aware, error analyzes
> > is highly CPU vendor and even CPU family/model specific. Doing a
> > complete analyzes as Solaris does blows Xen up a *lot*.
>
> I didn't check Solaris code, so can Gavin or Frank gives us more
> information? At least currently it will not be large AFAIK, and if we do
> need model specific support (I don't know such requirement now, and I
> suppose it will not be common if exists, please correct me if wrong), dom0
> can inform Xen for it.
>
> > Therefore, a *cheap* error analysis must be enough to figure out
> > if recover actions like page-offlining or cpu offlining
> > are *obviously* only the right thing to do.
>
> Currently we only plan to support these two types, do you have plan for
> other recover action? And is that action be done better in Dom0 than in
> Xen?

Yes!! Solaris maintains a list of broken pages which is even persistent
across reboots as long as the serial number of the DIMM didn't change.
For doing page offlining properly, SUN should design a hypercall allowing
Dom0 to give Xen this list as early as possible at boot time.

Further, with our Shanghai CPU, we can disable certain parts of its L3 cache.
Instead of offlining that broken CPU completely, just disable the broken
part of it. The registers for this are in PCI config space. Since Xen delegates
PCI access to Dom0, Dom0 can do that.

Christoph

-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-19 16:25               ` Christoph Egger
@ 2009-02-20  2:53                 ` Jiang, Yunhong
  2009-02-20 21:01                   ` Frank van der Linden
  0 siblings, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-20  2:53 UTC (permalink / raw)
  To: Christoph Egger
  Cc: Frank.Vanderlinden@Sun.COM, xen-devel, Keir Fraser, Ke, Liping,
	Gavin Maltby

[-- Attachment #1: Type: text/plain, Size: 3934 bytes --]

Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
> Ok, here is a different interpretation of what is correctable and
> uncorrectable. Uncorrectable in your interpretation means neither hardware
> nor software can't
> do anything.
> Uncorrectable in my interpretation means the hardware can't
> correct it, but
> software may have more information and correct it.

Yes. Maybe "fatal" is a more appropriate name here.

> 
>> The main reason we need these flag is, several step is required for MCA
>> handling, for example, when multiple MCE happen to multiple CPU, firstly
>> each CPU check it's own severity, seconldy we need check the most severity
>> CPU and take action. For example, CPU A may get unrecoverable  while CPU B
>> get recoverable, they will check the information and the result, and the
>> final solution will be unrecoverable .
> 
> I brought up an example of a broken memory page for my argumentation,
> you bring up a broken CPU for your argumentation.
> 
> We need to find a common denominator to compare.
> 
> If a CPU is completely broken and you are on UP, then game is over. Not
> even a reboot can help. On a SMP system, offline the CPU and inform Dom0.

Sorry, I didn't get the relationship between the flags and the comparison of the two examples :$

>> Currently we only plan to support these two types, do you have plan for
>> other recover action? And is that action be done better in Dom0 than in
>> Xen?
> 
> Yes!! Solaris maintains a list of broken pages which is even persistent
> across reboot when the serial number of the DIMM didn't change.
> For doing page offlining properly, SUN should design a
> hypercall allowing
> the Dom0 to give Xen this list as early as possible at boot time.

We have a patch to support page offline (sent as an RFC to the mailing list), and it already exports a hypercall for Dom0 to ask Xen to offline pages (this is for proactive action on CE errors from Dom0). Also, as Frank suggested, we will add a hypercall for Dom0 to get a page's offline status, so it should be OK.

> Further, with our Shanghai CPU, we can disable certain parts
> of its L3 cache.
> Instead of offlining that broken CPU completely, just disable
> the broken
> part of it. The registers for this is in PCI config space.
> Since Xen delegates
> PCI access to Dom0, Dom0 can do that.

Sorry, I have no knowledge of Shanghai, but I'm a bit surprised that when an error happens in the cache, we would transfer control to Dom0 and wait for Dom0's MCA handler to take action to disable the cache; it is really a loooong code path. Per my understanding, if there is an issue in the cache, we should clear/disable the cache ASAP to avoid more severe results, and this is an extreme example of a case for letting Xen handle the MCA. Or maybe I missed something important in this feature?

BTW, I want to clarify that this patch is for #MC handling (i.e. the "uncorrected" error in your terms). For hardware-corrected errors (i.e. "correctable"), Xen will do nothing but pass them to Dom0 via vIRQ, as our previous patch (http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00970.html) showed, because a CE will not impact the system. So if the "cache index disable" is meant to disable part of the cache after too many CEs (Correctable Errors) as a proactive action, I think we are on the same page.

I attached two foils that are part of our Xen summit presentation. Page 1 is mainly about #MC handling; page 2 is about CE handling (through CMCI or polling). Page 1 is described clearly in the patch. Page 2 is what our previous patch did.

Thanks
-- Yunhong Jiang

> 
> Christoph
> 
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632

[-- Attachment #2: MCA.pdf --]
[-- Type: application/pdf, Size: 86134 bytes --]


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-20  2:53                 ` Jiang, Yunhong
@ 2009-02-20 21:01                   ` Frank van der Linden
  2009-02-23  9:01                     ` Jiang, Yunhong
  0 siblings, 1 reply; 45+ messages in thread
From: Frank van der Linden @ 2009-02-20 21:01 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: Gavin Maltby, Christoph Egger, xen-devel, Keir Fraser, Ke, Liping

I had some time to look over the patches in more detail and the previous 
discussions that were referenced.

From your patches, what you write, and your slides, I gather the following:

* Corrected errors (found through polling and CMCI):
   1) Collect error data (telemetry)
   2) Inform dom0 through the VIRQ.

* Uncorrected errors:
   1) See if any immediate action can be taken (CPU offline,
      page retire)
   2) Collect telemetry
   3) Deliver vMCE to dom0 (and possibly domU)
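
As a rough sketch of how the two paths above could be told apart from a bank's status register (not code from the patches): the bit positions below are the architectural MCi_STATUS bits (VAL = 63, UC = 61, PCC = 57), while the enum and the name mca_classify are hypothetical. The reset case reflects the RFC cover letter, which treats pcc = 1 as fatal.

```c
#include <stdint.h>

/* Architectural bits of an MCi_STATUS register. */
#define MCI_STATUS_VAL (1ULL << 63)  /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61)  /* error was not corrected by hardware */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */

/* Hypothetical names for the handling paths discussed in this thread. */
enum mca_path {
    MCA_PATH_NONE,         /* nothing logged in this bank */
    MCA_PATH_CORRECTED,    /* collect telemetry, send VIRQ to dom0 */
    MCA_PATH_UNCORRECTED,  /* try immediate action, telemetry, then vMCE */
    MCA_PATH_RESET         /* context corrupt: treated as fatal */
};

enum mca_path mca_classify(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MCA_PATH_NONE;
    if (!(status & MCI_STATUS_UC))
        return MCA_PATH_CORRECTED;
    if (status & MCI_STATUS_PCC)
        return MCA_PATH_RESET;
    return MCA_PATH_UNCORRECTED;
}
```

A real handler would look at more than these three bits (for example OVER, ADDRV and the error code), so this only shows the top-level split.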


I think it's fine that the hypervisor takes some immediate action in 
some cases. It is good to do this as quickly as possible, and only the 
hypervisor has all the information immediately available.

What would be needed for the Solaris framework, however, is to provide 
information on what action was taken, along with the telemetry. As 
Christoph noted, the Solaris FMA code checks, at bootup, if there were 
components that previously had errors, and if so, it disables them again 
to prevent further errors. To be able to do this, it needs the full 
information not just on the error data, but also on any action taken by 
the hypervisor, so that it can repeat this action. It may take some 
modifications in the FMA code to account for the case where an action 
has already been taken (to avoid trying to take conflicting action), but 
I think that shouldn't be a big problem. Although I don't know that part 
of our code very well.

The part that I still have doubts about, is the vMCE code. As far as I 
can tell, it takes the information out of the MCA banks, and stores it, 
per event, in a linked list. Per vMCE, the head of the list is taken and 
used as an MSR context. The rdmsr instruction is trapped and redirected 
to that information. It seems that the wrmsr instruction is accepted, 
but has no effect (except that if the trap handler writes a value and 
then reads it back again immediately, the values will be the same).
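
To make the behavior described above concrete, here is a minimal C sketch of a virtualized MCA bank context, written for illustration only and not taken from the patches. The MSR layout (bank i starts at 0x400 + 4*i, in the order CTL, STATUS, ADDR, MISC) is architectural; struct vmce_ctx, VMCE_NR_BANKS and the function names are hypothetical. Writes only change the stored copy, so a read-back returns the written value while the hardware is never touched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_MC0_CTL 0x400u  /* bank i registers start at 0x400 + 4*i */
#define VMCE_NR_BANKS    4u      /* hypothetical number of virtual banks */

/* Snapshot of one MCA bank as presented to the guest. */
struct vmce_bank {
    uint64_t ctl, status, addr, misc;
};

/* Hypothetical per-event MSR context (one entry of the event list). */
struct vmce_ctx {
    struct vmce_bank bank[VMCE_NR_BANKS];
};

/* Map an MSR index to the stored value, or NULL if it is not a bank MSR. */
static uint64_t *vmce_reg(struct vmce_ctx *ctx, uint32_t msr)
{
    uint32_t off, b;

    if (msr < MSR_IA32_MC0_CTL)
        return NULL;
    off = msr - MSR_IA32_MC0_CTL;
    b = off / 4;
    if (b >= VMCE_NR_BANKS)
        return NULL;

    switch (off % 4) {
    case 0:  return &ctx->bank[b].ctl;
    case 1:  return &ctx->bank[b].status;
    case 2:  return &ctx->bank[b].addr;
    default: return &ctx->bank[b].misc;
    }
}

/* Trapped rdmsr: serve the value from the stored context. */
bool vmce_rdmsr(struct vmce_ctx *ctx, uint32_t msr, uint64_t *val)
{
    uint64_t *reg = vmce_reg(ctx, msr);

    if (reg == NULL)
        return false;  /* caller would fall back or inject #GP */
    *val = *reg;
    return true;
}

/* Trapped wrmsr: update only the stored copy, never the hardware. */
bool vmce_wrmsr(struct vmce_ctx *ctx, uint32_t msr, uint64_t val)
{
    uint64_t *reg = vmce_reg(ctx, msr);

    if (reg == NULL)
        return false;
    *reg = val;
    return true;
}
```

The global MCG_* registers (0x179 to 0x17b) are deliberately not handled here and are reported as "not a bank MSR".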

The main argument for the vMCE code seems to be that it allows existing 
MCA handlers to be reused. However, I don't see the advantage in this. 
Basically, it allows the handler to retrieve the MCA banks through plain 
rdmsr instructions. Which is fine, but that's as far as it goes. Without 
any additional information, that feature does not seem useful. wrmsr
instructions have no effect.

To take further action, the MCA code in dom0 (or a domU) needs to know 
that it is running under Xen, and it needs to have detailed physical 
information on the system. In other words, the existing code that can be 
used is only the code that gathers some information. So, the only thing 
that vMCE is good for, is that you can run unmodified error logging 
code. But you can't interpret any of the error information further 
without knowing more. Especially for a domU, which might not know 
anything, this doesn't seem useful. What would the user of a domU do 
with that information?

To recap, I think the part where Xen itself takes action is fine, with 
some modifications. But I don't see any advantages in vMCE delivery, 
unless I'm missing something of course..

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-20 21:01                   ` Frank van der Linden
@ 2009-02-23  9:01                     ` Jiang, Yunhong
  2009-02-24 18:53                       ` Frank van der Linden
  0 siblings, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-23  9:01 UTC (permalink / raw)
  To: Frank.Vanderlinden
  Cc: Christoph Egger, xen-devel, Ke, Liping, Gavin Maltby,
	Keir Fraser, Kleen, Andi

Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:
> I had some time to look over the patches in more detail and
> the previous
> discussions that were referenced.
> 
> From your patches, what you write, and your slides, I gather
> the following:
> 
> * Corrected errors (found through polling and CMCI):
>   1) Collected error data (telemetry)
>   2) Inform dom0 through the VIRQ.
> 
> * Uncorrected errors:
>   1) See if any immediate action can be taken (CPU offline,      page
>   retire) 2) Collect telemetry
>   3) Deliver vMCE to dom0 (and possibly domU)

One note: we deliver a vMCE to dom0/domU only when it is impacted. The idea behind this is that the MCE is handled entirely by the Xen HV, while the guest's vMCE handler works only for the guest itself. For example, when a page is broken, Xen will first mark the page offline on the Xen side (i.e. take the recovery action), then inject a vMCE into the corresponding guest (dom0 or domU); the guest will kill the application using the page, free the page, or take further action.

And we always pass the vIRQ to dom0 for logging and telemetry; user space tools can take more proactive action based on this if needed.

> 
> 
> I think it's fine that the hypervisor takes some immediate action in
> some cases. It is good to do this as quickly as possible, and only the
> hypervisor has all the information immediately available.
> 
> What would be needed for the Solaris framework, however, is to provide
> information on what action was taken, along with the telemetry. As

Agreed that this modification is needed. Sorry, we didn't realize this requirement from Dom0 after reboot.

Either we can pass the action in the telemetry, or Dom0 can use an action-specific method, like retrieving the offlined pages from Xen before reboot. If we take the former, we may need an interface definition.

> Christoph noted, the Solaris FMA code checks, at bootup, if there were
> components that previously had errors, and if so, it disables
> them again
> to prevent further errors. To be able to do this, it needs the full
> information not just on the error data, but also on any action
> taken by
> the hypervisor, so that it can repeat this action. It may take some
> modifications in the FMA code to account for the case where an action
> has already been taken (to avoid trying to take conflicting
> action), but
> I think that shouldn't be a big problem. Although I don't know
> that part
> of our code very well.
> 
> The part that I still have doubts about, is the vMCE code. As far as I
> can tell, it takes the information out of the MCA banks, and
> stores it,
> per event, in a linked list. Per vMCE, the head of the list is
> taken and
> used as an MSR context. The rdmsr instruction is trapped and redirected
> to that information. It seems that the wrmsr instruction is accepted,
> but has no effect (except that if the trap handler writes a value and
> then reads it back again immediately, the values will be the same).
> The main argument for the vMCE code seems to be that it allows existing
> MCA handlers to be reused. However, I don't see the advantage in this.
> Basically, it allows the handler to retrieve the MCA banks
> through plain
> rdmsr instructions. Which is fine, but that's as far as it
> goes. Without
> any additional information, that feature does not seem useful. wrmsr
> instructions has no effect. 

What do you mean by the effect of the wrmsr instruction? We need to consider injecting #GP on an invalid wrmsr, or removing the event when the guest clears the MCi_STATUS_MCA, if needed. We sent this RFC early to get feedback on the design idea first.
Or do you mean more than this for the wrmsr?

> 
> To take further action, the MCA code in dom0 (or a domU) needs to know
> that it is running under Xen, and it needs to have detailed physical

Our purpose is that the guest has no idea it is running under Xen, as described above. And what information do you think a normal guest's MCA handler needs to know, and how would it use the detailed physical information? After all, a guest cares only about itself. Also, maybe we can't provide a PV handler for all guests (like Windows).

Dom0 is a special case: its vIRQ handler knows it is running under Xen, but that is for log/telemetry and for proactive action.

> information on the system. In other words, the existing code
> that can be

What do you mean by "existing": our patch or the current Xen implementation?

> used is only the code that gathers some information. So, the
> only thing
> that vMCE is good for, is that you can run unmodified error logging
> code. But you can't interpret any of the error information further
> without knowing more. Especially for a domU, which might not know
> anything, this doesn't seem useful. What would the user of a domU do with
> that information? 
> To recap, I think the part where Xen itself takes action is fine, with
> some modifications. But I don't see any advantages in vMCE delivery,
> unless I'm missing something of course..

I think the main advantages are:
a) We don't need to maintain a PV MCA handler for guests, especially for HVM guests.
b) We can benefit from improvements/enhancements to the guest's MCA handling.
c) Applying this to dom0, we don't need different mechanisms for dom0 and HVM.

Thanks
Yunhong Jiang

> 
> - Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-23  9:01                     ` Jiang, Yunhong
@ 2009-02-24 18:53                       ` Frank van der Linden
       [not found]                         ` <2E9E6F5F5978EF44A8590E339E888CF988279945@irsmsx503.ger.corp.intel.com>
  0 siblings, 1 reply; 45+ messages in thread
From: Frank van der Linden @ 2009-02-24 18:53 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: Christoph Egger, xen-devel, Ke, Liping, Gavin Maltby,
	Keir Fraser, Kleen, Andi

Thanks for your reply. Let me explain my comments a little:

Jiang, Yunhong wrote:
> 
> One note: we deliver a vMCE to dom0/domU only when it is impacted. The idea behind this is that the MCE is handled entirely by the Xen HV, while the guest's vMCE handler works only for the guest itself. For example, when a page is broken, Xen will first mark the page offline on the Xen side (i.e. take the recovery action), then inject a vMCE into the corresponding guest (dom0 or domU); the guest will kill the application using the page, free the page, or take further action.
> 
> And we always pass the vIRQ to dom0 for logging and telemetry; user space tools can take more proactive action based on this if needed.

I understand this part, and have no problems with the mechanism itself.
I think it has advantages over the original concept, where dom0 informs
domUs. My question is: what useful action can a domU take without fully
knowing the physical system? I'll go more into that below.

>> What would be needed for the Solaris framework, however, is to provide
>> information on what action was taken, along with the telemetry. As
> 
> Agreed that this modification is needed. Sorry, we didn't realize this requirement from Dom0 after reboot.
> 
> Either we can pass the action in the telemetry, or Dom0 can use an action-specific method, like retrieving the offlined pages from Xen before reboot. If we take the former, we may need an interface definition.

Passing the action along with the telemetry seems the best way to go to 
me. Since the telemetry is used to determine which action to take, any 
information on actions already taken should come at the same time.

> 
> What do you mean by the effect of the wrmsr instruction? We need to consider injecting #GP on an invalid wrmsr, or removing the event when the guest clears the MCi_STATUS_MCA, if needed. We sent this RFC early to get feedback on the design idea first.
> Or do you mean more than this for the wrmsr?
> 
>> To take further action, the MCA code in dom0 (or a domU) needs to know
>> that it is running under Xen, and it needs to have detailed physical
> 
> Our purpose is that the guest has no idea it is running under Xen, as described above. And what information do you think a normal guest's MCA handler needs to know, and how would it use the detailed physical information? After all, a guest cares only about itself. Also, maybe we can't provide a PV handler for all guests (like Windows).
> 
> Dom0 is a special case: its vIRQ handler knows it is running under Xen, but that is for log/telemetry and for proactive action.
> 
>> information on the system. In other words, the existing code
>> that can be
> 
> What do you mean by "existing": our patch or the current Xen implementation?
> 
>> used is only the code that gathers some information. So, the
>> only thing
>> that vMCE is good for, is that you can run unmodified error logging
>> code. But you can't interpret any of the error information further
>> without knowing more. Especially for a domU, which might not know
>> anything, this doesn't seem useful. What would the user of a domU do with
>> that information? 
>> To recap, I think the part where Xen itself takes action is fine, with
>> some modifications. But I don't see any advantages in vMCE delivery,
>> unless I'm missing something of course..
> 
> I think the main advantages are:
> a) We don't need to maintain a PV MCA handler for guests, especially for HVM guests.
> b) We can benefit from improvements/enhancements to the guest's MCA handling.
> c) Applying this to dom0, we don't need different mechanisms for dom0 and HVM.

Ok, my main issue here is: if you want to enable a guest to run
unmodified MCA code (which you state as a goal, and as an advantage of
the vMCE approach), then what can the guest actually do? Or dom0,
for that matter?

MCA information is highly specific to the hardware. Without additional 
information on the hardware, it is hard, or even impossible, for the 
unmodified MCA handler in dom0 or a domU to do anything useful. It will 
interpret the information to fit the virtualized environment it is in, 
which doesn't match the reality of the hardware at all. So what can it 
do? It can just read the MSRs and log the information, but even that 
information wouldn't be useful; it is already available to dom0, where 
the code and/or person who can make sense of the data will see it. The 
unmodified MCA handler also can't take any corrective action; it might 
think that it is taking action, but in fact, its wrmsr instructions have 
no effect (and they shouldn't, guests should definitely not be able to 
do MSR writes).

I only see one possible exception to this: if you translate the ADDR MSR 
of a bank to a guest address in the vmca info before delivering the 
vMCE, then the guest could do something useful, because its virtualized 
MSR reads would then produce a guest address, and it could do something 
useful with it. But currently, your code doesn't seem to do this; the
virtualized MSR will produce the machine address, which the guest can't
do anything with, unless it knows it's running under Xen.

So that's my main problem here: there is a contradiction. The vMCE 
mechanism as you implement it enables guests to run an unmodified MCA 
handler, but there isn't actually much that the guest can do with that, 
without knowing it runs under Xen. I see only one specific use for this: 
if you translate the ADDR info to a guest address, it could potentially 
try to do a "local" page retire.
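
To make the ADDR translation concrete, here is a minimal sketch in Python. It is not Xen code: the function name `translate_addr_for_guest` and the `m2p` and `owner_of_mfn` tables are hypothetical stand-ins for whatever machine-to-guest mapping the hypervisor keeps, and 4 KiB pages are assumed.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate_addr_for_guest(machine_addr, m2p, owner_of_mfn, domid):
    """Translate a machine address from a bank's ADDR MSR to a guest address.

    m2p          -- hypothetical dict: machine frame number -> guest frame number
    owner_of_mfn -- hypothetical dict: machine frame number -> owning domain id
    Returns None when the page does not belong to `domid`, in which case
    nothing about this address should be shown to that guest.
    """
    mfn = machine_addr >> PAGE_SHIFT
    if owner_of_mfn.get(mfn) != domid:
        return None
    gfn = m2p.get(mfn)
    if gfn is None:
        return None
    # Keep the offset within the page; only the frame number changes.
    return (gfn << PAGE_SHIFT) | (machine_addr & PAGE_MASK)
```

With a translated value in the virtual ADDR MSR, an unmodified handler would at least be looking at an address in its own address space, which is the precondition for the "local" page retire mentioned above.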

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
       [not found]                         ` <2E9E6F5F5978EF44A8590E339E888CF988279945@irsmsx503.ger.corp.intel.com>
@ 2009-02-24 19:07                           ` Frank van der Linden
  2009-02-25  2:26                             ` Jiang, Yunhong
                                               ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Frank van der Linden @ 2009-02-24 19:07 UTC (permalink / raw)
  To: Kleen, Andi
  Cc: Christoph Egger, xen-devel, Jiang, Yunhong, Ke, Liping,
	Gavin Maltby, Keir Fraser

Kleen, Andi wrote:
>> MCA information is highly specific to the hardware.
> 
> Actually Intel has architectural machine checks and except for
> some optional addon information explicitly marked it's all architectural
> (as in defined to stay the same going forward)

True, I probably expressed myself poorly here. I meant to say: it's a 
physical hardware error, and in an unmodified virtualized environment 
the information about the physical hardware isn't there.

> For DomU translation of the address is needed, that's correct.
> For Dom0 logging physical is good because the logging tools
> might need that.

Right. As far as I understand it, this patch proposes to deliver the 
actual physical information to dom0 via the existing vIRQ mechanism, 
while the vMCE mechanism delivers virtualized info to any guest (both 
dom0 and domU).

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
       [not found]                             ` <2E9E6F5F5978EF44A8590E339E888CF98827996D@irsmsx503.ger.corp.intel.com>
@ 2009-02-24 20:47                               ` Frank van der Linden
  2009-02-25  2:25                                 ` Jiang, Yunhong
  2009-02-25  2:31                               ` Jiang, Yunhong
  2009-02-25 10:57                               ` Christoph Egger
  2 siblings, 1 reply; 45+ messages in thread
From: Frank van der Linden @ 2009-02-24 20:47 UTC (permalink / raw)
  To: Kleen, Andi
  Cc: Christoph Egger, xen-devel, Jiang, Yunhong, Ke, Liping,
	Gavin Maltby, Keir Fraser

Kleen, Andi wrote:
>> Kleen, Andi wrote:
> 
> So it's generally better to inject generic events, not just blindly forward.
> 

Agreed. I can see advantages to the vMCE code, but it has to deliver 
something to the domU that makes it do something reasonable. That's why 
I have some doubts about the patch that was sent, it doesn't quite seem 
to achieve that (certainly not without translating the address).

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-24 20:47                               ` Frank van der Linden
@ 2009-02-25  2:25                                 ` Jiang, Yunhong
  2009-02-25 12:19                                   ` Christoph Egger
  0 siblings, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-25  2:25 UTC (permalink / raw)
  To: Frank.Vanderlinden, Kleen, Andi
  Cc: Gavin Maltby, Christoph Egger, xen-devel, Keir Fraser, Ke, Liping

Frank.Vanderlinden@Sun.COM wrote:
> Kleen, Andi wrote:
>>> Kleen, Andi wrote:
>> 
>> So it's generally better to inject generic events, not just blindly
>> forward. 
>> 
> 
> Agreed. I can see advantages to the vMCE code, but it has to deliver
> something to the domU that makes it do something reasonable. That's why
> I have some doubts about the patch that was sent, it doesn't quite seem
> to achieve that (certainly not without translating the address).
> 
> - Frank

Yes, we should have included the translation. We left it out when sending the patch because we thought a PV guest already knows about the m2p translation. After more consideration we realized the translation is also needed for PV guests, since the unmodified #MC handler will use guest addresses. Of course we always need the translation for HVM guests, but that is outside this patch's scope. Sorry for any confusion caused.

One thing to note: the information passed through the vIRQ is physical information, while dom0's MCA handler will get guest information, so user-space tools should be aware of this difference.

So, Frank/Egger, can I assume the following is the current consensus?

1) MCE is handled entirely by the Xen HV, while a guest's vMCE handler works only for the guest itself.
2) Xen presents a virtual #MC to the guest through MSR access emulation (Xen will do the translation if needed).
3) The guest's unmodified MCE handler will handle the injected vMCE.
4) Dom0 will get all log/telemetry through a hypercall.
5) The action taken by Xen will be passed to dom0 through the telemetry mechanism.
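
As an illustration of point 2, the following Python sketch models per-domain virtual MCA banks. It is only a sketch under stated assumptions, not the patch's data structures: the class `VirtualMcaBanks` and its methods are hypothetical. The MSR numbering (bank i at 0x400 + 4*i, in the order CTL, STATUS, ADDR, MISC) and the STATUS bits VAL (63) and ADDRV (58) are the architectural ones.

```python
MSR_IA32_MC0_CTL = 0x400          # bank i: 0x400 + 4*i + {0: CTL, 1: STATUS, 2: ADDR, 3: MISC}
MCI_STATUS_VAL = 1 << 63          # register contents are valid
MCI_STATUS_ADDRV = 1 << 58        # the ADDR register holds a valid address


class VirtualMcaBanks:
    """Hypothetical per-domain copy of the MCA bank MSRs."""

    def __init__(self, nr_banks):
        self.nr_banks = nr_banks
        self.regs = {}            # MSR number -> virtual value

    def _check(self, msr):
        if not MSR_IA32_MC0_CTL <= msr < MSR_IA32_MC0_CTL + 4 * self.nr_banks:
            raise ValueError("not a virtual MCA bank MSR: %#x" % msr)

    def inject(self, bank, status, guest_addr):
        # Called by the hypervisor side before delivering the vMCE.
        base = MSR_IA32_MC0_CTL + 4 * bank
        self.regs[base + 1] = status
        self.regs[base + 2] = guest_addr      # already translated

    def rdmsr(self, msr):
        # Guest reads only ever see the virtual copy.
        self._check(msr)
        return self.regs.get(msr, 0)

    def wrmsr(self, msr, value):
        # Guest writes change the virtual copy only, never the hardware MSR.
        self._check(msr)
        self.regs[msr] = value
```

The design choice this shows is that the guest's handler can read and clear "its" banks as it would on bare metal, while the physical MSRs stay under Xen's control.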

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-24 19:07                           ` Frank van der Linden
@ 2009-02-25  2:26                             ` Jiang, Yunhong
  2009-02-25 10:37                             ` Christoph Egger
       [not found]                             ` <2E9E6F5F5978EF44A8590E339E888CF98827996D@irsmsx503.ger.corp.intel.com>
  2 siblings, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-25  2:26 UTC (permalink / raw)
  To: Frank van der Linden, Kleen, Andi
  Cc: Christoph Egger, xen-devel, Ke, Liping, Gavin Maltby, Keir Fraser

> Right. As far as I understand it, this patch proposes to deliver the
> actual physical information to dom0 via the existing vIRQ mechanism,
> while the vMCE mechanism delivers virtualized info to any guest (both dom0
> and domU). 

Yes, exactly.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
       [not found]                             ` <2E9E6F5F5978EF44A8590E339E888CF98827996D@irsmsx503.ger.corp.intel.com>
  2009-02-24 20:47                               ` Frank van der Linden
@ 2009-02-25  2:31                               ` Jiang, Yunhong
  2009-02-25 10:57                               ` Christoph Egger
  2 siblings, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-25  2:31 UTC (permalink / raw)
  To: Kleen, Andi, Frank.Vanderlinden
  Cc: Gavin Maltby, Christoph Egger, xen-devel, Keir Fraser, Ke, Liping

> That's needed anyways for example to support migration between different
> types of CPUs. The DomU really cannot take a specific CPU type
> for granted or rather has to assume some fallback CPU. Also for
> virtualization it's a common case that guests run very old OS, so it's
> better to give them the oldest possible events too.
>
> So it's generally better to inject generic events, not just blindly
> forward.

Andi, what do you mean by "generic event"? Do you mean option 3, i.e. some abstract event such as "page offline" or "kill current execution"?
Or do you mean translating the physical MSR value to a guest-aware MSR value?

Thanks
Yunhong Jiang

> 
> Only for Dom0 which does logging the physical hardware needs to be
> described correctly.
> 
> -Andi

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-24 19:07                           ` Frank van der Linden
  2009-02-25  2:26                             ` Jiang, Yunhong
@ 2009-02-25 10:37                             ` Christoph Egger
       [not found]                             ` <2E9E6F5F5978EF44A8590E339E888CF98827996D@irsmsx503.ger.corp.intel.com>
  2 siblings, 0 replies; 45+ messages in thread
From: Christoph Egger @ 2009-02-25 10:37 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: xen-devel, Jiang, Yunhong, Ke, Liping, Gavin Maltby, Keir Fraser,
	Kleen, Andi

On Tuesday 24 February 2009 20:07:16 Frank van der Linden wrote:
> Kleen, Andi wrote:
> >> MCA information is highly specific to the hardware.
> >
> > Actually Intel has architectural machine checks and except for
> > some optional addon information explicitly marked it's all architectural
> > (as in defined to stay the same going forward)
>
> True, I probably expressed myself poorly here. I meant to say: it's a
> physical hardware error, and in an unmodified virtualized environment
> the information about the physical hardware isn't there.
>
> > For DomU translation of the address is needed, that's correct.
> > For Dom0 logging physical is good because the logging tools
> > might need that.
>
> Right. As far as I understand it, this patch proposes to deliver the
> actual physical information to dom0 via the existing vIRQ mechanism,
> while the vMCE mechanism delivers virtualized info to any guest (both
> dom0 and domU).

The translation is still problematic: what if an error occurs which impacts
multiple physically contiguous pages? Translated into guest-physical
address space, they may be non-contiguous.

That's why the original design does not support HVM guests unless they
are aware of running under Xen via a PV machine check driver.

Christoph

-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
       [not found]                             ` <2E9E6F5F5978EF44A8590E339E888CF98827996D@irsmsx503.ger.corp.intel.com>
  2009-02-24 20:47                               ` Frank van der Linden
  2009-02-25  2:31                               ` Jiang, Yunhong
@ 2009-02-25 10:57                               ` Christoph Egger
  2 siblings, 0 replies; 45+ messages in thread
From: Christoph Egger @ 2009-02-25 10:57 UTC (permalink / raw)
  To: Kleen, Andi
  Cc: xen-devel, Frank.Vanderlinden@Sun.COM, Jiang, Yunhong, Ke,
	Liping, Gavin Maltby, Keir Fraser

On Tuesday 24 February 2009 21:33:47 Kleen, Andi wrote:
> >Kleen, Andi wrote:
> >>> MCA information is highly specific to the hardware.
> >>
> >> Actually Intel has architectural machine checks and except for
> >> some optional addon information explicitly marked it's all
> >> architectural (as in defined to stay the same going forward)
> >
> >True, I probably expressed myself poorly here. I meant to say: it's a
> >physical hardware error, and in an unmodified virtualized environment
> >the information about the physical hardware isn't there.
>
> In a DomU it's not important that the physical hardware is correctly
> described, the only thing that matters is that the event triggers
> the DomU code to do the expected action.

I agree with that. The DomU sees a hw environment which may
(partially) match the physical hardware. The physical machine check error
must be translated in a way that fits into the guest's hw environment.
This is not just limited to the memory layout.

An example to clarify the point
(which actually won't apply directly to Xen, but you should get the idea):

The guest hw environment is an (emulated) sparc CPU, memory
and PCI devices. The host's hw environment is an x86 PC. Now a
machine check error occurs. If you want to forward it into the guest,
you must translate it into the form the guest OS would expect from
a native sparc machine.


-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-25  2:25                                 ` Jiang, Yunhong
@ 2009-02-25 12:19                                   ` Christoph Egger
  2009-02-25 17:32                                     ` Frank van der Linden
  2009-02-25 22:30                                     ` Gavin Maltby
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Egger @ 2009-02-25 12:19 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi

On Wednesday 25 February 2009 03:25:12 Jiang, Yunhong wrote:
> Frank.Vanderlinden@Sun.COM wrote:
> > Kleen, Andi wrote:
> >>> Kleen, Andi wrote:
> >>
> >> So it's generally better to inject generic events, not just blindly
> >> forward.
> >
> > Agreed. I can see advantages to the vMCE code, but it has to deliver
> > something to the domU that makes it do something reasonable. That's why
> > I have some doubts about the patch that was sent, it doesn't quite seem
> > to achieve that (certainly not without translating the address).
> >
> > - Frank
>
> Yes, we should have included the translation. We left it out when sending
> the patch because we thought a PV guest already knows about the m2p
> translation. After more consideration we realized the translation is also
> needed for PV guests, since the unmodified #MC handler will use guest
> addresses. Of course we always need the translation for HVM guests, but
> that is outside this patch's scope. Sorry for any confusion caused.
>
> One thing to note: the information passed through the vIRQ is physical
> information, while dom0's MCA handler will get guest information, so
> user-space tools should be aware of this difference.
>
> So, Frank/Egger, can I assume the following is the current consensus?
>
> 1) MCE is handled entirely by the Xen HV, while a guest's vMCE handler
> works only for the guest itself.
> 2) Xen presents a virtual #MC to the guest through MSR access
> emulation (Xen will do the translation if needed).
> 3) The guest's unmodified MCE handler will handle the injected vMCE.
> 4) Dom0 will get all log/telemetry through a hypercall.
> 5) The action taken by Xen will be passed to dom0 through the telemetry
> mechanism.

Mostly. Regarding 2), I would first like to discuss how to handle errors
impacting multiple contiguous physical pages which are non-contiguous
in guest physical space.

I also want to discuss how to do recovery actions requiring
PCI access. One example of this is
Shanghai's "L3 Cache Index Disable" feature.
Xen delegates PCI config space to Dom0 and,
via PCI passthrough, partly to DomU.
That means, if registers in PCI config space are independently
accessible by Xen, Dom0 and/or DomU, they can interfere with each other.
Therefore, we need to
a) clearly define who handles what,
b) define some rules based on a), and
c) discuss how to handle a Dom0/DomU going wild
    and breaking the rules defined in b).


Christoph


-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-25 12:19                                   ` Christoph Egger
@ 2009-02-25 17:32                                     ` Frank van der Linden
  2009-02-26  2:16                                       ` Jiang, Yunhong
  2009-03-02  5:51                                       ` Jiang, Yunhong
  2009-02-25 22:30                                     ` Gavin Maltby
  1 sibling, 2 replies; 45+ messages in thread
From: Frank van der Linden @ 2009-02-25 17:32 UTC (permalink / raw)
  To: Christoph Egger
  Cc: xen-devel, Jiang, Yunhong, Ke, Liping, Gavin Maltby, Keir Fraser,
	Kleen, Andi

Christoph Egger wrote:
> On Wednesday 25 February 2009 03:25:12 Jiang, Yunhong wrote:
>
>> So, Frank/Egger, can I assume the following is the current consensus?
>>
>> 1) MCE is handled entirely by the Xen HV, while a guest's vMCE handler
>> works only for the guest itself.
>> 2) Xen presents a virtual #MC to the guest through MSR access
>> emulation (Xen will do the translation if needed).
>> 3) The guest's unmodified MCE handler will handle the injected vMCE.
>> 4) Dom0 will get all log/telemetry through a hypercall.
>> 5) The action taken by Xen will be passed to dom0 through the telemetry
>> mechanism.
> 
> Mostly. Regarding 2), I would first like to discuss how to handle errors
> impacting multiple contiguous physical pages which are non-contiguous
> in guest physical space.
>
> I also want to discuss how to do recovery actions requiring
> PCI access. One example of this is
> Shanghai's "L3 Cache Index Disable" feature.
> Xen delegates PCI config space to Dom0 and,
> via PCI passthrough, partly to DomU.
> That means, if registers in PCI config space are independently
> accessible by Xen, Dom0 and/or DomU, they can interfere with each other.
> Therefore, we need to
> a) clearly define who handles what,
> b) define some rules based on a), and
> c) discuss how to handle a Dom0/DomU going wild
>     and breaking the rules defined in b).

I also agree on the approach in principle, but would like to see these 
points addressed. For non-contiguous pages, I suppose Xen could deliver 
multiple #vMCEs to the guest, split into contiguous parts. The vmce code 
seems to be set up to be able to do this.
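
A minimal sketch of the splitting step in Python, assuming a hypothetical `m2p` table (machine frame number to guest frame number) and hypothetical function name `split_into_guest_runs`; each returned run would correspond to one injected #vMCE:

```python
def split_into_guest_runs(mfns, m2p):
    """Group machine frames into runs that are contiguous in guest space.

    mfns -- machine frame numbers affected by the error, in ascending order
    m2p  -- hypothetical dict: machine frame number -> guest frame number
    Returns a list of (first_gfn, page_count) tuples.
    """
    runs = []
    for mfn in mfns:
        gfn = m2p[mfn]
        if runs and gfn == runs[-1][0] + runs[-1][1]:
            # This page directly follows the previous run in guest space.
            first, count = runs[-1]
            runs[-1] = (first, count + 1)
        else:
            # Guest-physical discontinuity: start a new run.
            runs.append((gfn, 1))
    return runs
```

For example, five machine-contiguous frames that map to guest frames 7, 8, 40, 41 and 3 would be reported as three separate runs.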

As for the Shanghai feature: Christoph, are there any documents 
available on that feature? What kind of errors are delivered 
(corrected/correctable)?

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-25 12:19                                   ` Christoph Egger
  2009-02-25 17:32                                     ` Frank van der Linden
@ 2009-02-25 22:30                                     ` Gavin Maltby
  1 sibling, 0 replies; 45+ messages in thread
From: Gavin Maltby @ 2009-02-25 22:30 UTC (permalink / raw)
  To: Christoph Egger
  Cc: xen-devel, Jiang, Yunhong, Ke, Liping, Frank.Vanderlinden,
	Keir Fraser, Kleen, Andi

Christoph Egger wrote:

> Mostly. Regarding 2), I would first like to discuss how to handle errors
> impacting multiple contiguous physical pages which are non-contiguous
> in guest physical space.

I can't think of any such error types.  ECC checkwords don't span
page boundaries, so you only ever get one error at a time,
affecting one small part of one page.  If physically adjacent
pages both had errors, that would come out in the wash, but
they'd be processed and recognised individually.
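
The alignment argument can be checked with a few lines of Python. This is only an arithmetic illustration under assumptions: 4 KiB pages, and a naturally aligned protected block whose size is a power of two no larger than a page (64 bytes is used as an example; the real checkword width depends on the memory controller). The function name `pages_touched` is hypothetical.

```python
PAGE_SIZE = 4096     # assumption: 4 KiB pages


def pages_touched(addr, block_size):
    """Return the set of page numbers covered by the aligned block holding addr."""
    start = addr - (addr % block_size)     # round down to the block boundary
    end = start + block_size - 1           # last byte of the block
    return {start // PAGE_SIZE, end // PAGE_SIZE}
```

Because the page size is a multiple of the block size, the first and last byte of any such block always fall in the same page, so a single error report never needs to name more than one page.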

Gavin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-25 17:32                                     ` Frank van der Linden
@ 2009-02-26  2:16                                       ` Jiang, Yunhong
  2009-03-02 14:58                                         ` Christoph Egger
  2009-03-02  5:51                                       ` Jiang, Yunhong
  1 sibling, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-02-26  2:16 UTC (permalink / raw)
  To: Frank.Vanderlinden, Christoph Egger
  Cc: Kleen, Andi, Gavin Maltby, xen-devel, Keir Fraser, Ke, Liping

[-- Attachment #1: Type: text/plain, Size: 3671 bytes --]

Christoph/Frank, thank you very much for the replies; see my comments below.

>-----Original Message-----
>From: Frank.Vanderlinden@Sun.COM [mailto:Frank.Vanderlinden@Sun.COM] 
>Sent: February 26, 2009 1:33
>To: Christoph Egger
>Cc: Jiang, Yunhong; Kleen, Andi; 
>xen-devel@lists.xensource.com; Keir Fraser; Ke, Liping; Gavin Maltby
>Subject: Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
>
>Christoph Egger wrote:
>> On Wednesday 25 February 2009 03:25:12 Jiang, Yunhong wrote:
>>
>>> So, Frank/Egger, can I assume the following is the current consensus?
>>>
>>> 1) MCE is handled entirely by the Xen HV, while a guest's vMCE handler
>>> works only for the guest itself.
>>> 2) Xen presents a virtual #MC to the guest through MSR access
>>> emulation (Xen will do the translation if needed).
>>> 3) The guest's unmodified MCE handler will handle the injected vMCE.
>>> 4) Dom0 will get all log/telemetry through a hypercall.
>>> 5) The action taken by Xen will be passed to dom0 through the telemetry
>>> mechanism.
>>
>> Mostly. Regarding 2), I would first like to discuss how to handle errors
>> impacting multiple contiguous physical pages which are non-contiguous
>> in guest physical space.


>> 
>> I also want to discuss how to do recovery actions requiring
>> PCI access. One example of this is
>> Shanghai's "L3 Cache Index Disable" feature.
>> Xen delegates PCI config space to Dom0 and,
>> via PCI passthrough, partly to DomU.
>> That means, if registers in PCI config space are independently
>> accessible by Xen, Dom0 and/or DomU, they can interfere with each other.
>> Therefore, we need to
>> a) clearly define who handles what,
>> b) define some rules based on a), and
>> c) discuss how to handle a Dom0/DomU going wild
>>     and breaking the rules defined in b).
>
>I also agree on the approach in principle, but would like to see these
>points addressed. For non-contiguous pages, I suppose Xen could deliver
>multiple #vMCEs to the guest, split into contiguous parts. The vmce code
>seems to be set up to be able to do this.

For the contiguous pages, I agree with Gavin that an error affecting contiguous pages would be raised as multiple #MCs, so that case is OK.

For the PCI config space issue: Christoph, can you please share more information on it (or provide some documentation, as Frank suggested)? For example, is it for CE (correctable errors) or UC (uncorrectable errors), is the register in the PCI range or the PCI-E range (i.e. accessed through 0xCF8/0xCFC or through MMCONFIG), how is the device's BDF calculated, etc. My current understanding follows.

First, if it is a CE, Xen will do nothing and dom0 will take the recovery action. If it is a UC, Xen will take action once all CPUs are in softirq context, and dom0 will not take action, so it should be OK.

Second, in a Xen environment, as I understand it, the CPU is owned by the Xen HV, so I'm not sure whether Xen should be aware when dom0 disables L3 cache (in the CE case). That is, should dom0 disable the cache directly, or should it use a hypercall to ask Xen to do that? Keir may be able to advise.

For item c), currently both Xen and dom0 can access configuration space, while a domU does so through the PCI frontend/backend. Because the PCI backend only covers devices assigned to the domU, we don't need to worry about domU, and dom0 should be trusted. However, one thing left is that if this range is beyond 0x100 (i.e. in the PCI-E range), we need to add MMCONFIG support in Xen, although that is simple to add.
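
To illustrate the 0xCF8/0xCFC versus MMCONFIG distinction, here is a small Python sketch of the two standard address encodings. The function names are hypothetical, and the bus/device/function and register values in the usage note are arbitrary illustration values, not the location of the Shanghai register.

```python
def cf8_address(bus, dev, fn, reg):
    """Value written to I/O port 0xCF8 to select a legacy config register."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8):
        raise ValueError("bad bus/device/function")
    if not 0 <= reg < 0x100:
        # The legacy mechanism only reaches the first 256 bytes.
        raise ValueError("register offset needs MMCONFIG")
    # Bit 31 is the enable bit; the register offset is dword aligned.
    return 0x80000000 | (bus << 16) | (dev << 11) | (fn << 8) | (reg & 0xFC)


def mmconfig_offset(bus, dev, fn, reg):
    """Byte offset of a config register from the MMCONFIG base address."""
    if not (0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8):
        raise ValueError("bad bus/device/function")
    if not 0 <= reg < 0x1000:
        raise ValueError("register offset outside 4 KiB config space")
    return (bus << 20) | (dev << 15) | (fn << 12) | reg
```

For instance, register 0x44 of bus 0, device 0x18, function 3 is reachable either way, while an offset such as 0x1C4 on the same function is only reachable through MMCONFIG, which is the case that would need the extra support in Xen.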

Thanks
-- Yunhong Jiang

>
>As for the Shanghai feature: Christoph, are there any documents 
>available on that feature? What kind of errors are delivered 
>(corrected/correctable)?
>
>- Frank
>

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-25 17:32                                     ` Frank van der Linden
  2009-02-26  2:16                                       ` Jiang, Yunhong
@ 2009-03-02  5:51                                       ` Jiang, Yunhong
  2009-03-02 14:51                                         ` Christoph Egger
  2009-03-02 17:47                                         ` Frank van der Linden
  1 sibling, 2 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-02  5:51 UTC (permalink / raw)
  To: Jiang, Yunhong, Frank.Vanderlinden, Christoph Egger
  Cc: Kleen, Andi, Gavin Maltby, xen-devel, Keir Fraser, Ke, Liping

[-- Attachment #1: Type: text/plain, Size: 4020 bytes --]

Frank/Christoph, could you please comment further on this, or are you OK with it?
For the action reporting mechanism, we will send out a proposal for review soon.

Thanks
Yunhong Jiang

Jiang, Yunhong wrote:
> Christoph/Frank, thank you very much for the replies; see my comments below.
> 
>> -----Original Message-----
>> From: Frank.Vanderlinden@Sun.COM [mailto:Frank.Vanderlinden@Sun.COM]
>> Sent: February 26, 2009 1:33
>> To: Christoph Egger
>> Cc: Jiang, Yunhong; Kleen, Andi;
>> xen-devel@lists.xensource.com; Keir Fraser; Ke, Liping; Gavin Maltby
>> Subject: Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
>> 
>> Christoph Egger wrote:
>>> On Wednesday 25 February 2009 03:25:12 Jiang, Yunhong wrote:
>>> 
>>>> So, Frank/Egger, can I assume the following is the current consensus?
>>>>
>>>> 1) MCE is handled entirely by the Xen HV, while a guest's vMCE handler
>>>> works only for the guest itself.
>>>> 2) Xen presents a virtual #MC to the guest through MSR access
>>>> emulation (Xen will do the translation if needed).
>>>> 3) The guest's unmodified MCE handler will handle the injected vMCE.
>>>> 4) Dom0 will get all log/telemetry through a hypercall.
>>>> 5) The action taken by Xen will be passed to dom0 through the telemetry
>>>> mechanism.
>>>
>>> Mostly. Regarding 2), I would first like to discuss how to handle errors
>>> impacting multiple contiguous physical pages which are non-contiguous
>>> in guest physical space.
> 
> 
>>> 
>>> I also want to discuss how to do recovery actions requiring
>>> PCI access. One example of this is
>>> Shanghai's "L3 Cache Index Disable" feature.
>>> Xen delegates PCI config space to Dom0 and,
>>> via PCI passthrough, partly to DomU.
>>> That means, if registers in PCI config space are independently
>>> accessible by Xen, Dom0 and/or DomU, they can interfere with each other.
>>> Therefore, we need to
>>> a) clearly define who handles what,
>>> b) define some rules based on a), and
>>> c) discuss how to handle a Dom0/DomU going wild
>>>     and breaking the rules defined in b).
>> 
>> I also agree on the approach in principle, but would like to see these
>> points addressed. For non-contiguous pages, I suppose Xen could deliver
>> multiple #vMCEs to the guest, split into contiguous parts. The vmce code
>> seems to be set up to be able to do this.
> 
> For the contiguous pages, I agree with Gavin that an error affecting
> contiguous pages would be raised as multiple #MCs, so that case is OK.
>
> For the PCI config space issue: Christoph, can you please share
> more information on it (or provide some documentation, as Frank
> suggested)? For example, is it for CE (correctable errors) or
> UC (uncorrectable errors), is the register in the PCI range or the
> PCI-E range (i.e. accessed through 0xCF8/0xCFC or through MMCONFIG),
> how is the device's BDF calculated, etc. My current understanding follows.
>
> First, if it is a CE, Xen will do nothing and dom0 will take the
> recovery action. If it is a UC, Xen will take action once all
> CPUs are in softirq context, and dom0 will not take action, so
> it should be OK.
>
> Second, in a Xen environment, as I understand it, the CPU is
> owned by the Xen HV, so I'm not sure whether Xen should be aware
> when dom0 disables L3 cache (in the CE case). That is, should
> dom0 disable the cache directly, or should it use a hypercall
> to ask Xen to do that? Keir may be able to advise.
>
> For item c), currently both Xen and dom0 can access configuration
> space, while a domU does so through the PCI frontend/backend.
> Because the PCI backend only covers devices assigned to the domU, we
> don't need to worry about domU, and dom0 should be trusted.
> However, one thing left is that if this range is beyond 0x100
> (i.e. in the PCI-E range), we need to add MMCONFIG support in Xen,
> although that is simple to add.
> 
> Thanks
> -- Yunhong Jiang
> 
>> 
>> As for the Shanghai feature: Christoph, are there any documents
>> available on that feature? What kind of errors are delivered
>> (corrected/correctable)? 
>> 
>> - Frank


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-02  5:51                                       ` Jiang, Yunhong
@ 2009-03-02 14:51                                         ` Christoph Egger
  2009-03-02 16:09                                           ` Jiang, Yunhong
  2009-03-02 17:47                                         ` Frank van der Linden
  1 sibling, 1 reply; 45+ messages in thread
From: Christoph Egger @ 2009-03-02 14:51 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi

On Monday 02 March 2009 06:51:22 Jiang, Yunhong wrote:
> Frank/Christopher, can you please give more comments for it, or you are OK

Sorry for the delay; I'm also busy with other tasks.

> with this? For the action reporting mechanism, we will send out a proposal
> for review soon.

I would like to see an interface definition first, one that covers all the
aspects we discussed.



>
> Thanks
> Yunhong Jiang
>
> Jiang, Yunhong <> wrote:
> > Christopher/Frank, thanks for reply very much, see comments below.
> >
> >> -----Original Message-----
> >> From: Frank.Vanderlinden@Sun.COM [mailto:Frank.Vanderlinden@Sun.COM]
> >> Sent: February 26, 2009 1:33
> >> To: Christoph Egger
> >> Cc: Jiang, Yunhong; Kleen, Andi;
> >> xen-devel@lists.xensource.com; Keir Fraser; Ke, Liping; Gavin Maltby
> >> Subject: Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
> >>
> >> Christoph Egger wrote:
> >>> On Wednesday 25 February 2009 03:25:12 Jiang, Yunhong wrote:
> >>>> So, Frank/Egger, can I assume the following is the current consensus?
> >>>>
> >>>> 1) MCE is handled entirely by the Xen HV, while a guest's vMCE
> >>>> handler works only for the guest itself.
> >>>> 2) Xen presents a virtual #MC to the guest through MSR access
> >>>> emulation (Xen will do the translation if needed).
> >>>> 3) The guest's unmodified MCE handler will handle the injected vMCE.
> >>>> 4) Dom0 will get all log/telemetry through a hypercall.
> >>>> 5) The action taken by Xen will be passed to dom0 through the
> >>>> telemetry mechanism.
> >>>
> >>> Mostly. Regarding 2), I would first like to discuss how to handle
> >>> errors impacting multiple contiguous physical pages which are
> >>> non-contiguous in guest physical space.
> >>>
> >>>
> >>>
> >>> I also want to discuss how to do recovery actions requiring
> >>> PCI access. One example of this is
> >>> Shanghai's "L3 Cache Index Disable" feature.
> >>> Xen delegates PCI config space to Dom0 and,
> >>> via PCI passthrough, partly to DomU.
> >>> That means, if registers in PCI config space are independently
> >>> accessible by Xen, Dom0 and/or DomU, they can interfere with each
> >>> other. Therefore, we need to
> >>> a) clearly define who handles what,
> >>> b) define some rules based on a), and
> >>> c) discuss how to handle a Dom0/DomU going wild
> >>>     and breaking the rules defined in b).
> >>
> >> I also agree on the approach in principle, but would like to see these
> >> points addressed. For non-contiguous pages, I suppose Xen
> >> could deliver
> >> multiple #vMCEs to the guest, split into contiguous parts. The
> >> vmce code
> >> seems to be set up to be able to do this.
> >
> > For the contigous pages, I agree with Gavin that such
> > contiguous page error should be triggered as multiple #MC and so is ok.
> >
> > For PCI config space issue, Christoph, can you please share
> > more information on it (or provide some document as Frank
> > suggested), like is it for CE (Correctable error or
> > UC(UnCorrectable error), is it in PCI range or PCI-E range
> > (i.e. through 0xCF8/CFC or through MMCONFIG), how the device's
> > BDF caculated etc. Followed is some of my understanding.
> >
> > Firstly, if it is CE, Xen will do nothing and dom0 will take
> > recovery action. If it is UC, Xen will take action when all
> > CPU is in SoftIRQ context, and dom0 will not take action, so
> > it should be ok.
> >
> > Secondly, in Xen environment, per my understanding, CPU is
> > owned by Xen HV, so I'm not sure when dom0 disable L3 cache
> > (if it is CE), should Xen be aware or not. That is, should
> > dom0 disable the cache directly, or it should user hypercall
> > to ask Xen do that. Keir can give us more suggestion.
> >
> > For item C, currently Xen/dom0 can both access configuration
> > space, while domU will do that through PCI_frontend/backend.
> > Because PCI backend only cover device assigned to domU, so we
> > don't need worry about domU and dom0 should be trusted.
> > However, one thing left is, if this range is beyond 0x100
> > (i.e. in pci-e range), we need add mmconfig support in Xen,
> > although it can be added simply.
> >
> > Thanks
> > -- Yunhong Jiang
> >
> >> As for the Shanghai feature: Christoph, are there any documents
> >> available on that feature?

Yes, our BKDG.

> >> What kind of errors are delivered (corrected/correctable)?

The error can be of either type, depending on whether correction
via ECC was successful or not.


-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-02-26  2:16                                       ` Jiang, Yunhong
@ 2009-03-02 14:58                                         ` Christoph Egger
  2009-03-02 16:15                                           ` Jiang, Yunhong
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Egger @ 2009-03-02 14:58 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi

On Thursday 26 February 2009 03:16:29 Jiang, Yunhong wrote:
> Christopher/Egger, thanks for reply very much, see comments below.
>
> >-----Original Message-----
> >From: Frank.Vanderlinden@Sun.COM [mailto:Frank.Vanderlinden@Sun.COM]
> >Sent: 26 February 2009 1:33
> >To: Christoph Egger
> >Cc: Jiang, Yunhong; Kleen, Andi;
> >xen-devel@lists.xensource.com; Keir Fraser; Ke, Liping; Gavin Maltby
> >Subject: Re: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
> >
> >Christoph Egger wrote:
> >> On Wednesday 25 February 2009 03:25:12 Jiang, Yunhong wrote:
> >>> So, Frank/Egger, can I assume followed are consensus currently?
> >>>
> >>> 1) MCE is handled by Xen HV totally, while guest's vMCE
> >
> >handler will only
> >
> >>> works for itself.
> >>> 2) Xen present a virtual #MC to guest through MSR access
> >>> emulation.(Xen will do the translation if needed).
> >>> 3) Guest's unmodified
> >>> MCE handler will handle the vMCE injected.
> >>> 4) Dom0 will get all log/telemetry through hypercall.
> >>> 5) The action taken by xen will be passed to dom0 through
> >
> >the telemetry
> >
> >>> mechanism.
> >>
> >> Mostly. Regarding 2) I want like to discuss first how to
> >
> >handle errors
> >
> >> impacting multiple contiguous physical pages which are non-contigous
> >> in guest physical space.
> >>
> >>
> >>
> >> And I also want to discuss about how to do recovery actions requiring
> >> PCI access. One example for this is
> >> Shanghai's "L3 Cache Index Disable"-Feature.
> >> Xen delegates PCI config space to Dom0 and
> >> via PCI passthrough partly to DomU.
> >> That means, if registers in PCI config space are independently
> >> accessable by Xen, Dom0 and/or DomU, they can interfere with
> >
> >each other.
> >
> >> Therefore, we need to
> >> a) clearly define who handles what and
> >> b) define some rules based on a)
> >> c) discuss how to handle Dom0/DomU going wild
> >>     and break the rules defined in b)
> >
> >I also agree on the approach in principle, but would like to see these
> >points addressed. For non-contiguous pages, I suppose Xen
> >could deliver
> >multiple #vMCEs to the guest, split into contiguous parts. The
> >vmce code
> >seems to be set up to be able to do this.

For virtual MCEs that is OK. But note that for unmodified guests the MC handler
is written with the assumption that the CPU shuts down when another #MC
happens before the handler has cleared the MCIP bit in the MCG_STATUS MSR.

>
> For the contigous pages, I agree with Gavin that such contiguous page error
> should be triggered as multiple #MC and so is ok.
>
> For PCI config space issue, Christoph, can you please share more
> information on it (or provide some document as Frank suggested), like is it
> for CE (Correctable error or UC(UnCorrectable error), is it in PCI range or
> PCI-E range (i.e. through 0xCF8/CFC or through MMCONFIG), how the device's
> BDF caculated etc. Followed is some of my understanding.

I would like to see a generic solution that works with any feature
requiring access to the PCI config space rather than a per-feature solution.


> Firstly, if it is CE, Xen will do nothing and dom0 will take recovery
> action. If it is UC, Xen will take action when all CPU is in SoftIRQ
> context, and dom0 will not take action, so it should be ok.
>
> Secondly, in Xen environment, per my understanding, CPU is owned by Xen HV,
> so I'm not sure when dom0 disable L3 cache (if it is CE), should Xen be
> aware or not. That is, should dom0 disable the cache directly, or it should
> user hypercall to ask Xen do that. Keir can give us more suggestion.
>
> For item C, currently Xen/dom0 can both access configuration space, while
> domU will do that through PCI_frontend/backend. Because PCI backend only
> cover device assigned to domU, so we don't need worry about domU and dom0
> should be trusted. However, one thing left is, if this range is beyond
> 0x100 (i.e. in pci-e range), we need add mmconfig support in Xen, although
> it can be added simply.
>
> Thanks
> -- Yunhong Jiang
>
> >As for the Shanghai feature: Christoph, are there any documents
> >available on that feature? What kind of errors are delivered
> >(corrected/correctable)?
> >
> >- Frank



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-02 14:51                                         ` Christoph Egger
@ 2009-03-02 16:09                                           ` Jiang, Yunhong
  0 siblings, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-02 16:09 UTC (permalink / raw)
  To: Christoph Egger
  Cc: xen-devel, Frank.Vanderlinden@Sun.COM, Gavin, Ke, Liping, Maltby,
	Keir Fraser, Kleen, Andi

xen-devel-bounces@lists.xensource.com <> wrote:
> On Monday 02 March 2009 06:51:22 Jiang, Yunhong wrote:
>> Frank/Christopher, can you please give more comments for it, or you are OK
> 
> Sorry, for the delay. I'm also busy with other tasks.
> 
>> with this? For the action reporting mechanism, we will send out a proposal
>> for review soon.
> 
> I would like to see interface definition first, which covers
> all aspects
> we discussed.
> 

>>>> As for the Shanghai feature: Christoph, are there any documents
>>>> available on that feature?
> 
> Yes, our BKDG.

I checked the BKDG for both Family 10 and 11 (http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41256.pdf and http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116.pdf) and didn't find the related info. Can you share more info, like the URL and the section number?

> 
>>>> What kind of errors are delivered (corrected/correctable)?
> 
> The error type can be both depending on whether correction
> via ECC was successful or not.

So you mean that if ECC correction fails in the L3 cache, Xen must do "L3 Cache Index Disable" immediately so that the faulty cache index is not used anymore?

Thanks
Yunhong Jiang


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-02 14:58                                         ` Christoph Egger
@ 2009-03-02 16:15                                           ` Jiang, Yunhong
  0 siblings, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-02 16:15 UTC (permalink / raw)
  To: Christoph Egger
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi

> 
> For virtual MCEs that is ok. But note, for unmodified guests,
> the MC handler
> is written with the assumption that the CPU powers off when an #MCE
> happens before the handler cleared the MCIP bit in the MCG_STATUS MSR.

That depends on the implementation. For example, we can inject the vMCEs one by one, i.e. inject the next one only after the previous one has been handled.
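
One way to picture the one-by-one injection is the sketch below. It is only an illustration under assumptions: the queue structure, its depth and all function names are hypothetical and do not come from the posted patches. The only architectural fact used is that MCIP is bit 2 of MCG_STATUS. A queued vMCE is delivered only while the guest's virtual MCIP bit is clear, and the next one is delivered when the guest's handler clears MCIP through an emulated MSR write.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_MCIP (1ULL << 2)   /* architectural MCIP bit in MCG_STATUS */
#define VMCE_QUEUE_LEN  8             /* hypothetical queue depth */

/* Hypothetical per-vCPU state; not taken from the posted patches. */
struct vmce_queue {
    uint64_t guest_mcg_status;          /* virtualized MCG_STATUS */
    uint64_t pending[VMCE_QUEUE_LEN];   /* queued bank status values */
    unsigned int head, count;
    unsigned int injected;              /* number of vMCEs delivered so far */
};

/* Deliver the next queued vMCE only if the guest is not inside its handler. */
static bool vmce_try_inject(struct vmce_queue *q)
{
    if (q->count == 0 || (q->guest_mcg_status & MCG_STATUS_MCIP))
        return false;
    q->head = (q->head + 1) % VMCE_QUEUE_LEN;
    q->count--;
    q->guest_mcg_status |= MCG_STATUS_MCIP;   /* set on delivery, as hardware would */
    q->injected++;
    return true;
}

/* Queue an error for the guest; returns false if the queue is full. */
static bool vmce_queue_error(struct vmce_queue *q, uint64_t bank_status)
{
    if (q->count == VMCE_QUEUE_LEN)
        return false;
    q->pending[(q->head + q->count) % VMCE_QUEUE_LEN] = bank_status;
    q->count++;
    vmce_try_inject(q);
    return true;
}

/* Called when the guest writes MCG_STATUS (emulated WRMSR). */
static void vmce_guest_wrmsr_mcg_status(struct vmce_queue *q, uint64_t val)
{
    q->guest_mcg_status = val;
    vmce_try_inject(q);
}
```

With this shape an unmodified guest handler never sees a second vMCE while its MCIP bit is still set, at the cost of delaying delivery of later errors.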

> 
>> 
>> For the contigous pages, I agree with Gavin that such contiguous page error
>> should be triggered as multiple #MC and so is ok.
>> 
>> For PCI config space issue, Christoph, can you please share more
>> information on it (or provide some document as Frank suggested), like is it
>> for CE (Correctable error or UC(UnCorrectable error), is it in PCI range or
>> PCI-E range (i.e. through 0xCF8/CFC or through MMCONFIG), how the device's
>> BDF caculated etc. Followed is some of my understanding.
> 
> I would like to see a generic solution that works with any feature
> requiring access to the pci space rather a per-feature solution.

I think the solution is that Xen takes care of MCE while dom0 takes care of CE. Another solution is that all PCI access for CPU RAS is done by Xen, since Xen owns the CPU. Some information, like how the PCI config space is arranged, would be helpful, I think.
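
For reference, the two access paths mentioned in this thread (0xCF8/0xCFC and MMCONFIG) address a register as sketched below. The encodings are the standard PCI ones; the function names are made up for this example, and no particular device or register offset from the BKDG is implied.

```c
#include <stdbool.h>
#include <stdint.h>

/* Legacy mechanism: this value is written to port 0xCF8, data moves via 0xCFC. */
static uint32_t pci_cf8_address(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
    return 0x80000000u |                      /* enable bit */
           ((uint32_t)bus << 16) |
           ((uint32_t)(dev & 0x1f) << 11) |
           ((uint32_t)(fn & 0x7) << 8) |
           (reg & 0xfcu);                     /* dword-aligned register, 0x00..0xfc */
}

/* Registers at offset 0x100 and above exist only in extended config space. */
static bool pci_reg_needs_mmconfig(uint16_t reg)
{
    return reg >= 0x100;
}

/* MMCONFIG: byte offset of a register from the MMCONFIG base address. */
static uint64_t pci_mmcfg_offset(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
    return ((uint64_t)bus << 20) |
           ((uint64_t)(dev & 0x1f) << 15) |
           ((uint64_t)(fn & 0x7) << 12) |
           (reg & 0xfffu);
}
```

Whoever ends up owning the access (Xen or dom0), a register below 0x100 can be reached either way, while one at 0x100 or above needs the MMCONFIG path.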

Thanks
Yunhong Jiang

> 
> 
>> Firstly, if it is CE, Xen will do nothing and dom0 will take recovery
>> action. If it is UC, Xen will take action when all CPU is in SoftIRQ
>> context, and dom0 will not take action, so it should be ok.
>> 
>> Secondly, in Xen environment, per my understanding, CPU is owned by Xen HV,
>> so I'm not sure when dom0 disable L3 cache (if it is CE), should Xen be
>> aware or not. That is, should dom0 disable the cache directly, or it should
>> user hypercall to ask Xen do that. Keir can give us more suggestion.
>> 
>> For item C, currently Xen/dom0 can both access configuration space, while
>> domU will do that through PCI_frontend/backend. Because PCI backend only
>> cover device assigned to domU, so we don't need worry about domU and dom0
>> should be trusted. However, one thing left is, if this range is beyond
>> 0x100 (i.e. in pci-e range), we need add mmconfig support in Xen, although
>> it can be added simply. 
>> 
>> Thanks
>> -- Yunhong Jiang
>> 
>>> As for the Shanghai feature: Christoph, are there any documents
>>> available on that feature? What kind of errors are delivered
>>> (corrected/correctable)? 
>>> 
>>> - Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-02  5:51                                       ` Jiang, Yunhong
  2009-03-02 14:51                                         ` Christoph Egger
@ 2009-03-02 17:47                                         ` Frank van der Linden
  2009-03-05  4:45                                           ` Jiang, Yunhong
  2009-03-05  8:31                                           ` Jiang, Yunhong
  1 sibling, 2 replies; 45+ messages in thread
From: Frank van der Linden @ 2009-03-02 17:47 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: Christoph Egger, xen-devel, Ke, Liping, Gavin Maltby,
	Keir Fraser, Kleen, Andi

Jiang, Yunhong wrote:
> Frank/Christopher, can you please give more comments for it, or you are OK with this?
> For the action reporting mechanism, we will send out a proposal for review soon.

I'm ok with this. We need a little more information on the AMD
mechanism, but it seems to me that we can fit this in.

Sometime this week, I'll also send out the last of our changes that
haven't been sent upstream to xen-unstable yet. Maybe we can combine
some things in to one patch, like the telemetry handling changes that
Gavin did. The other changes are error injection (for debugging) and
panic crash dump support for our FMA tools, but those are probably only
interesting to us.

- Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-02 17:47                                         ` Frank van der Linden
@ 2009-03-05  4:45                                           ` Jiang, Yunhong
  2009-03-05  8:31                                           ` Jiang, Yunhong
  1 sibling, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-05  4:45 UTC (permalink / raw)
  To: Frank.Vanderlinden
  Cc: Christoph Egger, xen-devel, Ke, Liping, Gavin Maltby,
	Keir Fraser, Kleen, Andi

Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:
> Jiang, Yunhong wrote:
>> Frank/Christopher, can you please give more comments for it, or you are OK
>> with this? For the action reporting mechanism, we will send out a proposal
>> for review soon. 
> 
> I'm ok with this. We need a little more information on the AMD
> mechanism, but it seems to me that we can fit this in.
> 
> Sometime this week, I'll also send out the last of our changes that
> haven't been sent upstream to xen-unstable yet. Maybe we can combine
> some things in to one patch, like the telemetry handling changes that
> Gavin did. The other changes are error injection (for debugging) and
> panic crash dump support for our FMA tools, but those are probably only
> interesting to us. 
> 
> - Frank

Glad to know about the conclusion. See my reply to Christoph on the AMD mechanism; I'm still waiting for a response.

Thanks
Yunhong Jiang

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-02 17:47                                         ` Frank van der Linden
  2009-03-05  4:45                                           ` Jiang, Yunhong
@ 2009-03-05  8:31                                           ` Jiang, Yunhong
  2009-03-05 14:53                                             ` Christoph Egger
  1 sibling, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-05  8:31 UTC (permalink / raw)
  To: Frank.Vanderlinden
  Cc: Christoph Egger, xen-devel, Ke, Liping, Gavin Maltby,
	Keir Fraser, Kleen, Andi

Christoph/Frank, the interface definition follows; please have a look.

Thanks
Yunhong Jiang

1) Interface between Xen and dom0 for passing Xen's recovery action information to dom0.
   Usage model: after offlining a broken page, Xen may pass the result of its page-offline
   recovery action to dom0. Dom0 saves the information in non-volatile storage for later
   proactive actions, such as offlining the error-prone page early at the next reboot.


struct page_offline_action
{
    /* Params for passing the offlined page number to DOM0 */
    uint64_t mfn;
    uint64_t status; /* Similar to page offline hypercall */
};

struct cpu_offline_action
{
    /* Params for passing the identity of the offlined CPU to DOM0 */
    uint32_t mc_socketid;
    uint16_t mc_coreid;
    uint16_t mc_core_threadid;
};

struct cache_shrink_action
{
    /* TBD, Christoph, please fill it */
};

/* Recovery action flags, giving recovery result information to the guest */
/* Recovered successfully after taking one of the recovery actions below */
#define REC_ACT_RECOVERED      (0x1 << 0)
/* For Solaris's usage: dom0 will take ownership on a crash */
#define REC_ACT_RESET          (0x1 << 2)
/* No action was performed by Xen */
#define REC_ACT_INFO           (0x1 << 3)

/* Recovery action type definition, valid only when flags & REC_ACT_RECOVERED */
#define MC_ACT_PAGE_OFFLINE 1
#define MC_ACT_CPU_OFFLINE   2
#define MC_ACT_CACHE_SHIRNK 3

struct recovery_action
{
    uint8_t flags;
    uint8_t action_type;
    union
    {
        struct page_offline_action page_retire;
        struct cpu_offline_action cpu_offline;
        struct cache_shrink_action cache_shrink;
        uint8_t pad[MAX_ACTION_SIZE];
    } action_info;
};

struct mcinfo_bank {
    struct mcinfo_common common;

    uint16_t mc_bank; /* bank nr */
    uint16_t mc_domid; /* Usecase 5: domain referenced by mc_addr on dom0
                        * and if mc_addr is valid. Never valid on DomU. */
    uint64_t mc_status; /* bank status */
    uint64_t mc_addr;   /* bank address, only valid
                         * if addr bit is set in mc_status */
    uint64_t mc_misc;
    uint64_t mc_ctrl2;
    uint64_t mc_tsc;
    /* Recovery action is performed per bank */
    struct recovery_action action;
};
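
To make the usage model concrete, here is a minimal sketch of how a page-offline result could be recorded on the Xen side and read back on the dom0 side. The structure layouts follow the proposal above; MAX_ACTION_SIZE is not defined in the proposal, so the value 16 is an assumption, and both helper functions are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ACTION_SIZE     16          /* assumption: not defined in the proposal */
#define REC_ACT_RECOVERED   (0x1 << 0)
#define MC_ACT_PAGE_OFFLINE 1

struct page_offline_action {
    uint64_t mfn;
    uint64_t status;
};

struct cpu_offline_action {
    uint32_t mc_socketid;
    uint16_t mc_coreid;
    uint16_t mc_core_threadid;
};

struct recovery_action {
    uint8_t flags;
    uint8_t action_type;
    union {
        struct page_offline_action page_retire;
        struct cpu_offline_action cpu_offline;
        uint8_t pad[MAX_ACTION_SIZE];
    } action_info;
};

/* Hypothetical Xen-side helper: describe a completed page-offline action. */
static void record_page_offline(struct recovery_action *act,
                                uint64_t mfn, uint64_t status)
{
    memset(act, 0, sizeof(*act));
    act->flags = REC_ACT_RECOVERED;
    act->action_type = MC_ACT_PAGE_OFFLINE;
    act->action_info.page_retire.mfn = mfn;
    act->action_info.page_retire.status = status;
}

/* Hypothetical dom0-side helper: returns 1 and stores the MFN if the
 * record describes an offlined page, 0 otherwise. */
static int action_offlined_page(const struct recovery_action *act, uint64_t *mfn)
{
    if (!(act->flags & REC_ACT_RECOVERED) ||
        act->action_type != MC_ACT_PAGE_OFFLINE)
        return 0;
    *mfn = act->action_info.page_retire.mfn;
    return 1;
}
```

Dom0 would only interpret action_info after checking both the flag and the action type, since the union members overlap.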

2) The two interfaces below are for internal use by MCA processing.
    a. pre_handler is called early, in MCA ISR context, mainly to detect need_reset early
        so that logs are not lost (flag MCA_RESET). pre_handler may also identify the
        impacted domain when possible.
    b. mca_error_handler is a (error_action_index, recovery_handler pointer) pair.
       The recovery_handler function performs the actual recovery operations in
       softIRQ context once a per-bank MCA error matches the corresponding mca_code index.
       If pre_handler can't determine the impacted domain, recovery_handler must figure it out.

/* Flags are tested with '&' below, so each one is a distinct bit */
/* Error has been recovered successfully */
#define MCA_RECOVERD    (0x1 << 0)
/* Error impacts one guest, as stated in the owner field */
#define MCA_OWNER       (0x1 << 1)
/* Error can't be recovered and the system needs a reboot */
#define MCA_RESET       (0x1 << 2)
/* Error should be handled in softIRQ context */
#define MCA_MORE_ACTION (0x1 << 3)

struct mca_handle_result
{
    uint32_t flags;
    /* Valid only when flags & MCA_OWNER */
    domid_t owner;
    /* Valid only when flags & MCA_RECOVERD */
    struct  recovery_action *action;
};

struct mca_error_handler
{
    /*
     * Assume we will need only architecture-defined codes. If the index can't be set up from
     * mca_code, we will add a function to do the (index, recovery_handler) mapping check.
     * This mca_code is the index that identifies the recovery handler for this
     * particular error.
     */
    uint16_t mca_code;

    /* Handler to be called in softIRQ handler context */
    int (*recovery_handler)(struct mcinfo_bank *bank,
                     struct mcinfo_global *global,
                     struct mcinfo_extended *extention,
                     struct mca_handle_result *result);

};

struct mca_error_handler intel_mca_handler[] = 
{
    ....
};

struct mca_error_handler amd_mca_handler[] =
{
    ....
};


/* Handler to be called in MCA ISR in MCA context */
int intel_mca_pre_handler(struct cpu_user_regs *regs,
                                struct mca_handle_result *result);

int amd_mca_pre_handler(struct cpu_user_regs *regs,
                            struct mca_handle_result *result);
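
The (mca_code, recovery_handler) pairing could be dispatched roughly as in the sketch below. This is an illustration only: the structures are reduced stand-ins for the ones above, the recovery_handler member is written as a function pointer, the table entry and its code value are arbitrary, and MCA_RECOVERED_FLAG is a hypothetical bit flag. The one architectural fact used is that the MCA error code sits in bits 15:0 of the bank status.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for the structures in the proposal. */
struct mcinfo_bank { uint64_t mc_status; };
struct mca_handle_result { uint32_t flags; };

#define MCA_RECOVERED_FLAG (1u << 0)   /* hypothetical bit flag */

typedef int (*recovery_fn)(struct mcinfo_bank *bank,
                           struct mca_handle_result *result);

struct mca_error_handler {
    uint16_t mca_code;
    recovery_fn recovery_handler;
};

/* Example handler: pretends the error was recovered. */
static int demo_recover(struct mcinfo_bank *bank, struct mca_handle_result *result)
{
    (void)bank;
    result->flags |= MCA_RECOVERED_FLAG;
    return 0;
}

/* Example table; the code value is arbitrary, not a real vendor mapping. */
static const struct mca_error_handler demo_handlers[] = {
    { 0x0135, demo_recover },
};

/* Look up the handler registered for a given MCA error code. */
static recovery_fn mca_find_handler(const struct mca_error_handler *tbl,
                                    size_t n, uint16_t mca_code)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].mca_code == mca_code)
            return tbl[i].recovery_handler;
    return NULL;
}

/* softIRQ-side dispatch: returns -1 when no handler matches the bank's code. */
static int mca_dispatch(struct mcinfo_bank *bank, struct mca_handle_result *result)
{
    uint16_t code = (uint16_t)(bank->mc_status & 0xffff);  /* MCA error code */
    recovery_fn fn = mca_find_handler(demo_handlers,
                                      sizeof(demo_handlers) / sizeof(demo_handlers[0]),
                                      code);
    return fn ? fn(bank, result) : -1;
}
```

A plain table scan is enough here because the number of architecturally defined codes with a recovery action is expected to be small.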


Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:
> Jiang, Yunhong wrote:
>> Frank/Christopher, can you please give more comments for it, or you are OK
>> with this? For the action reporting mechanism, we will send out a proposal
>> for review soon. 
> 
> I'm ok with this. We need a little more information on the AMD
> mechanism, but it seems to me that we can fit this in.
> 
> Sometime this week, I'll also send out the last of our changes that
> haven't been sent upstream to xen-unstable yet. Maybe we can combine
> some things in to one patch, like the telemetry handling changes that
> Gavin did. The other changes are error injection (for debugging) and
> panic crash dump support for our FMA tools, but those are probably only
> interesting to us. 
> 
> - Frank

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-05  8:31                                           ` Jiang, Yunhong
@ 2009-03-05 14:53                                             ` Christoph Egger
  2009-03-05 15:19                                               ` Jiang, Yunhong
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Egger @ 2009-03-05 14:53 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi


MC_ACT_CACHE_SHIRNK  <-- typo. should be MC_ACT_CACHE_SHRINK 

The L3 cache index disable feature works like this:

You read bits 17:6 from MSR 0xC0000408 (which is MC4_MISC1)
and write them into the index field. This MSR does not belong to the standard
mc bank data and is therefore provided by mcinfo_extended.
The index field is bits 11:0 of the PCI function 3 register
"L3 Cache Index Disable".

Why is the recovery action bound to the bank?
I would like to see a struct mcinfo_recover rather than extending
struct mcinfo_bank. That gives us flexibility.
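
The bit handling for the L3 cache index disable steps described above (bits 17:6 of MC4_MISC1 into bits 11:0 of the register) can be sketched as follows. This covers only the field extraction and insertion; the function names are invented for the example, and any other fields of the register, as well as the MSR read and the PCI write themselves, are left out because they are not described here.

```c
#include <stdint.h>

#define MSR_MC4_MISC1 0xC0000408u   /* MSR number as given above */

/* Extract the cache index: bits 17:6 of the MC4_MISC1 value. */
static uint32_t l3_index_from_misc1(uint64_t misc1)
{
    return (uint32_t)((misc1 >> 6) & 0xfff);
}

/* Place the index into bits 11:0 of the register value, leaving the
 * remaining bits of the old value untouched. */
static uint32_t l3_index_disable_value(uint32_t old_reg, uint32_t index)
{
    return (old_reg & ~0xfffu) | (index & 0xfffu);
}
```

Since the MSR value arrives through mcinfo_extended, the same two helpers would work whether the PCI write is finally issued by Xen or by dom0.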

Christoph


On Thursday 05 March 2009 09:31:27 Jiang, Yunhong wrote:
> Christoph/Frank, Followed is the interface definition, please have a look.
>
> Thanks
> Yunhong Jiang
>
> 1) Interface between Xen/dom0 for passing xen's recovery action information
> to dom0. Usage model: After offlining broken page, Xen might pass its
> page-offline recovery action result information to dom0. Dom0 will save the
> information in non-volatile memory for further proactive actions, such as
> offlining the easy-broken page early when doing next reboot.
>
>
> struct page_offline_action
> {
>     /* Params for passing the offlined page number to DOM0 */
>     uint64_t mfn;
>     uint64_t status; /* Similar to page offline hypercall */
> };
>
> struct cpu_offline_action
> {
>     /* Params for passing the identity of the offlined CPU to DOM0 */
>     uint32_t mc_socketid;
>     uint16_t mc_coreid;
>     uint16_t mc_core_threadid;
> };
>
> struct cache_shrink_action
> {
>     /* TBD, Christoph, please fill it */
> };
>
> /* Recover action flags, giving recovery result information to guest */
> /* Recovery successfully after taking certain recovery actions below */
> #define REC_ACT_RECOVERED      (0x1 << 0)
> /* For solaris's usage that dom0 will take ownership when crash */
> #define REC_ACT_RESET          (0x1 << 2)
> /* No action is performed by XEN */
> #define REC_ACT_INFO           (0x1 << 3)
>
> /* Recover action type definition, valid only when flags & 
> REC_ACT_RECOVERED */
> #define MC_ACT_PAGE_OFFLINE 1 
> #define MC_ACT_CPU_OFFLINE   2
> #define MC_ACT_CACHE_SHIRNK 3
>
> struct recovery_action
> {
>     uint8_t flags;
>     uint8_t action_type;
>     union
>     {
>         struct page_offline_action page_retire;
>         struct cpu_offline_action cpu_offline;
>         struct cache_shrink_action cache_shrink;
>         uint8_t pad[MAX_ACTION_SIZE];
>     } action_info;
> }
>
> struct mcinfo_bank {
>     struct mcinfo_common common;
>
>     uint16_t mc_bank; /* bank nr */
>     uint16_t mc_domid; /* Usecase 5: domain referenced by mc_addr on dom0
>                         * and if mc_addr is valid. Never valid on DomU. */
>     uint64_t mc_status; /* bank status */
>     uint64_t mc_addr;   /* bank address, only valid
>                          * if addr bit is set in mc_status */
>     uint64_t mc_misc;
>     uint64_t mc_ctrl2;
>     uint64_t mc_tsc;
>     /* Recovery action is performed per bank */
>     struct recovery_action action;
> };
>
> 2) Below two interfaces are for MCA processing internal use.
>     a. pre_handler will be called earlier in MCA ISR context, mainly for
> early need_reset detection for avoiding log missing (flag MCA_RESET). 
> Also, pre_handler might be able to find the impacted domain if possible.
>     b. mca_error_handler is actually a (error_action_index,
> recovery_handler pointer) pair. The defined recovery_handler function
> performs the actual recovery operations in softIrq context after the
> per_bank MCA error matching the corresponding mca_code index. If
> pre_handler can't judge the impacted domain, recovery_handler must figure
> it out.
>
> /* Error has been recovered successfully */
> #define MCA_RECOVERD    0
> /* Error impact one guest as stated in owner field */
> #define MCA_OWNER       1
> /* Error can't be recovered and need reboot system */
> #define MCA_RESET       2
> /* Error should be handled in softIRQ context */
> #define MCA_MORE_ACTION 3
>
> struct mca_handle_result
> {
>     uint32_t flags;
>     /* Valid only when flags & MCA_OWNER */
>     domid_d owner;
>     /* valid only when flags & MCA_RECOVERD */
>     struct  recovery_action *action;
> };
>
> struct mca_error_handler
> {
>     /*
>      * Assume we will need only architecture defined code. If the index can't be setup by
>      * mca_code, we will add a function to do the (index, recovery_handler) mapping check.
>      * This mca_code represents the recovery handler pointer index for identifying this
>      * particular error's corresponding recover action
>     */
>     uint16_t mca_code;
>
>     /* Handler to be called in softIRQ handler context */
>     int recovery_handler(struct mcinfo_bank *bank,
>                      struct mcinfo_global *global,
>                      struct mcinfo_extended *extention,
>                      struct mca_handle_result *result);
>
> };
>
> struct mca_error_handler intel_mca_handler[] =
> {
>     ....
> };
>
> struct mca_error_handler amd_mca_handler[] =
> {
>     ....
> };
>
>
> /* HandlVer to be called in MCA ISR in MCA context */
> int intel_mca_pre_handler(struct cpu_user_regs *regs,
>                                 struct mca_handle_result *result);
>
> int amd_mca_pre_handler(struct cpu_user_regs *regs,
>                             struct mca_handle_result *result);
>
> Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:
> > Jiang, Yunhong wrote:
> >> Frank/Christopher, can you please give more comments for it, or you are
> >> OK with this? For the action reporting mechanism, we will send out a
> >> proposal for review soon.
> >
> > I'm ok with this. We need a little more information on the AMD
> > mechanism, but it seems to me that we can fit this in.
> >
> > Sometime this week, I'll also send out the last of our changes that
> > haven't been sent upstream to xen-unstable yet. Maybe we can combine
> > some things in to one patch, like the telemetry handling changes that
> > Gavin did. The other changes are error injection (for debugging) and
> > panic crash dump support for our FMA tools, but those are probably only
> > interesting to us.
> >
> > - Frank



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-05 14:53                                             ` Christoph Egger
@ 2009-03-05 15:19                                               ` Jiang, Yunhong
  2009-03-05 17:28                                                 ` Christoph Egger
  0 siblings, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-05 15:19 UTC (permalink / raw)
  To: Christoph Egger
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi

Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
> MC_ACT_CACHE_SHIRNK  <-- typo. should be MC_ACT_CACHE_SHRINK

Ahh, yes, I will fix it.

> 
> The L3 cache index disable feature works like this:
> 
> You read the bits 17:6  from the MSR 0xC0000408 (which is MC4_MISC1)
> and write it into the index field. This MSR does not belong to
> the standard
> mc bank data and is therefore provided by mcinfo_extended.
> The index field are the bits 11:0 of the PCI function 3 register "L3 Cache
> Index Disable". 

So what's the offset of "L3 Cache Index Disable"? Is it within the first 256 bytes of config space or in the 4K-byte extended range?

For the PCI access, I'd prefer to have Xen control all of this, i.e. even if dom0 wants to disable the L3 cache, it does so through a hypercall. The reason is that Xen controls the CPU, so keeping this in Xen will make things simpler.

It is also OK for me, of course, if you want Xen to keep the #MC handler and dom0 the CE handler.

> 
> Why is the recover action bound to the bank ?
> I would like to see a struct mcinfo_recover  rather extending
> struct mcinfo_bank.  That gives us flexibility.

I'd like to get input from Frank or Gavin. Placing mcinfo_recover in mcinfo_bank has the advantage of keeping the connection between the error source and the action, but it does make mcinfo_bank more complex. Alternatively, we can also keep the cpu/bank information in mcinfo_recover, so that we keep the flexibility without losing the connection.
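
As a rough picture of the second option, a standalone mcinfo_recover that carries its own cpu/bank reference might look like the sketch below. Every field name, the cpu_nr identifier and the lookup helper are hypothetical; only the idea of pairing the action with the cpu/bank it belongs to comes from this discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the common header that the mcinfo_* records start with. */
struct mcinfo_common {
    uint16_t type;
    uint16_t size;
};

/* Hypothetical standalone recovery record: the action plus the cpu/bank
 * it refers to, so the link to the error source is kept. */
struct mcinfo_recover {
    struct mcinfo_common common;
    uint16_t mc_bank;       /* bank the action was taken for */
    uint8_t  flags;         /* REC_ACT_* style flags */
    uint8_t  action_type;   /* MC_ACT_* style type */
    uint32_t cpu_nr;        /* hypothetical CPU identifier */
    uint64_t detail;        /* action-specific data, e.g. an MFN */
};

/* Find the recovery record that belongs to a given cpu/bank, or NULL. */
static const struct mcinfo_recover *
find_recover(const struct mcinfo_recover *recs, size_t n,
             uint32_t cpu_nr, uint16_t bank)
{
    for (size_t i = 0; i < n; i++)
        if (recs[i].cpu_nr == cpu_nr && recs[i].mc_bank == bank)
            return &recs[i];
    return NULL;
}
```

With this layout mcinfo_bank stays unchanged, and a consumer that wants the action for a bank does one extra lookup.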

Thanks
Yunhong Jiang



> 
> Christoph
> 
> 
> On Thursday 05 March 2009 09:31:27 Jiang, Yunhong wrote:
>> Christoph/Frank, Followed is the interface definition, please have a look.
>> 
>> Thanks
>> Yunhong Jiang
>> 
>> 1) Interface between Xen/dom0 for passing xen's recovery action information
>> to dom0. Usage model: After offlining broken page, Xen might pass its
>> page-offline recovery action result information to dom0. Dom0 will save the
>> information in non-volatile memory for further proactive actions, such as
>> offlining the easy-broken page early when doing next reboot.
>> 
>> 
>> struct page_offline_action
>> {
>>     /* Params for passing the offlined page number to DOM0 */
>>     uint64_t mfn;
>>     uint64_t status; /* Similar to page offline hypercall */
>> };
>> 
>> struct cpu_offline_action
>> {
>>     /* Params for passing the identity of the offlined CPU to DOM0 */
>>     uint32_t mc_socketid;
>>     uint16_t mc_coreid;
>>     uint16_t mc_core_threadid;
>> };
>> 
>> struct cache_shrink_action
>> {
>>     /* TBD, Christoph, please fill it */
>> };
>> 
>> /* Recover action flags, giving recovery result information to guest */
>> /* Recovery successfully after taking certain recovery actions below */
>> #define REC_ACT_RECOVERED      (0x1 << 0)
>> /* For solaris's usage that dom0 will take ownership when crash */
>> #define REC_ACT_RESET          (0x1 << 2)
>> /* No action is performed by XEN */
>> #define REC_ACT_INFO           (0x1 << 3)
>> 
>> /* Recover action type definition, valid only when flags & REC_ACT_RECOVERED */
>> #define MC_ACT_PAGE_OFFLINE 1
>> #define MC_ACT_CPU_OFFLINE   2
>> #define MC_ACT_CACHE_SHIRNK 3
>> 
>> struct recovery_action
>> {
>>     uint8_t flags;
>>     uint8_t action_type;
>>     union
>>     {
>>         struct page_offline_action page_retire;
>>         struct cpu_offline_action cpu_offline;
>>         struct cache_shrink_action cache_shrink;
>>         uint8_t pad[MAX_ACTION_SIZE];
>>     } action_info;
>> };
>> 
>> struct mcinfo_bank {
>>     struct mcinfo_common common;
>> 
>>     uint16_t mc_bank; /* bank nr */
>>     uint16_t mc_domid; /* Usecase 5: domain referenced by mc_addr on dom0
>>                         * and if mc_addr is valid. Never valid on DomU. */
>>     uint64_t mc_status; /* bank status */
>>     uint64_t mc_addr;   /* bank address, only valid
>>                          * if addr bit is set in mc_status */
>>     uint64_t mc_misc;
>>     uint64_t mc_ctrl2;
>>     uint64_t mc_tsc;
>>     /* Recovery action is performed per bank */
>>     struct recovery_action action;
>> };
>> 
>> 2) Below two interfaces are for MCA processing internal use.
>>     a. pre_handler will be called earlier in MCA ISR context, mainly for
>> early need_reset detection for avoiding log missing (flag MCA_RESET).
>> Also, pre_handler might be able to find the impacted domain if possible.
>>     b. mca_error_handler is actually a (error_action_index,
>> recovery_handler pointer) pair. The defined recovery_handler function
>> performs the actual recovery operations in softIrq context after the
>> per_bank MCA error matching the corresponding mca_code index. If
>> pre_handler can't judge the impacted domain, recovery_handler must figure
>> it out. 
>> 
>> /* Error has been recovered successfully */
>> #define MCA_RECOVERD    0
>> /* Error impact one guest as stated in owner field */
>> #define MCA_OWNER       1
>> /* Error can't be recovered and need reboot system */
>> #define MCA_RESET       2
>> /* Error should be handled in softIRQ context */
>> #define MCA_MORE_ACTION 3
>> 
>> struct mca_handle_result
>> {
>>     uint32_t flags;
>>     /* Valid only when flags & MCA_OWNER */
>>     domid_t owner;
>>     /* valid only when flags & MCA_RECOVERD */
>>     struct  recovery_action *action;
>> };
>> 
>> struct mca_error_handler
>> {
>>     /*
>>      * Assume we will need only architecture defined code. If the index
>>      * can't be setup by mca_code, we will add a function to do the (index,
>>      * recovery_handler) mapping check.
>>      * This mca_code represents the recovery handler pointer index for
>>      * identifying this particular error's corresponding recover action
>>      */
>>     uint16_t mca_code;
>> 
>>     /* Handler to be called in softIRQ handler context */
>>     int (*recovery_handler)(struct mcinfo_bank *bank,
>>                             struct mcinfo_global *global,
>>                             struct mcinfo_extended *extention,
>>                             struct mca_handle_result *result);
>> 
>> };
>> 
>> struct mca_error_handler intel_mca_handler[] =
>> {
>>     ....
>> };
>> 
>> struct mca_error_handler amd_mca_handler[] =
>> {
>>     ....
>> };
>> 
>> 
>> /* Handler to be called in MCA ISR in MCA context */
>> int intel_mca_pre_handler(struct cpu_user_regs *regs,
>>                                 struct mca_handle_result *result);
>> 
>> int amd_mca_pre_handler(struct cpu_user_regs *regs,
>>                             struct mca_handle_result *result);
>> 
>> Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:
>>> Jiang, Yunhong wrote:
>>>> Frank/Christopher, can you please give more comments for it, or you are
>>>> OK with this? For the action reporting mechanism, we will send out a
>>>> proposal for review soon.
>>> 
>>> I'm ok with this. We need a little more information on the AMD
>>> mechanism, but it seems to me that we can fit this in.
>>> 
>>> Sometime this week, I'll also send out the last of our changes that
>>> haven't been sent upstream to xen-unstable yet. Maybe we can combine
>>> some things in to one patch, like the telemetry handling changes that
>>> Gavin did. The other changes are error injection (for debugging) and
>>> panic crash dump support for our FMA tools, but those are probably only
>>> interesting to us. 
>>> 
>>> - Frank
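
The (mca_code, recovery_handler) pairing in the quoted interface definition could be dispatched roughly as sketched below. This is only an illustration under several assumptions: the structures are reduced to the fields the sketch needs, the handler takes fewer arguments than in the proposal, the code value 0x0001 and the demo handler are made up, and the result flag is given a bit value (MCA_F_RECOVERED, a hypothetical name) so that a `flags & ...` test works. With the posted values 0..3, such a test could never be true for MCA_RECOVERD, which is 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit-valued result flag so that "flags & X" tests work. */
#define MCA_F_RECOVERED (1u << 0)

/* Reduced stand-ins for the structures in the proposal. */
struct mcinfo_bank       { uint16_t mc_bank; uint64_t mc_status; };
struct mca_handle_result { uint32_t flags; };

typedef int (*recovery_handler_t)(struct mcinfo_bank *bank,
                                  struct mca_handle_result *result);

struct mca_error_handler {
    uint16_t           mca_code;         /* index identifying the error */
    recovery_handler_t recovery_handler; /* called in softIRQ context */
};

/* Hypothetical handler: pretends the recovery succeeded. */
static int demo_recover(struct mcinfo_bank *bank,
                        struct mca_handle_result *result)
{
    (void)bank;
    result->flags |= MCA_F_RECOVERED;
    return 0;
}

/* Hypothetical table; 0x0001 is not a real architectural code. */
static const struct mca_error_handler demo_handlers[] = {
    { 0x0001, demo_recover },
};

/* Linear search for the handler registered for mca_code. */
static const struct mca_error_handler *
mca_find_handler(const struct mca_error_handler *tbl, size_t n,
                 uint16_t mca_code)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].mca_code == mca_code)
            return &tbl[i];
    return NULL;
}

/* Run the matching handler; returns -1 when no handler is registered. */
static int mca_dispatch(const struct mca_error_handler *tbl, size_t n,
                        uint16_t mca_code, struct mcinfo_bank *bank,
                        struct mca_handle_result *result)
{
    const struct mca_error_handler *h = mca_find_handler(tbl, n, mca_code);

    return h ? h->recovery_handler(bank, result) : -1;
}
```

A vendor-specific table such as intel_mca_handler[] or amd_mca_handler[] would take the place of demo_handlers[] here.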


* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-05 15:19                                               ` Jiang, Yunhong
@ 2009-03-05 17:28                                                 ` Christoph Egger
  2009-03-06  2:11                                                   ` Jiang, Yunhong
  2009-03-10  1:19                                                   ` Jiang, Yunhong
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Egger @ 2009-03-05 17:28 UTC (permalink / raw)
  To: xen-devel
  Cc: Gavin Maltby, Jiang, Yunhong, Ke, Liping,
	Frank.Vanderlinden@Sun.COM, Keir Fraser, Kleen, Andi

On Thursday 05 March 2009 16:19:40 Jiang, Yunhong wrote:
> Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
> > MC_ACT_CACHE_SHIRNK  <-- typo. should be MC_ACT_CACHE_SHRINK
>
> Ahh, yes, I will fix it.
>
> > The L3 cache index disable feature works like this:
> >
> > You read the bits 17:6  from the MSR 0xC0000408 (which is MC4_MISC1)
> > and write it into the index field. This MSR does not belong to
> > the standard
> > mc bank data and is therefore provided by mcinfo_extended.
> > The index field are the bits 11:0 of the PCI function 3 register "L3
> > Cache Index Disable".
>
> So what's the offset of "L3 Cache Index Disable"? Is it in 256 byte or 4K
> byte?

Sorry, which offset do you mean ?

>
> For the PCI access, I'd prefer to have xen to control all these, i.e. even
> if dom0 want to disable the L3 cache, it is done through a hypercall. The
> reason is, Xen control the CPU, so keep it in Xen will make things simpler.
>
> Of course, it is ok for me too, if you want to keep Xen for #MC handler and
> Dom0 for CE handler.

We still need to define the rules to prevent interference and clarify how to
deal with Dom0/DomU going wild and breaking the rules.

> > Why is the recover action bound to the bank ?
> > I would like to see a struct mcinfo_recover  rather extending
> > struct mcinfo_bank.  That gives us flexibility.
>
> I'd get input from Frank or Gavin. Place mcinfo_recover in mcinfo_back has
> advantage of keep connection of the error source and the action, but it do
> make the mcinfo_bank more complex. Or we can keep the cpu/bank information
> in the mcinfo_recover also, so that we keep the flexibility and don't lose
> the connection.

Of your suggestions I prefer the last one, but it is still limited by
the assumption that each struct mcinfo_bank and each struct mcinfo_extended
stands for exactly one error.

This assumption doesn't cover follow-up errors which may be needed to 
determine the real root cause. Some of them may even be ignored
depending on what is going on.

Christoph


-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632


* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-05 17:28                                                 ` Christoph Egger
@ 2009-03-06  2:11                                                   ` Jiang, Yunhong
  2009-03-10  1:19                                                   ` Jiang, Yunhong
  1 sibling, 0 replies; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-06  2:11 UTC (permalink / raw)
  To: Christoph Egger, xen-devel
  Cc: Kleen, Andi, Frank.Vanderlinden@Sun.COM, Keir Fraser,
	Gavin Maltby, Ke, Liping

Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
> On Thursday 05 March 2009 16:19:40 Jiang, Yunhong wrote:
>> Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
>>> MC_ACT_CACHE_SHIRNK  <-- typo. should be MC_ACT_CACHE_SHRINK
>> 
>> Ahh, yes, I will fix it.
>> 
>>> The L3 cache index disable feature works like this:
>>> 
>>> You read the bits 17:6  from the MSR 0xC0000408 (which is MC4_MISC1)
>>> and write it into the index field. This MSR does not belong to
>>> the standard
>>> mc bank data and is therefore provided by mcinfo_extended.
>>> The index field are the bits 11:0 of the PCI function 3 register "L3
>>> Cache Index Disable".
>> 
>> So what's the offset of "L3 Cache Index Disable"? Is it in 256 byte or 4K
>> byte?
> 
> Sorry, which offset do you mean ?

I mean the offset of this register in the PCI function's configuration space. As you know, a PCI device has 256 bytes of configuration space, while a PCI-E device has 4K.
Currently Xen can already access the 256-byte config space; supporting the 4K range, however, requires more work (mmconfig parsing, etc.). That's why I asked for the offset of this register.

> 
>> 
>> For the PCI access, I'd prefer to have xen to control all these, i.e. even
>> if dom0 want to disable the L3 cache, it is done through a hypercall. The
>> reason is, Xen control the CPU, so keep it in Xen will make things simpler.
>> 
>> Of course, it is ok for me too, if you want to keep Xen for #MC handler and
>> Dom0 for CE handler.
> 
> We still need to define the rules to prevent interferes and
> clarify how to
> deal with Dom0/DomU going wild and breaking the rules.

As discussed previously, we don't need to be concerned about DomU; all configuration space accesses from domU are intercepted by dom0.

For Dom0, since all PCI accesses through 0xcf8/0xcfc are currently intercepted by Xen, Xen can do the checking. We can achieve the same checking for mmconfig if we remove that range from dom0. But I have to say I'm not sure we need to be too concerned about what happens when dom0 goes wild (after all, a crash in dom0 will lose everything), especially since interfering with such accesses will not cause a security issue (please correct me if I'm wrong).
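
As an illustration of the kind of checking Xen could do on intercepted 0xcf8/0xcfc accesses, here is a minimal sketch that decodes the address latched in 0xcf8 (standard configuration mechanism #1 layout) and compares it with a register that Xen wants to own. The policy function and the structure are hypothetical, and real interception would also have to track the data-port access on 0xcfc and the access width, which this sketch leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded form of a 0xcf8 configuration address. */
struct pci_cfg_addr {
    bool    enable; /* bit 31: configuration cycle enabled */
    uint8_t bus;    /* bits 23:16 */
    uint8_t dev;    /* bits 15:11 */
    uint8_t func;   /* bits 10:8 */
    uint8_t reg;    /* bits 7:2, kept as a dword-aligned byte offset */
};

static inline struct pci_cfg_addr pci_decode_cf8(uint32_t cf8)
{
    struct pci_cfg_addr a;

    a.enable = (cf8 >> 31) & 0x1u;
    a.bus    = (uint8_t)((cf8 >> 16) & 0xFFu);
    a.dev    = (uint8_t)((cf8 >> 11) & 0x1Fu);
    a.func   = (uint8_t)((cf8 >> 8) & 0x07u);
    a.reg    = (uint8_t)(cf8 & 0xFCu);
    return a;
}

/*
 * Hypothetical policy: refuse dom0 writes that target the one register
 * described by 'owned'; everything else passes through.
 */
static inline bool dom0_cfg_write_allowed(uint32_t cf8,
                                          const struct pci_cfg_addr *owned)
{
    struct pci_cfg_addr a = pci_decode_cf8(cf8);

    if (!a.enable)
        return true; /* not a configuration cycle */
    return !(a.bus == owned->bus && a.dev == owned->dev &&
             a.func == owned->func && a.reg == owned->reg);
}
```

The bus/device/function/offset values a caller passes in 'owned' are up to the policy; the sketch does not assume where the L3-related register lives.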

> 
>>> Why is the recover action bound to the bank ?
>>> I would like to see a struct mcinfo_recover  rather extending
>>> struct mcinfo_bank.  That gives us flexibility.
>> 
>> I'd get input from Frank or Gavin. Place mcinfo_recover in mcinfo_back has
>> advantage of keep connection of the error source and the action, but it do
>> make the mcinfo_bank more complex. Or we can keep the cpu/bank information
>> in the mcinfo_recover also, so that we keep the flexibility and don't lose
>> the connection.
> 
> From your suggestions I prefer the last one, but is still limited due
> to the assumption that each struct mcinfo_bank and each struct
> mcinfo_extended stands for exactly one error.
> 
> This assumption doesn't cover follow-up errors which may be needed to
> determine the real root cause. Some of them may even be ignored
> depending on what is going on.

I think the assumption here is that a recovery action will be triggered by only one bank. For example, we offline a page because one MC bank tells us that the page is broken.

The "follow-up errors" are interesting to me; do you have an example? It's OK for us not to include the bank information if there is such a requirement.

Thanks
Yunhong Jiang


* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-05 17:28                                                 ` Christoph Egger
  2009-03-06  2:11                                                   ` Jiang, Yunhong
@ 2009-03-10  1:19                                                   ` Jiang, Yunhong
  2009-03-10 19:08                                                     ` Christoph Egger
  1 sibling, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-10  1:19 UTC (permalink / raw)
  To: Jiang, Yunhong, Christoph Egger, xen-devel
  Cc: Kleen, Andi, Frank.Vanderlinden@Sun.COM, Keir Fraser,
	Gavin Maltby, Ke, Liping

Christoph/Frank, do you have any comments?

Thanks
Yunhong Jiang

Jiang, Yunhong <> wrote:
> Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
>> On Thursday 05 March 2009 16:19:40 Jiang, Yunhong wrote:
>>> Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
>>>> MC_ACT_CACHE_SHIRNK  <-- typo. should be MC_ACT_CACHE_SHRINK
>>> 
>>> Ahh, yes, I will fix it.
>>> 
>>>> The L3 cache index disable feature works like this:
>>>> 
>>>> You read the bits 17:6  from the MSR 0xC0000408 (which is MC4_MISC1)
>>>> and write it into the index field. This MSR does not belong to
>>>> the standard
>>>> mc bank data and is therefore provided by mcinfo_extended.
>>>> The index field are the bits 11:0 of the PCI function 3 register "L3
>>>> Cache Index Disable".
>>> 
>>> So what's the offset of "L3 Cache Index Disable"? Is it in 256 byte or 4K
>>> byte?
>> 
>> Sorry, which offset do you mean ?
> 
> I mean the offset of this register in the PCI function's
> configuration space. You know for a PCI device, it has 256
> byte configuration register while PCI-E device has 4K
> configuration register.
> Currently xen can access the 256 byte config register already,
> however, to support 4K range, it requires more stuff, like
> mmconfig sparse etc. That's the reason I ask the offset of
> this register.
> 
>> 
>>> 
>>> For the PCI access, I'd prefer to have xen to control all these, i.e. even
>>> if dom0 want to disable the L3 cache, it is done through a hypercall. The
>>> reason is, Xen control the CPU, so keep it in Xen will make things
>>> simpler. 
>>> 
>>> Of course, it is ok for me too, if you want to keep Xen for #MC handler
>>> and Dom0 for CE handler.
>> 
>> We still need to define the rules to prevent interferes and
>> clarify how to
>> deal with Dom0/DomU going wild and breaking the rules.
> 
> As discussed previously,  we don't need concern about DomU,
> all configuration space access from domU will be intercepted by dom0.
> 
> For Dom0, since currently all PCI access to 0xcf8/cfc will be
> intercepted by Xen,  so Xen can do checking. We can achieve
> same checking for mmconfig if remove that range from dom0. But
> I have to say I'm not sure if we do need concern too much what
> will happen when dom0 going wild ( after all, a crash in dom0
> will lost everything), especially interfere on such access
> will not cause security issue (please correct me if I'm wrong ).
> 
>> 
>>>> Why is the recover action bound to the bank ?
>>>> I would like to see a struct mcinfo_recover  rather extending
>>>> struct mcinfo_bank.  That gives us flexibility.
>>> 
>>> I'd get input from Frank or Gavin. Place mcinfo_recover in mcinfo_back has
>>> advantage of keep connection of the error source and the action, but it do
>>> make the mcinfo_bank more complex. Or we can keep the cpu/bank information
>>> in the mcinfo_recover also, so that we keep the flexibility and don't lose
>>> the connection.
>> 
>> From your suggestions I prefer the last one, but is still limited due
>> to the assumption that each struct mcinfo_bank and each struct
>> mcinfo_extended stands for exactly one error.
>> 
>> This assumption doesn't cover follow-up errors which may be needed to
>> determine the real root cause. Some of them may even be ignored
>> depending on what is going on.
> 
> I think the assumption here is a recover action will be
> triggered only by one bank. For example, we offline page
> because one MC bank tell us that page is broken.
> 
> The "follow-up errors" is something interesting to me, do you
> have any example? It's ok for us to not include the back
> information if there are such requirement.
> 
> Thanks
> Yunhong Jiang
> 
>> 
>> Christoph
>> 
>>> 
>>> Thanks
>>> Yunhong Jiang
>>> 
>>>> Christoph
>>>> 
>>>> On Thursday 05 March 2009 09:31:27 Jiang, Yunhong wrote:
>>>>> Christoph/Frank, Followed is the interface definition, please have a
>>>>> look. 
>>>>> 
>>>>> Thanks
>>>>> Yunhong Jiang
>>>>> 
>>>>> 1) Interface between Xen/dom0 for passing xen's recovery action
>>>>> information to dom0. Usage model: After offlining broken page, Xen might
>>>>> pass its page-offline recovery action result information to dom0. Dom0
>>>>> will save the information in non-volatile memory for further proactive
>>>>> actions, such as offlining the easy-broken page early when doing next
>>>>> reboot. 
>>>>> 
>>>>> 
>>>>> struct page_offline_action
>>>>> {
>>>>>     /* Params for passing the offlined page number to DOM0 */
>>>>>     uint64_t mfn;
>>>>>     uint64_t status; /* Similar to page offline hypercall */
>>>>> };
>>>>> 
>>>>> struct cpu_offline_action
>>>>> {
>>>>>     /* Params for passing the identity of the offlined CPU to DOM0 */
>>>>>     uint32_t mc_socketid;
>>>>>     uint16_t mc_coreid;
>>>>>     uint16_t mc_core_threadid;
>>>>> };
>>>>> 
>>>>> struct cache_shrink_action
>>>>> {
>>>>>     /* TBD, Christoph, please fill it */
>>>>> };
>>>>> 
>>>>> /* Recover action flags, giving recovery result information to guest */
>>>>> /* Recovery successfully after taking certain recovery actions below */
>>>>> #define REC_ACT_RECOVERED      (0x1 << 0)
>>>>> /* For solaris's usage that dom0 will take ownership when crash */
>>>>> #define REC_ACT_RESET          (0x1 << 2)
>>>>> /* No action is performed by XEN */
>>>>> #define REC_ACT_INFO           (0x1 << 3)
>>>>> 
>>>>> /* Recover action type definition, valid only when flags & REC_ACT_RECOVERED */
>>>>> #define MC_ACT_PAGE_OFFLINE 1
>>>>> #define MC_ACT_CPU_OFFLINE   2
>>>>> #define MC_ACT_CACHE_SHIRNK 3
>>>>> 
>>>>> struct recovery_action
>>>>> {
>>>>>     uint8_t flags;
>>>>>     uint8_t action_type;
>>>>>     union
>>>>>     {
>>>>>         struct page_offline_action page_retire;
>>>>>         struct cpu_offline_action cpu_offline;
>>>>>         struct cache_shrink_action cache_shrink;
>>>>>         uint8_t pad[MAX_ACTION_SIZE];
>>>>>     } action_info;
>>>>> };
>>>>> 
>>>>> struct mcinfo_bank {
>>>>>     struct mcinfo_common common;
>>>>> 
>>>>>     uint16_t mc_bank;   /* bank nr */
>>>>>     uint16_t mc_domid;  /* Usecase 5: domain referenced by mc_addr on dom0
>>>>>                          * and if mc_addr is valid. Never valid on DomU. */
>>>>>     uint64_t mc_status; /* bank status */
>>>>>     uint64_t mc_addr;   /* bank address, only valid
>>>>>                          * if addr bit is set in mc_status */
>>>>>     uint64_t mc_misc;
>>>>>     uint64_t mc_ctrl2;
>>>>>     uint64_t mc_tsc;
>>>>>     /* Recovery action is performed per bank */
>>>>>     struct recovery_action action;
>>>>> };
>>>>> 
>>>>> 2) Below two interfaces are for MCA processing internal use.
>>>>>     a. pre_handler will be called earlier in MCA ISR context, mainly for
>>>>> early need_reset detection for avoiding log missing (flag MCA_RESET).
>>>>> Also, pre_handler might be able to find the impacted domain if possible.
>>>>>     b. mca_error_handler is actually a (error_action_index,
>>>>> recovery_handler pointer) pair. The defined recovery_handler function
>>>>> performs the actual recovery operations in softIrq context after the
>>>>> per_bank MCA error matching the corresponding mca_code index. If
>>>>> pre_handler can't judge the impacted domain, recovery_handler must
>>>>> figure it out. 
>>>>> 
>>>>> /* Error has been recovered successfully */
>>>>> #define MCA_RECOVERD    0
>>>>> /* Error impact one guest as stated in owner field */
>>>>> #define MCA_OWNER       1
>>>>> /* Error can't be recovered and need reboot system */
>>>>> #define MCA_RESET       2
>>>>> /* Error should be handled in softIRQ context */
>>>>> #define MCA_MORE_ACTION 3
>>>>> 
>>>>> struct mca_handle_result
>>>>> {
>>>>>     uint32_t flags;
>>>>>     /* Valid only when flags & MCA_OWNER */
>>>>>     domid_t owner;
>>>>>     /* valid only when flags & MCA_RECOVERD */
>>>>>     struct  recovery_action *action;
>>>>> };
>>>>> 
>>>>> struct mca_error_handler
>>>>> {
>>>>>     /*
>>>>>      * Assume we will need only architecture defined code. If the index
>>>>>      * can't be setup by mca_code, we will add a function to do the
>>>>>      * (index, recovery_handler) mapping check.
>>>>>      * This mca_code represents the recovery handler pointer index for
>>>>>      * identifying this particular error's corresponding recover action
>>>>>      */
>>>>>     uint16_t mca_code;
>>>>> 
>>>>>     /* Handler to be called in softIRQ handler context */
>>>>>     int (*recovery_handler)(struct mcinfo_bank *bank,
>>>>>                      struct mcinfo_global *global,
>>>>>                      struct mcinfo_extended *extention,
>>>>>                      struct mca_handle_result *result);
>>>>> 
>>>>> };
>>>>> 
>>>>> struct mca_error_handler intel_mca_handler[] =
>>>>> {
>>>>>     ....
>>>>> };
>>>>> 
>>>>> struct mca_error_handler amd_mca_handler[] =
>>>>> {
>>>>>     ....
>>>>> };
>>>>> 
>>>>> 
>>>>> /* Handler to be called in MCA ISR in MCA context */
>>>>> int intel_mca_pre_handler(struct cpu_user_regs *regs,
>>>>>                                 struct mca_handle_result *result);
>>>>> 
>>>>> int amd_mca_pre_handler(struct cpu_user_regs *regs,
>>>>>                             struct mca_handle_result *result);
>>>>> 
>>>>> Frank.Vanderlinden@Sun.COM <mailto:Frank.Vanderlinden@Sun.COM> wrote:
>>>>>> Jiang, Yunhong wrote:
>>>>>>> Frank/Christopher, can you please give more comments for it, or you
>>>>>>> are OK with this? For the action reporting mechanism, we will send out
>>>>>>> a proposal for review soon.
>>>>>> 
>>>>>> I'm ok with this. We need a little more information on the AMD
>>>>>> mechanism, but it seems to me that we can fit this in.
>>>>>> 
>>>>>> Sometime this week, I'll also send out the last of our changes that
>>>>>> haven't been sent upstream to xen-unstable yet. Maybe we can combine
>>>>>> some things in to one patch, like the telemetry handling changes that
>>>>>> Gavin did. The other changes are error injection (for debugging) and
>>>>>> panic crash dump support for our FMA tools, but those are probably
>>>>>> only interesting to us. 
>>>>>> 
>>>>>> - Frank
>>>> 
>>>> --
>>>> ---to satisfy European Law for business letters:
>>>> Advanced Micro Devices GmbH
>>>> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
>>>> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
>>>> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
>>>> Registergericht Muenchen, HRB Nr. 43632

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-10  1:19                                                   ` Jiang, Yunhong
@ 2009-03-10 19:08                                                     ` Christoph Egger
  2009-03-12 15:52                                                       ` Jiang, Yunhong
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Egger @ 2009-03-10 19:08 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi

On Tuesday 10 March 2009 02:19:04 Jiang, Yunhong wrote:
> Christoph/Frank, do you have any comments?
>
> Thanks
> Yunhong Jiang
>
> Jiang, Yunhong <> wrote:
> > Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
> >> On Thursday 05 March 2009 16:19:40 Jiang, Yunhong wrote:
> >>> Christoph Egger <mailto:Christoph.Egger@amd.com> wrote:
> >>>> MC_ACT_CACHE_SHIRNK  <-- typo. should be MC_ACT_CACHE_SHRINK
> >>>
> >>> Ahh, yes, I will fix it.
> >>>
> >>>> The L3 cache index disable feature works like this:
> >>>>
> >>>> You read the bits 17:6  from the MSR 0xC0000408 (which is MC4_MISC1)
> >>>> and write it into the index field. This MSR does not belong to
> >>>> the standard
> >>>> mc bank data and is therefore provided by mcinfo_extended.
> >>>> The index field are the bits 11:0 of the PCI function 3 register "L3
> >>>> Cache Index Disable".
> >>>
> >>> So what's the offset of "L3 Cache Index Disable"? Is it in 256 byte or
> >>> 4K byte?
> >>
> >> Sorry, which offset do you mean ?
> >
> > I mean the offset of this register in the PCI function's
> > configuration space. You know for a PCI device, it has 256
> > byte configuration register while PCI-E device has 4K
> > configuration register.
> > Currently xen can access the 256 byte config register already,
> > however, to support 4K range, it requires more stuff, like
> > mmconfig sparse etc. That's the reason I ask the offset of
> > this register.

Ah, I see. The registers of our memory controller are in the
PCI config space. It's not a PCI-E device.
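
For readers following the bit positions, the copy described in the quoted text (bits 17:6 of MC4_MISC1 into bits 11:0 of the index field) can be sketched as below. This is only an illustration: the helper names are made up here, and the actual MSR read and PCI register write are left out.

```c
#include <stdint.h>

/* MSR number given in the thread for MC4_MISC1. */
#define MSR_MC4_MISC1 0xC0000408u

/* Hypothetical helper: extract bits 17:6 of the MC4_MISC1 value. */
static inline uint32_t mc4_misc1_to_l3_index(uint64_t misc1)
{
    return (uint32_t)((misc1 >> 6) & 0xFFFu);
}

/*
 * Hypothetical helper: place a 12-bit index into bits 11:0 of the
 * "L3 Cache Index Disable" register value, leaving the other bits as is.
 */
static inline uint32_t l3_index_disable_set_index(uint32_t reg, uint32_t index)
{
    return (reg & ~0xFFFu) | (index & 0xFFFu);
}
```

The caller would read the MSR, pass the value through the first helper, and feed the result to the second one before writing the register back.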

> >>> For the PCI access, I'd prefer to have xen to control all these, i.e.
> >>> even if dom0 want to disable the L3 cache, it is done through a
> >>> hypercall. The reason is, Xen control the CPU, so keep it in Xen will
> >>> make things simpler.
> >>>
> >>> Of course, it is ok for me too, if you want to keep Xen for #MC handler
> >>> and Dom0 for CE handler.
> >>
> >> We still need to define the rules to prevent interference and
> >> clarify how to
> >> deal with Dom0/DomU going wild and breaking the rules.
> >
> > As discussed previously, we don't need to be concerned about DomU;
> > all configuration space access from domU will be intercepted by dom0.
> >
> > For Dom0, since currently all PCI access to 0xcf8/cfc will be
> > intercepted by Xen, Xen can do checking. We can achieve the
> > same checking for mmconfig if we remove that range from dom0. But
> > I have to say I'm not sure we need to be too concerned about what
> > will happen when dom0 goes wild (after all, a crash in dom0
> > will lose everything), especially as interference with such access
> > will not cause a security issue (please correct me if I'm wrong).

This sounds like an assumption that an IOMMU is always available.


> >>>> Why is the recover action bound to the bank ?
> >>>> I would like to see a struct mcinfo_recover  rather extending
> >>>> struct mcinfo_bank.  That gives us flexibility.
> >>>
> >>> I'd get input from Frank or Gavin. Placing mcinfo_recover in mcinfo_bank
> >>> has the advantage of keeping the connection between the error source and
> >>> the action, but it does make mcinfo_bank more complex. Or we can keep the
> >>> cpu/bank information in mcinfo_recover as well, so that we keep the
> >>> flexibility and don't lose the connection.
> >>
> >> From your suggestions I prefer the last one, but is still limited due
> >> to the assumption that each struct mcinfo_bank and each struct
> >> mcinfo_extended stands for exactly one error.
> >>
> >> This assumption doesn't cover follow-up errors which may be needed to
> >> determine the real root cause. Some of them may even be ignored
> >> depending on what is going on.
> >
> > I think the assumption here is a recover action will be
> > triggered only by one bank. For example, we offline page
> > because one MC bank tell us that page is broken.

Only if the bank is the one from the memory controller.
What if the bank is the Data or Instruction Cache ?

> > The "follow-up errors" are interesting to me, do you
> > have any example? It's ok for us to not include the bank
> > information if there is such a requirement.

An error in the Bus Unit can trigger a watchdog timeout
and cause a Load-Store error as a "follow-up error". This in turn
may trigger another "follow-up error" in the memory controller
or in the Data or Instruction Cache depending on what the CPU
tries to do.

I think we should mark the 'struct mcinfo_global' as a kind of header for
each error. All following information describes the error (including the
follow-up errors) and all recover actions. This gives us the flexibility
to get as much information as possible and allows us to do
as many recover actions as necessary instead of just one.
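
A minimal sketch of what such a record could look like, assuming fixed upper limits and simplified fields. None of the names below come from the Xen headers or from the proposal; they are made up for illustration. One global header is followed by several bank entries (root cause plus follow-up errors) and several recover actions, each carrying the cpu/bank it relates to.

```c
#include <stdint.h>

#define EX_MAX_BANKS   8  /* illustrative limits, not from the proposal */
#define EX_MAX_ACTIONS 4

struct ex_bank_info {            /* simplified stand-in for mcinfo_bank */
    uint16_t cpu;
    uint16_t bank;
    uint64_t status;
};

struct ex_recover_info {         /* hypothetical mcinfo_recover */
    uint16_t cpu;                /* source of the error that led to the action */
    uint16_t bank;
    uint8_t  action_type;        /* e.g. page offline, CPU offline */
};

struct ex_error_record {
    uint64_t gstatus;            /* stand-in for struct mcinfo_global */
    uint8_t  nr_banks;
    uint8_t  nr_actions;
    struct ex_bank_info    banks[EX_MAX_BANKS];
    struct ex_recover_info actions[EX_MAX_ACTIONS];
};

/* Append one bank entry; returns 0 on success, -1 if the record is full. */
static int ex_add_bank(struct ex_error_record *r, uint16_t cpu,
                       uint16_t bank, uint64_t status)
{
    if (r->nr_banks >= EX_MAX_BANKS)
        return -1;
    r->banks[r->nr_banks].cpu = cpu;
    r->banks[r->nr_banks].bank = bank;
    r->banks[r->nr_banks].status = status;
    r->nr_banks++;
    return 0;
}

/* Append one recover action; returns 0 on success, -1 if the record is full. */
static int ex_add_action(struct ex_error_record *r, uint16_t cpu,
                         uint16_t bank, uint8_t action_type)
{
    if (r->nr_actions >= EX_MAX_ACTIONS)
        return -1;
    r->actions[r->nr_actions].cpu = cpu;
    r->actions[r->nr_actions].bank = bank;
    r->actions[r->nr_actions].action_type = action_type;
    r->nr_actions++;
    return 0;
}
```

With this shape the number of recover actions is independent of the number of banks, which is the flexibility asked for above.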

Christoph



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

* RE: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-10 19:08                                                     ` Christoph Egger
@ 2009-03-12 15:52                                                       ` Jiang, Yunhong
  2009-03-16 16:27                                                         ` Frank van der Linden
  0 siblings, 1 reply; 45+ messages in thread
From: Jiang, Yunhong @ 2009-03-12 15:52 UTC (permalink / raw)
  To: Christoph Egger
  Cc: xen-devel, Gavin Maltby, Ke, Liping, Frank.Vanderlinden@Sun.COM,
	Keir Fraser, Kleen, Andi

Christoph, sorry for the late response. Please see my inline replies.

>Ah, I see. The registers of our memory controller are in the
>PCI config space. It's no PCI-E device.

That's great.

>
>> >>> For the PCI access, I'd prefer to have xen to control all these, i.e.
>> >>> even if dom0 want to disable the L3 cache, it is done through a
>> >>> hypercall. The reason is, Xen control the CPU, so keep it in Xen will
>> >>> make things simpler.
>> >>>
>> >>> Of course, it is ok for me too, if you want to keep Xen for #MC handler
>> >>> and Dom0 for CE handler.
>> >>
>> >> We still need to define the rules to prevent interference and
>> >> clarify how to
>> >> deal with Dom0/DomU going wild and breaking the rules.
>> >
>> > As discussed previously, we don't need to be concerned about DomU;
>> > all configuration space access from domU will be intercepted by dom0.
>> >
>> > For Dom0, since currently all PCI access to 0xcf8/cfc will be
>> > intercepted by Xen, Xen can do checking. We can achieve the
>> > same checking for mmconfig if we remove that range from dom0. But
>> > I have to say I'm not sure we need to be too concerned about what
>> > will happen when dom0 goes wild (after all, a crash in dom0
>> > will lose everything), especially as interference with such access
>> > will not cause a security issue (please correct me if I'm wrong).
>
>This sounds like an assumption that an IOMMU is always available.

Xen's PCI access does not require an IOMMU; it is in arch/x86/pci.c.
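
As background for the 0xcf8/0xcfc discussion, here is a minimal sketch of how an address for the legacy port-based configuration mechanism is formed, assuming the standard PCI configuration mechanism #1. It is not the code from arch/x86/pci.c; the function names are made up for illustration and the port I/O itself is omitted.

```c
#include <stdint.h>

#define PCI_CONFIG_ADDRESS 0xCF8u  /* address port */
#define PCI_CONFIG_DATA    0xCFCu  /* data port */

/*
 * Hypothetical helper: build the 32-bit value written to the address
 * port. Bit 31 enables the access, then bus, device, function and the
 * dword-aligned register offset follow.
 */
static inline uint32_t pci_conf1_address(uint8_t bus, uint8_t dev,
                                         uint8_t func, uint16_t reg)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1Fu) << 11)
         | ((uint32_t)(func & 0x7u) << 8)
         | (uint32_t)(reg & 0xFCu);
}

/* Only the first 256 bytes of config space fit in this encoding. */
static inline int pci_conf1_reachable(uint16_t reg)
{
    return reg < 256;
}
```

The 8-bit register field is why the 256 byte versus 4K question earlier in the thread matters: offsets at or above 256 need the memory-mapped (mmconfig) path instead.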

>> > I think the assumption here is a recover action will be
>> > triggered only by one bank. For example, we offline page
>> > because one MC bank tell us that page is broken.
>
>Only if the bank is the one from the memory controller.
>What if the bank is the Data or Instruction Cache ?
>
>> > The "follow-up errors" are interesting to me, do you
>> > have any example? It's ok for us to not include the bank
>> > information if there is such a requirement.
>
>An error in the Bus Unit can trigger a watchdog timeout
>and cause a Load-Store error as a "follow-up error". This in turn
>may trigger another "follow-up error" in the memory controller
>or in the Data or Instruction Cache depending on what the CPU
>tries to do.

Hmm, so will these follow-up errors be in the same bank or in different banks? If in different banks, how can the MCE handler know they are related, or should the MCE handler even know about the relationship (I didn't find such code in the current implementation)? Or do you mean we need to provide the relationship because Dom0 needs such information?

>
>I think we should mark the 'struct mcinfo_global' as a kind of header for
>each error. All following information describes the error (including the
>follow-up errors) and all recover actions. This gives us the flexibility
>to get as much information as possible and allows us to do
>as many recover actions as necessary instead of just one.

I think your original proposal can also meet this purpose, i.e. include the mc_recover_info, and we still need to pass all mc_bank info to dom0 for telemetry. If you prefer this one, can you please define the interface? Gavin/Frank, do you have any thoughts on this change?

Thanks
-- Yunhong Jiang

* Re: Re: [RFC] RAS(Part II)--MCA enalbing in XEN
  2009-03-12 15:52                                                       ` Jiang, Yunhong
@ 2009-03-16 16:27                                                         ` Frank van der Linden
  0 siblings, 0 replies; 45+ messages in thread
From: Frank van der Linden @ 2009-03-16 16:27 UTC (permalink / raw)
  To: Jiang, Yunhong
  Cc: Christoph Egger, xen-devel, Ke, Liping, Gavin Maltby,
	Keir Fraser, Kleen, Andi

Jiang, Yunhong wrote:
> Christoph Egger wrote:
>> I think we should mark the 'struct mcinfo_global' as a kind of header for
>> each error. All following information describes the error (including the
>> follow-up errors) and all recover actions. This gives us the flexibility
>> to get as much information as possible and allows us to do
>> as many recover actions as necessary instead of just one.
>
> I think your original proposal can also meet this purpose, i.e. include the mc_recover_info, and we still need to pass all mc_bank info to dom0 for telemetry. If you prefer this one, can you please define the interface? Gavin/Frank, do you have any thoughts on this change?

Sorry about the slow reply.

Our changes to the MCE code (to combine the AMD and Intel code as much
as possible, and use a transactional approach to the telemetry) already
pretty much use mc_global as a header. With our code, dom0 retrieves
one mcinfo structure, with one global structure (which always comes
first, but that's not required).

In other words, using mc_global as kind of a header to the mcinfo data 
is fine, since we're already doing that.

And, since we're talking about transactions with one mcinfo structure at 
a time (with one mc_global structure), the recover_info structures can 
be separate from the bank structures.
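
Since the global structure usually comes first but is not required to, a consumer would look entries up by type rather than by position. A minimal sketch, assuming each entry starts with a small (type, size) header; the layout and names below are assumptions made for illustration and are not the actual mcinfo definitions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-entry header. */
struct ex_entry_hdr {
    uint16_t type;
    uint16_t size;  /* total entry size in bytes, header included */
};

/*
 * Walk a buffer of back-to-back entries and return the byte offset of
 * the first entry with the wanted type, or -1 if there is none or the
 * buffer is malformed.
 */
static long ex_find_entry(const uint8_t *buf, size_t len, uint16_t wanted)
{
    size_t off = 0;

    while (len - off >= sizeof(struct ex_entry_hdr)) {
        struct ex_entry_hdr h;

        /* Copy the header out to avoid unaligned access. */
        memcpy(&h, buf + off, sizeof(h));
        if (h.size < sizeof(h) || h.size > len - off)
            return -1;  /* entry would overrun the buffer */
        if (h.type == wanted)
            return (long)off;
        off += h.size;
    }
    return -1;
}
```

The same walk would find separate recover_info entries, so they can live next to the bank entries instead of inside them.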

- Frank

end of thread, other threads:[~2009-03-16 16:27 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-16  5:35 [RFC] RAS(Part II)--MCA enalbing in XEN Ke, Liping
2009-02-16 13:34 ` Christoph Egger
2009-02-16 14:18   ` Christoph Egger
2009-02-16 15:03     ` Keir Fraser
2009-02-16 15:19       ` Jiang, Yunhong
2009-02-16 17:58       ` Frank Van Der Linden
2009-02-17  5:50         ` Frank Van Der Linden
2009-02-17  6:44           ` Jiang, Yunhong
2009-02-17  6:53           ` Jiang, Yunhong
2009-02-17  6:41         ` Jiang, Yunhong
2009-02-18 18:05           ` Christoph Egger
2009-02-19  9:13             ` Jiang, Yunhong
2009-02-19 16:25               ` Christoph Egger
2009-02-20  2:53                 ` Jiang, Yunhong
2009-02-20 21:01                   ` Frank van der Linden
2009-02-23  9:01                     ` Jiang, Yunhong
2009-02-24 18:53                       ` Frank van der Linden
     [not found]                         ` <2E9E6F5F5978EF44A8590E339E888CF988279945@irsmsx503.ger.corp.intel.com>
2009-02-24 19:07                           ` Frank van der Linden
2009-02-25  2:26                             ` Jiang, Yunhong
2009-02-25 10:37                             ` Christoph Egger
     [not found]                             ` <2E9E6F5F5978EF44A8590E339E888CF98827996D@irsmsx503.ger.corp.intel.com>
2009-02-24 20:47                               ` Frank van der Linden
2009-02-25  2:25                                 ` Jiang, Yunhong
2009-02-25 12:19                                   ` Christoph Egger
2009-02-25 17:32                                     ` Frank van der Linden
2009-02-26  2:16                                       ` Jiang, Yunhong
2009-03-02 14:58                                         ` Christoph Egger
2009-03-02 16:15                                           ` Jiang, Yunhong
2009-03-02  5:51                                       ` Jiang, Yunhong
2009-03-02 14:51                                         ` Christoph Egger
2009-03-02 16:09                                           ` Jiang, Yunhong
2009-03-02 17:47                                         ` Frank van der Linden
2009-03-05  4:45                                           ` Jiang, Yunhong
2009-03-05  8:31                                           ` Jiang, Yunhong
2009-03-05 14:53                                             ` Christoph Egger
2009-03-05 15:19                                               ` Jiang, Yunhong
2009-03-05 17:28                                                 ` Christoph Egger
2009-03-06  2:11                                                   ` Jiang, Yunhong
2009-03-10  1:19                                                   ` Jiang, Yunhong
2009-03-10 19:08                                                     ` Christoph Egger
2009-03-12 15:52                                                       ` Jiang, Yunhong
2009-03-16 16:27                                                         ` Frank van der Linden
2009-02-25 22:30                                     ` Gavin Maltby
2009-02-25  2:31                               ` Jiang, Yunhong
2009-02-25 10:57                               ` Christoph Egger
2009-02-16 15:05     ` Jiang, Yunhong
