* [PATCH 00/16] Nested virtualization for VMX
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: eddie.dong

This patch set is the updated version of nested virtualization
support for VMX, which allows a VMX guest (L1) to run other VMX
guests (L2).

Nested virtualization for VMX is built on homogeneous L1 and L2 for
better performance and minimal emulation. The common code involved
is small and contained in patch 03/16: two flags, one for feature
availability and the other for indicating the current mode.

The userspace components (xend/xm/xl) are not included, since
Christopher's userspace patch has similar coverage. vEPT is not
included either, because it is still work in progress.

Major changes since the last version:
 - addressed Tim's comments on error handling and other issues
 - split the context switch into smaller pieces, with some
   restructuring for better readability
 - updated interrupt handling and rewrote comments
 - moved CPUID handling into userspace
 - etc.

The patch set includes the following patches.

[PATCH 01/16] vmx: nest: rename host_vmcs
[PATCH 02/16] vmx: nest: wrapper for control update
[PATCH 03/16] vmx: nest: nested availability and status flags
[PATCH 04/16] vmx: nest: nested control structure
[PATCH 05/16] vmx: nest: virtual vmcs layout
[PATCH 06/16] vmx: nest: handling VMX instruction exits
[PATCH 07/16] vmx: nest: switch current vmcs
[PATCH 08/16] vmx: nest: vmresume/vmlaunch
[PATCH 09/16] vmx: nest: shadow controls
[PATCH 10/16] vmx: nest: L1 <-> L2 context switch
[PATCH 11/16] vmx: nest: interrupt handling
[PATCH 12/16] vmx: nest: VMExit handler in L2
[PATCH 13/16] vmx: nest: L2 tsc
[PATCH 14/16] vmx: nest: CR0.TS and #NM
[PATCH 15/16] vmx: nest: capability reporting MSRs
[PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE

Thanks,
Qing He


* [PATCH 01/16] vmx: nest: rename host_vmcs
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

The VMCS region used for VMXON is named host_vmcs, which is
somewhat misleading in a nested virtualization context; rename it
to vmxon_vmcs.

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r d6a8d49f3526 xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen/arch/x86/hvm/vmx/vmcs.c	Mon Jul 26 14:42:21 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmcs.c	Wed Aug 04 16:30:40 2010 +0800
@@ -67,7 +67,7 @@
 u64 vmx_ept_vpid_cap __read_mostly;
 bool_t cpu_has_vmx_ins_outs_instr_info __read_mostly;
 
-static DEFINE_PER_CPU_READ_MOSTLY(struct vmcs_struct *, host_vmcs);
+static DEFINE_PER_CPU_READ_MOSTLY(struct vmcs_struct *, vmxon_vmcs);
 static DEFINE_PER_CPU(struct vmcs_struct *, current_vmcs);
 static DEFINE_PER_CPU(struct list_head, active_vmcs_list);
 
@@ -427,11 +427,11 @@
 
 int vmx_cpu_up_prepare(unsigned int cpu)
 {
-    if ( per_cpu(host_vmcs, cpu) != NULL )
+    if ( per_cpu(vmxon_vmcs, cpu) != NULL )
         return 0;
 
-    per_cpu(host_vmcs, cpu) = vmx_alloc_vmcs();
-    if ( per_cpu(host_vmcs, cpu) != NULL )
+    per_cpu(vmxon_vmcs, cpu) = vmx_alloc_vmcs();
+    if ( per_cpu(vmxon_vmcs, cpu) != NULL )
         return 0;
 
     printk("CPU%d: Could not allocate host VMCS\n", cpu);
@@ -440,8 +440,8 @@
 
 void vmx_cpu_dead(unsigned int cpu)
 {
-    vmx_free_vmcs(per_cpu(host_vmcs, cpu));
-    per_cpu(host_vmcs, cpu) = NULL;
+    vmx_free_vmcs(per_cpu(vmxon_vmcs, cpu));
+    per_cpu(vmxon_vmcs, cpu) = NULL;
 }
 
 int vmx_cpu_up(void)
@@ -498,7 +498,7 @@
     if ( (rc = vmx_cpu_up_prepare(cpu)) != 0 )
         return rc;
 
-    switch ( __vmxon(virt_to_maddr(this_cpu(host_vmcs))) )
+    switch ( __vmxon(virt_to_maddr(this_cpu(vmxon_vmcs))) )
     {
     case -2: /* #UD or #GP */
         if ( bios_locked &&


* [PATCH 02/16] vmx: nest: wrapper for control update
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

In nested virtualization, the L0 controls may differ from the
controls in the physical VMCS. Explicitly maintain the guest
controls in variables and use wrappers for control updates, rather
than relying on the physical control values.
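
The pattern, in a minimal sketch (the wrapper below is introduced by
this patch; the call site is only illustrative):

    /* Callers modify the cached value, then push it through one
     * wrapper, so a nested implementation can later intercept the
     * physical write in a single place. */
    v->arch.hvm_vmx.exec_control |= CPU_BASED_MOV_DR_EXITING;
    vmx_update_cpu_exec_control(v);   /* does the __vmwrite() */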

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 905ca9cc0596 xen/arch/x86/hvm/vmx/intr.c
--- a/xen/arch/x86/hvm/vmx/intr.c	Wed Aug 04 16:30:40 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/intr.c	Thu Aug 05 15:32:24 2010 +0800
@@ -106,7 +106,7 @@
     if ( !(*cpu_exec_control & ctl) )
     {
         *cpu_exec_control |= ctl;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, *cpu_exec_control);
+        vmx_update_cpu_exec_control(v);
     }
 }
 
@@ -121,7 +121,7 @@
     if ( unlikely(v->arch.hvm_vcpu.single_step) )
     {
         v->arch.hvm_vmx.exec_control |= CPU_BASED_MONITOR_TRAP_FLAG;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         return;
     }
 
diff -r 905ca9cc0596 xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen/arch/x86/hvm/vmx/vmcs.c	Wed Aug 04 16:30:40 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmcs.c	Thu Aug 05 15:32:24 2010 +0800
@@ -839,10 +839,10 @@
     __vmwrite(VMCS_LINK_POINTER_HIGH, ~0UL);
 #endif
 
-    __vmwrite(EXCEPTION_BITMAP,
-              HVM_TRAP_MASK
+    v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
               | (paging_mode_hap(d) ? 0 : (1U << TRAP_page_fault))
-              | (1U << TRAP_no_device));
+              | (1U << TRAP_no_device);
+    vmx_update_exception_bitmap(v);
 
     v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET;
     hvm_update_guest_cr(v, 0);
diff -r 905ca9cc0596 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Aug 04 16:30:40 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Thu Aug 05 15:32:24 2010 +0800
@@ -385,6 +385,22 @@
 
 #endif /* __i386__ */
 
+void vmx_update_cpu_exec_control(struct vcpu *v)
+{
+    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+}
+
+void vmx_update_secondary_exec_control(struct vcpu *v)
+{
+    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
+              v->arch.hvm_vmx.secondary_exec_control);
+}
+
+void vmx_update_exception_bitmap(struct vcpu *v)
+{
+    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
+}
+
 static int vmx_guest_x86_mode(struct vcpu *v)
 {
     unsigned int cs_ar_bytes;
@@ -408,7 +424,7 @@
     /* Clear the DR dirty flag and re-enable intercepts for DR accesses. */
     v->arch.hvm_vcpu.flag_dr_dirty = 0;
     v->arch.hvm_vmx.exec_control |= CPU_BASED_MOV_DR_EXITING;
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    vmx_update_cpu_exec_control(v);
 
     v->arch.guest_context.debugreg[0] = read_debugreg(0);
     v->arch.guest_context.debugreg[1] = read_debugreg(1);
@@ -622,7 +638,8 @@
 static void vmx_fpu_enter(struct vcpu *v)
 {
     setup_fpu(v);
-    __vm_clear_bit(EXCEPTION_BITMAP, TRAP_no_device);
+    v->arch.hvm_vmx.exception_bitmap &= ~(1u << TRAP_no_device);
+    vmx_update_exception_bitmap(v);
     v->arch.hvm_vmx.host_cr0 &= ~X86_CR0_TS;
     __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
 }
@@ -648,7 +665,8 @@
     {
         v->arch.hvm_vcpu.hw_cr[0] |= X86_CR0_TS;
         __vmwrite(GUEST_CR0, v->arch.hvm_vcpu.hw_cr[0]);
-        __vm_set_bit(EXCEPTION_BITMAP, TRAP_no_device);
+        v->arch.hvm_vmx.exception_bitmap |= (1u << TRAP_no_device);
+        vmx_update_exception_bitmap(v);
     }
 }
 
@@ -954,7 +972,7 @@
     v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
     if ( enable )
         v->arch.hvm_vmx.exec_control |= CPU_BASED_RDTSC_EXITING;
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    vmx_update_cpu_exec_control(v);
     vmx_vmcs_exit(v);
 }
 
@@ -1047,7 +1065,7 @@
 
 void vmx_update_debug_state(struct vcpu *v)
 {
-    unsigned long intercepts, mask;
+    unsigned long mask;
 
     ASSERT(v == current);
 
@@ -1055,12 +1073,11 @@
     if ( !cpu_has_monitor_trap_flag )
         mask |= 1u << TRAP_debug;
 
-    intercepts = __vmread(EXCEPTION_BITMAP);
     if ( v->arch.hvm_vcpu.debug_state_latch )
-        intercepts |= mask;
+        v->arch.hvm_vmx.exception_bitmap |= mask;
     else
-        intercepts &= ~mask;
-    __vmwrite(EXCEPTION_BITMAP, intercepts);
+        v->arch.hvm_vmx.exception_bitmap &= ~mask;
+    vmx_update_exception_bitmap(v);
 }
 
 static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
@@ -1087,7 +1104,7 @@
             v->arch.hvm_vmx.exec_control &= ~cr3_ctls;
             if ( !hvm_paging_enabled(v) )
                 v->arch.hvm_vmx.exec_control |= cr3_ctls;
-            __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+            vmx_update_cpu_exec_control(v);
 
             /* Changing CR0.PE can change some bits in real CR4. */
             vmx_update_guest_cr(v, 4);
@@ -1122,7 +1139,8 @@
                     vmx_set_segment_register(v, s, &reg[s]);
                 v->arch.hvm_vcpu.hw_cr[4] |= X86_CR4_VME;
                 __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]);
-                __vmwrite(EXCEPTION_BITMAP, 0xffffffff);
+                v->arch.hvm_vmx.exception_bitmap = 0xffffffff;
+                vmx_update_exception_bitmap(v);
             }
             else 
             {
@@ -1134,11 +1152,11 @@
                     ((v->arch.hvm_vcpu.hw_cr[4] & ~X86_CR4_VME)
                      |(v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_VME));
                 __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]);
-                __vmwrite(EXCEPTION_BITMAP, 
-                          HVM_TRAP_MASK
+                v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
                           | (paging_mode_hap(v->domain) ?
                              0 : (1U << TRAP_page_fault))
-                          | (1U << TRAP_no_device));
+                          | (1U << TRAP_no_device);
+                vmx_update_exception_bitmap(v);
                 vmx_update_debug_state(v);
             }
         }
@@ -1544,7 +1562,7 @@
 
     /* Allow guest direct access to DR registers */
     v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MOV_DR_EXITING;
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    vmx_update_cpu_exec_control(v);
 }
 
 static void vmx_invlpg_intercept(unsigned long vaddr)
@@ -1928,18 +1946,18 @@
 void vmx_vlapic_msr_changed(struct vcpu *v)
 {
     struct vlapic *vlapic = vcpu_vlapic(v);
-    uint32_t ctl;
 
     if ( !cpu_has_vmx_virtualize_apic_accesses )
         return;
 
     vmx_vmcs_enter(v);
-    ctl  = __vmread(SECONDARY_VM_EXEC_CONTROL);
-    ctl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+    v->arch.hvm_vmx.secondary_exec_control
+        &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
     if ( !vlapic_hw_disabled(vlapic) &&
          (vlapic_base_address(vlapic) == APIC_DEFAULT_PHYS_BASE) )
-        ctl |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
-    __vmwrite(SECONDARY_VM_EXEC_CONTROL, ctl);
+        v->arch.hvm_vmx.secondary_exec_control
+            |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+    vmx_update_secondary_exec_control(v);
     vmx_vmcs_exit(v);
 }
 
@@ -2469,14 +2487,12 @@
     case EXIT_REASON_PENDING_VIRT_INTR:
         /* Disable the interrupt window. */
         v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
-                  v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         break;
     case EXIT_REASON_PENDING_VIRT_NMI:
         /* Disable the NMI window. */
         v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
-                  v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         break;
     case EXIT_REASON_TASK_SWITCH: {
         const enum hvm_task_switch_reason reasons[] = {
@@ -2627,7 +2643,7 @@
 
     case EXIT_REASON_MONITOR_TRAP_FLAG:
         v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
-        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+        vmx_update_cpu_exec_control(v);
         if ( v->domain->debugger_attached && v->arch.hvm_vcpu.single_step )
             domain_pause_for_debugger();
         break;
@@ -2677,16 +2693,14 @@
             /* VPID was disabled: now enabled. */
             curr->arch.hvm_vmx.secondary_exec_control |=
                 SECONDARY_EXEC_ENABLE_VPID;
-            __vmwrite(SECONDARY_VM_EXEC_CONTROL,
-                      curr->arch.hvm_vmx.secondary_exec_control);
+            vmx_update_secondary_exec_control(curr);
         }
         else if ( old_asid && !new_asid )
         {
             /* VPID was enabled: now disabled. */
             curr->arch.hvm_vmx.secondary_exec_control &=
                 ~SECONDARY_EXEC_ENABLE_VPID;
-            __vmwrite(SECONDARY_VM_EXEC_CONTROL,
-                      curr->arch.hvm_vmx.secondary_exec_control);
+            vmx_update_secondary_exec_control(curr);
         }
     }
 
diff -r 905ca9cc0596 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Aug 04 16:30:40 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Thu Aug 05 15:32:24 2010 +0800
@@ -97,6 +97,7 @@
     /* Cache of cpu execution control. */
     u32                  exec_control;
     u32                  secondary_exec_control;
+    u32                  exception_bitmap;
 
 #ifdef __x86_64__
     struct vmx_msr_state msr_state;
diff -r 905ca9cc0596 xen/include/asm-x86/hvm/vmx/vmx.h
--- a/xen/include/asm-x86/hvm/vmx/vmx.h	Wed Aug 04 16:30:40 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h	Thu Aug 05 15:32:24 2010 +0800
@@ -60,6 +60,9 @@
 void vmx_vlapic_msr_changed(struct vcpu *v);
 void vmx_realmode(struct cpu_user_regs *regs);
 void vmx_update_debug_state(struct vcpu *v);
+void vmx_update_cpu_exec_control(struct vcpu *v);
+void vmx_update_secondary_exec_control(struct vcpu *v);
+void vmx_update_exception_bitmap(struct vcpu *v);
 
 /*
  * Exit Reasons


* [PATCH 03/16] vmx: nest: nested availability and status flags
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

These are the vendor-neutral availability and status flags of nested
virtualization.

The availability HVM parameter can be used to disable all reporting
and functionality of nested virtualization, improving guest security
in certain circumstances.

The per-vcpu flag in_nesting indicates the fundamental status:
the current mode.
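
As a sketch of how the two flags are consumed by later patches in
this series (both uses appear further down the thread):

    /* Feature availability, controlled by HVM_PARAM_NESTEDHVM: */
    if ( !is_nested_avail(v->domain) )
        goto invalid_op;     /* e.g. VMXON with nesting disabled */

    /* Current mode, per vcpu: set while running in L2 context */
    if ( v->arch.hvm_vcpu.in_nesting )
        vmx_nest_update_exec_control(v, v->arch.hvm_vmx.exec_control);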

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---
diff -r 11c98ab76326 xen/include/asm-x86/hvm/hvm.h
--- a/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 20:35:38 2010 +0800
+++ b/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 20:36:19 2010 +0800
@@ -250,6 +250,10 @@
 #define is_viridian_domain(_d)                                             \
  (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN]))
 
+#define is_nested_avail(_d)                                                \
+ (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_NESTEDHVM]))
+
+
 void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
                                    unsigned int *ecx, unsigned int *edx);
 void hvm_migrate_timers(struct vcpu *v);
diff -r 11c98ab76326 xen/include/asm-x86/hvm/vcpu.h
--- a/xen/include/asm-x86/hvm/vcpu.h	Wed Sep 08 20:35:38 2010 +0800
+++ b/xen/include/asm-x86/hvm/vcpu.h	Wed Sep 08 20:36:19 2010 +0800
@@ -71,6 +71,8 @@
     bool_t              debug_state_latch;
     bool_t              single_step;
 
+    bool_t              in_nesting;
+
     u64                 asid_generation;
     u32                 asid;
 
diff -r 11c98ab76326 xen/include/public/hvm/params.h
--- a/xen/include/public/hvm/params.h	Wed Sep 08 20:35:38 2010 +0800
+++ b/xen/include/public/hvm/params.h	Wed Sep 08 20:36:19 2010 +0800
@@ -113,6 +113,9 @@
 #define HVM_PARAM_CONSOLE_PFN    17
 #define HVM_PARAM_CONSOLE_EVTCHN 18
 
-#define HVM_NR_PARAMS          19
+/* Boolean: Enable nested virtualization (hvm only) */
+#define HVM_PARAM_NESTEDHVM    19
+
+#define HVM_NR_PARAMS          20
 
 #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */


* [PATCH 04/16] vmx: nest: nested control structure
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

Add v->arch.hvm_vmx.nest as the nested virtualization control
structure: hvmcs is the vcpu's own (L1) VMCS, vvmcs is the virtual
VMCS that the guest operates on, and svmcs is the effective shadow
VMCS used to actually run L2.

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---
diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/nest.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:03:41 2010 +0800
@@ -0,0 +1,45 @@
+/*
+ * nest.h: nested virtualization for VMX.
+ *
+ * Copyright (c) 2010, Intel Corporation.
+ * Author: Qing He <qing.he@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+#ifndef __ASM_X86_HVM_NEST_H__
+#define __ASM_X86_HVM_NEST_H__
+
+struct vmcs_struct;
+
+struct vmx_nest_struct {
+    paddr_t              guest_vmxon_pa;
+
+    /* Saved host vmcs for vcpu itself */
+    struct vmcs_struct  *hvmcs;
+
+    /*
+     * Guest's `current vmcs' of vcpu
+     *  - gvmcs_pa: guest VMCS region physical address
+     *  - vvmcs:    (guest) virtual vmcs
+     *  - svmcs:    effective vmcs for the guest of this vcpu
+     *  - valid:    launch state: invalid on clear, valid on ld
+     */
+    paddr_t              gvmcs_pa;
+    void                *vvmcs;
+    struct vmcs_struct  *svmcs;
+    int                  vmcs_valid;
+};
+
+#endif /* __ASM_X86_HVM_NEST_H__ */
diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:00:00 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:03:41 2010 +0800
@@ -22,6 +22,7 @@
 #include <asm/config.h>
 #include <asm/hvm/io.h>
 #include <asm/hvm/vpmu.h>
+#include <asm/hvm/vmx/nest.h>
 
 extern void vmcs_dump_vcpu(struct vcpu *v);
 extern void setup_vmcs_dump(void);
@@ -99,6 +100,9 @@
     u32                  secondary_exec_control;
     u32                  exception_bitmap;
 
+    /* nested virtualization */
+    struct vmx_nest_struct nest;
+
 #ifdef __x86_64__
     struct vmx_msr_state msr_state;
     unsigned long        shadow_gs;


* [PATCH 05/16] vmx: nest: virtual vmcs layout
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

Since the physical VMCS layout is opaque, a customized virtual VMCS
(vvmcs) is introduced. It converts a VMCS field encoding into an
offset into the vvmcs page.
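
A worked example of the conversion, using the architectural field
encodings (the arithmetic matches vvmcs_offset() below):

    /* GUEST_CS_SELECTOR has encoding 0x0802:
     *   width = 0 (16-bit), type = 2 (guest state), index = 1
     *   offset = (1 & 0x1f) | (2 << 5) | (0 << 7) = 0x41
     * so the field occupies u64 slot 0x41 of the vvmcs page.
     * VPID (encoding 0) would collide with slot 0 and is therefore
     * relocated to 0x3f, leaving slot 0 for the VMCS revision. */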

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 5935027e5e70 xen/include/asm-x86/hvm/vmx/vvmcs.h
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/include/asm-x86/hvm/vmx/vvmcs.h	Wed Apr 21 00:57:40 2010 +0800
@@ -0,0 +1,154 @@
+/*
+ * vvmcs.h: virtual VMCS access for nested virtualization.
+ *
+ * Copyright (c) 2010, Intel Corporation.
+ * Author: Qing He <qing.he@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+
+#include <xen/config.h>
+#include <asm/types.h>
+
+/*
+ * Virtual VMCS layout
+ *
+ * Since the physical VMCS layout is unknown, a custom layout is
+ * used for the virtual VMCS seen by the guest. It occupies a 4k
+ * page, and each field is located by a 9-bit offset into a u64[]
+ * array. The offset is laid out as follows, which means every
+ * <width, type> pair has at most 32 fields available.
+ *
+ *             9       7      5               0
+ *             --------------------------------
+ *     offset: | width | type |     index     |
+ *             --------------------------------
+ *
+ * Also, since the lower range <width=0, type={0,1}> has only one
+ * field, VPID, it is moved to a higher offset (63), leaving the
+ * lower range to non-indexed fields like the VMCS revision.
+ *
+ */
+
+#define VVMCS_REVISION 0x40000001u
+
+struct vvmcs_header {
+    u32 revision;
+    u32 abort;
+};
+
+union vmcs_encoding {
+    struct {
+        u32 access_type : 1;
+        u32 index : 9;
+        u32 type : 2;
+        u32 rsv1 : 1;
+        u32 width : 2;
+        u32 rsv2 : 17;
+    };
+    u32 word;
+};
+
+enum vvmcs_encoding_width {
+    VVMCS_WIDTH_16 = 0,
+    VVMCS_WIDTH_64,
+    VVMCS_WIDTH_32,
+    VVMCS_WIDTH_NATURAL,
+};
+
+enum vvmcs_encoding_type {
+    VVMCS_TYPE_CONTROL = 0,
+    VVMCS_TYPE_RO,
+    VVMCS_TYPE_GSTATE,
+    VVMCS_TYPE_HSTATE,
+};
+
+static inline int vvmcs_offset(u32 width, u32 type, u32 index)
+{
+    int offset;
+
+    offset = (index & 0x1f) | type << 5 | width << 7;
+
+    if ( offset == 0 )    /* vpid */
+        offset = 0x3f;
+
+    return offset;
+}
+
+static inline u64 __get_vvmcs(void *vvmcs, u32 vmcs_encoding)
+{
+    union vmcs_encoding enc;
+    u64 *content = (u64 *) vvmcs;
+    int offset;
+    u64 res;
+
+    enc.word = vmcs_encoding;
+    offset = vvmcs_offset(enc.width, enc.type, enc.index);
+    res = content[offset];
+
+    switch ( enc.width ) {
+    case VVMCS_WIDTH_16:
+        res &= 0xffff;
+        break;
+    case VVMCS_WIDTH_64:
+        if ( enc.access_type )
+            res >>= 32;
+        break;
+    case VVMCS_WIDTH_32:
+        res &= 0xffffffff;
+        break;
+    case VVMCS_WIDTH_NATURAL:
+    default:
+        break;
+    }
+
+    return res;
+}
+
+static inline void __set_vvmcs(void *vvmcs, u32 vmcs_encoding, u64 val)
+{
+    union vmcs_encoding enc;
+    u64 *content = (u64 *) vvmcs;
+    int offset;
+    u64 res;
+
+    enc.word = vmcs_encoding;
+    offset = vvmcs_offset(enc.width, enc.type, enc.index);
+    res = content[offset];
+
+    switch ( enc.width ) {
+    case VVMCS_WIDTH_16:
+        res = val & 0xffff;
+        break;
+    case VVMCS_WIDTH_64:
+        if ( enc.access_type )
+        {
+            res &= 0xffffffff;
+            res |= val << 32;
+        }
+        else
+            res = val;
+        break;
+    case VVMCS_WIDTH_32:
+        res = val & 0xffffffff;
+        break;
+    case VVMCS_WIDTH_NATURAL:
+    default:
+        res = val;
+        break;
+    }
+
+    content[offset] = res;
+}


* [PATCH 06/16] vmx: nest: handling VMX instruction exits
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

Add a VMX instruction decoder and handle the simple VMX
instructions, except vmlaunch/vmresume and invept.
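
For reference, the decoder recovers a memory operand from
VMX_INSTRUCTION_INFO and EXIT_QUALIFICATION; a worked example of the
arithmetic in decode_vmx_inst() below:

    /* e.g. vmptrld [rax + rcx*4 + disp]:
     *   base  = reg_read(regs, info.fields.base_reg)   -> rax
     *   index = reg_read(regs, info.fields.index_reg)  -> rcx
     *   scale = 1 << info.fields.scaling               -> 4
     *   disp  = __vmread(EXIT_QUALIFICATION)
     * base + index * scale + disp is checked against seg.limit,
     * then seg.base is added to form the final linear address. */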

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r f1c1d3077337 xen/arch/x86/hvm/vmx/Makefile
--- a/xen/arch/x86/hvm/vmx/Makefile	Wed Sep 08 21:03:46 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/Makefile	Wed Sep 08 21:30:01 2010 +0800
@@ -4,3 +4,4 @@
 obj-y += vmcs.o
 obj-y += vmx.o
 obj-y += vpmu_core2.o
+obj-y += nest.o
diff -r f1c1d3077337 xen/arch/x86/hvm/vmx/nest.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 21:30:01 2010 +0800
@@ -0,0 +1,635 @@
+/*
+ * nest.c: nested virtualization for VMX.
+ *
+ * Copyright (c) 2010, Intel Corporation.
+ * Author: Qing He <qing.he@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ */
+
+#include <xen/config.h>
+#include <asm/types.h>
+#include <asm/p2m.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <asm/hvm/vmx/vvmcs.h>
+#include <asm/hvm/vmx/nest.h>
+
+/* map the vvmcs instead of copying the whole page */
+#define CONFIG_VVMCS_MAPPING 0
+
+/*
+ * VMX instructions support functions
+ */
+
+enum vmx_regs_enc {
+    VMX_REG_RAX,
+    VMX_REG_RCX,
+    VMX_REG_RDX,
+    VMX_REG_RBX,
+    VMX_REG_RSP,
+    VMX_REG_RBP,
+    VMX_REG_RSI,
+    VMX_REG_RDI,
+#ifdef CONFIG_X86_64
+    VMX_REG_R8,
+    VMX_REG_R9,
+    VMX_REG_R10,
+    VMX_REG_R11,
+    VMX_REG_R12,
+    VMX_REG_R13,
+    VMX_REG_R14,
+    VMX_REG_R15,
+#endif
+};
+
+enum vmx_sregs_enc {
+    VMX_SREG_ES,
+    VMX_SREG_CS,
+    VMX_SREG_SS,
+    VMX_SREG_DS,
+    VMX_SREG_FS,
+    VMX_SREG_GS,
+};
+
+enum x86_segment sreg_to_index[] = {
+    [VMX_SREG_ES] = x86_seg_es,
+    [VMX_SREG_CS] = x86_seg_cs,
+    [VMX_SREG_SS] = x86_seg_ss,
+    [VMX_SREG_DS] = x86_seg_ds,
+    [VMX_SREG_FS] = x86_seg_fs,
+    [VMX_SREG_GS] = x86_seg_gs,
+};
+
+union vmx_inst_info {
+    struct {
+        unsigned int scaling           :2; /* bit 0-1 */
+        unsigned int __rsvd0           :1; /* bit 2 */
+        unsigned int reg1              :4; /* bit 3-6 */
+        unsigned int addr_size         :3; /* bit 7-9 */
+        unsigned int memreg            :1; /* bit 10 */
+        unsigned int __rsvd1           :4; /* bit 11-14 */
+        unsigned int segment           :3; /* bit 15-17 */
+        unsigned int index_reg         :4; /* bit 18-21 */
+        unsigned int index_reg_invalid :1; /* bit 22 */
+        unsigned int base_reg          :4; /* bit 23-26 */
+        unsigned int base_reg_invalid  :1; /* bit 27 */
+        unsigned int reg2              :4; /* bit 28-31 */
+    } fields;
+    u32 word;
+};
+
+struct vmx_inst_decoded {
+#define VMX_INST_MEMREG_TYPE_MEMORY 0
+#define VMX_INST_MEMREG_TYPE_REG    1
+    int type;
+    union {
+        struct {
+            unsigned long mem;
+            unsigned int  len;
+        };
+        enum vmx_regs_enc reg1;
+    };
+
+    enum vmx_regs_enc reg2;
+};
+
+enum vmx_ops_result {
+    VMSUCCEED,
+    VMFAIL_VALID,
+    VMFAIL_INVALID,
+};
+
+#define CASE_SET_REG(REG, reg)      \
+    case VMX_REG_ ## REG: regs->reg = value; break
+#define CASE_GET_REG(REG, reg)      \
+    case VMX_REG_ ## REG: value = regs->reg; break
+
+#define CASE_EXTEND_SET_REG         \
+    CASE_EXTEND_REG(S)
+#define CASE_EXTEND_GET_REG         \
+    CASE_EXTEND_REG(G)
+
+#ifdef __i386__
+#define CASE_EXTEND_REG(T)
+#else
+#define CASE_EXTEND_REG(T)          \
+    CASE_ ## T ## ET_REG(R8, r8);   \
+    CASE_ ## T ## ET_REG(R9, r9);   \
+    CASE_ ## T ## ET_REG(R10, r10); \
+    CASE_ ## T ## ET_REG(R11, r11); \
+    CASE_ ## T ## ET_REG(R12, r12); \
+    CASE_ ## T ## ET_REG(R13, r13); \
+    CASE_ ## T ## ET_REG(R14, r14); \
+    CASE_ ## T ## ET_REG(R15, r15)
+#endif
+
+static unsigned long reg_read(struct cpu_user_regs *regs,
+                              enum vmx_regs_enc index)
+{
+    unsigned long value = 0;
+
+    switch ( index ) {
+    CASE_GET_REG(RAX, eax);
+    CASE_GET_REG(RCX, ecx);
+    CASE_GET_REG(RDX, edx);
+    CASE_GET_REG(RBX, ebx);
+    CASE_GET_REG(RBP, ebp);
+    CASE_GET_REG(RSI, esi);
+    CASE_GET_REG(RDI, edi);
+    CASE_GET_REG(RSP, esp);
+    CASE_EXTEND_GET_REG;
+    default:
+        break;
+    }
+
+    return value;
+}
+
+static void reg_write(struct cpu_user_regs *regs,
+                      enum vmx_regs_enc index,
+                      unsigned long value)
+{
+    switch ( index ) {
+    CASE_SET_REG(RAX, eax);
+    CASE_SET_REG(RCX, ecx);
+    CASE_SET_REG(RDX, edx);
+    CASE_SET_REG(RBX, ebx);
+    CASE_SET_REG(RBP, ebp);
+    CASE_SET_REG(RSI, esi);
+    CASE_SET_REG(RDI, edi);
+    CASE_SET_REG(RSP, esp);
+    CASE_EXTEND_SET_REG;
+    default:
+        break;
+    }
+}
+
+static int decode_vmx_inst(struct cpu_user_regs *regs,
+                           struct vmx_inst_decoded *decode)
+{
+    struct vcpu *v = current;
+    union vmx_inst_info info;
+    struct segment_register seg;
+    unsigned long base, index, seg_base, disp, offset;
+    int scale;
+
+    info.word = __vmread(VMX_INSTRUCTION_INFO);
+
+    if ( info.fields.memreg ) {
+        decode->type = VMX_INST_MEMREG_TYPE_REG;
+        decode->reg1 = info.fields.reg1;
+    }
+    else
+    {
+        decode->type = VMX_INST_MEMREG_TYPE_MEMORY;
+        hvm_get_segment_register(v, sreg_to_index[info.fields.segment], &seg);
+        /* TODO: segment type check */
+        seg_base = seg.base;
+
+        base = info.fields.base_reg_invalid ? 0 :
+            reg_read(regs, info.fields.base_reg);
+
+        index = info.fields.index_reg_invalid ? 0 :
+            reg_read(regs, info.fields.index_reg);
+
+        scale = 1 << info.fields.scaling;
+
+        disp = __vmread(EXIT_QUALIFICATION);
+
+        offset = base + index * scale + disp;
+        if ( offset > seg.limit )
+            goto gp_fault;
+
+        decode->mem = seg_base + base + index * scale + disp;
+        decode->len = 1 << (info.fields.addr_size + 1);
+    }
+
+    decode->reg2 = info.fields.reg2;
+
+    return X86EMUL_OKAY;
+
+gp_fault:
+    hvm_inject_exception(TRAP_gp_fault, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+static int vmx_inst_check_privilege(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct segment_register cs;
+
+    hvm_get_segment_register(v, x86_seg_cs, &cs);
+
+    if ( !(v->arch.hvm_vcpu.guest_cr[0] & X86_CR0_PE) ||
+         !(v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_VMXE) ||
+         (regs->eflags & X86_EFLAGS_VM) ||
+         (hvm_long_mode_enabled(v) && cs.attr.fields.l == 0) )
+        goto invalid_op;
+
+    if ( (cs.sel & 3) > 0 )
+        goto gp_fault;
+
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+
+gp_fault:
+    hvm_inject_exception(TRAP_gp_fault, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+static void vmreturn(struct cpu_user_regs *regs, enum vmx_ops_result res)
+{
+    unsigned long eflags = regs->eflags;
+    unsigned long mask = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+                         X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF;
+
+    eflags &= ~mask;
+
+    switch ( res ) {
+    case VMSUCCEED:
+        break;
+    case VMFAIL_VALID:
+        /* TODO: error number, useful for guest VMM debugging */
+        eflags |= X86_EFLAGS_ZF;
+        break;
+    case VMFAIL_INVALID:
+    default:
+        eflags |= X86_EFLAGS_CF;
+        break;
+    }
+
+    regs->eflags = eflags;
+}
+
+static int __clear_current_vvmcs(struct vmx_nest_struct *nest)
+{
+    int rc;
+
+    if ( nest->svmcs )
+        __vmpclear(virt_to_maddr(nest->svmcs));
+
+#if !CONFIG_VVMCS_MAPPING
+    rc = hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs, PAGE_SIZE);
+    if ( rc != HVMCOPY_okay )
+        return X86EMUL_EXCEPTION;
+#endif
+
+    nest->vmcs_valid = 0;
+
+    return X86EMUL_OKAY;
+}
+
+/*
+ * VMX instructions handling
+ */
+
+int vmx_nest_handle_vmxon(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    struct vmx_inst_decoded decode;
+    unsigned long gpa = 0;
+    int rc;
+
+    if ( !is_nested_avail(v->domain) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    rc = decode_vmx_inst(regs, &decode);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
+    if ( rc != HVMCOPY_okay )
+        return X86EMUL_EXCEPTION;
+
+    nest->guest_vmxon_pa = gpa;
+    nest->gvmcs_pa = 0;
+    nest->vmcs_valid = 0;
+#if !CONFIG_VVMCS_MAPPING
+    nest->vvmcs = alloc_xenheap_page();
+    if ( !nest->vvmcs )
+    {
+        gdprintk(XENLOG_ERR, "nest: allocation for virtual vmcs failed\n");
+        vmreturn(regs, VMFAIL_INVALID);
+        goto out;
+    }
+#endif
+    nest->svmcs = alloc_xenheap_page();
+    if ( !nest->svmcs )
+    {
+        gdprintk(XENLOG_ERR, "nest: allocation for shadow vmcs failed\n");
+        free_xenheap_page(nest->vvmcs);
+        vmreturn(regs, VMFAIL_INVALID);
+        goto out;
+    }
+
+    /*
+     * `fork' the host vmcs to shadow_vmcs
+     * vmcs_lock is not needed since we are on current
+     */
+    nest->hvmcs = v->arch.hvm_vmx.vmcs;
+    __vmpclear(virt_to_maddr(nest->hvmcs));
+    memcpy(nest->svmcs, nest->hvmcs, PAGE_SIZE);
+    __vmptrld(virt_to_maddr(nest->hvmcs));
+    v->arch.hvm_vmx.launched = 0;
+
+    vmreturn(regs, VMSUCCEED);
+
+out:
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    int rc;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    nest->guest_vmxon_pa = 0;
+    __vmpclear(virt_to_maddr(nest->svmcs));
+
+#if !CONFIG_VVMCS_MAPPING
+    free_xenheap_page(nest->vvmcs);
+#endif
+    free_xenheap_page(nest->svmcs);
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmptrld(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned long gpa = 0;
+    int rc;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    rc = decode_vmx_inst(regs, &decode);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
+    if ( rc != HVMCOPY_okay )
+        return X86EMUL_EXCEPTION;
+
+    if ( gpa == nest->guest_vmxon_pa || gpa & 0xfff )
+    {
+        vmreturn(regs, VMFAIL_INVALID);
+        goto out;
+    }
+
+    if ( nest->gvmcs_pa != gpa )
+    {
+        if ( nest->vmcs_valid )
+        {
+            rc = __clear_current_vvmcs(nest);
+            if ( rc != X86EMUL_OKAY )
+                return rc;
+        }
+        nest->gvmcs_pa = gpa;
+        ASSERT(nest->vmcs_valid == 0);
+    }
+
+
+    if ( !nest->vmcs_valid )
+    {
+#if CONFIG_VVMCS_MAPPING
+        unsigned long mfn;
+        p2m_type_t p2mt;
+
+        mfn = mfn_x(gfn_to_mfn(p2m_get_hostp2m(v->domain),
+                               nest->gvmcs_pa >> PAGE_SHIFT, &p2mt));
+        nest->vvmcs = map_domain_page_global(mfn);
+#else
+        rc = hvm_copy_from_guest_phys(nest->vvmcs, nest->gvmcs_pa, PAGE_SIZE);
+        if ( rc != HVMCOPY_okay )
+            return X86EMUL_EXCEPTION;
+#endif
+        nest->vmcs_valid = 1;
+    }
+
+    vmreturn(regs, VMSUCCEED);
+
+out:
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmptrst(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned long gpa = 0;
+    int rc;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    rc = decode_vmx_inst(regs, &decode);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+
+    gpa = nest->gvmcs_pa;
+
+    rc = hvm_copy_to_guest_virt(decode.mem, &gpa, decode.len, 0);
+    if ( rc != HVMCOPY_okay )
+        return X86EMUL_EXCEPTION;
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmclear(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned long gpa = 0;
+    int rc;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    rc = decode_vmx_inst(regs, &decode);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
+    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
+    if ( rc != HVMCOPY_okay )
+        return X86EMUL_EXCEPTION;
+
+    if ( gpa & 0xfff )
+    {
+        vmreturn(regs, VMFAIL_VALID);
+        goto out;
+    }
+
+    if ( gpa != nest->gvmcs_pa )
+    {
+        gdprintk(XENLOG_ERR, "vmclear gpa does not match the current vmcs\n");
+        vmreturn(regs, VMSUCCEED);
+        goto out;
+    }
+
+    rc = __clear_current_vvmcs(nest);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+#if CONFIG_VVMCS_MAPPING
+    unmap_domain_page_global(nest->vvmcs);
+    nest->vvmcs = NULL;
+#endif
+
+    vmreturn(regs, VMSUCCEED);
+
+out:
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+
+
+int vmx_nest_handle_vmread(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    u64 value = 0;
+    int rc;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    rc = decode_vmx_inst(regs, &decode);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    value = __get_vvmcs(nest->vvmcs, reg_read(regs, decode.reg2));
+
+    switch ( decode.type ) {
+    case VMX_INST_MEMREG_TYPE_MEMORY:
+        rc = hvm_copy_to_guest_virt(decode.mem, &value, decode.len, 0);
+        if ( rc != HVMCOPY_okay )
+            return X86EMUL_EXCEPTION;
+        break;
+    case VMX_INST_MEMREG_TYPE_REG:
+        reg_write(regs, decode.reg1, value);
+        break;
+    }
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_inst_decoded decode;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    u64 value = 0;
+    int rc;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    rc = decode_vmx_inst(regs, &decode);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    switch ( decode.type ) {
+    case VMX_INST_MEMREG_TYPE_MEMORY:
+        rc = hvm_copy_from_guest_virt(&value, decode.mem, decode.len, 0);
+        if ( rc != HVMCOPY_okay )
+            return X86EMUL_EXCEPTION;
+        break;
+    case VMX_INST_MEMREG_TYPE_REG:
+        value = reg_read(regs, decode.reg1);
+        break;
+    }
+
+    __set_vvmcs(nest->vvmcs, reg_read(regs, decode.reg2), value);
+
+    vmreturn(regs, VMSUCCEED);
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
diff -r f1c1d3077337 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 21:03:46 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 21:30:01 2010 +0800
@@ -2587,17 +2587,46 @@
         break;
     }
 
+    case EXIT_REASON_VMCLEAR:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmclear(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMPTRLD:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmptrld(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMPTRST:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmptrst(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMREAD:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmread(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMWRITE:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmwrite(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMXOFF:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmxoff(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+    case EXIT_REASON_VMXON:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmxon(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
+
     case EXIT_REASON_MWAIT_INSTRUCTION:
     case EXIT_REASON_MONITOR_INSTRUCTION:
-    case EXIT_REASON_VMCLEAR:
     case EXIT_REASON_VMLAUNCH:
-    case EXIT_REASON_VMPTRLD:
-    case EXIT_REASON_VMPTRST:
-    case EXIT_REASON_VMREAD:
     case EXIT_REASON_VMRESUME:
-    case EXIT_REASON_VMWRITE:
-    case EXIT_REASON_VMXOFF:
-    case EXIT_REASON_VMXON:
         vmx_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
         break;
 
diff -r f1c1d3077337 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:03:46 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:30:01 2010 +0800
@@ -42,4 +42,14 @@
     int                  vmcs_valid;
 };
 
+int vmx_nest_handle_vmxon(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs);
+
+int vmx_nest_handle_vmptrld(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmptrst(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmclear(struct cpu_user_regs *regs);
+
+int vmx_nest_handle_vmread(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */


* [PATCH 07/16] vmx: nest: switch current vmcs
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

Add a facility to switch between the host VMCS and the shadow VMCS.
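
A minimal usage sketch (the call sites here are assumptions about
how later patches drive the switch, not part of this patch):

    /* L1 -> L2: make the shadow VMCS current */
    vmx_vmcs_switch_current(v, nest->hvmcs, nest->svmcs);

    /* L2 -> L1: return to the host VMCS */
    vmx_vmcs_switch_current(v, nest->svmcs, nest->hvmcs);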

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r e638812d8f46 xen/arch/x86/hvm/vmx/vmcs.c
--- a/xen/arch/x86/hvm/vmx/vmcs.c	Wed Sep 08 21:30:02 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmcs.c	Wed Sep 08 21:42:10 2010 +0800
@@ -642,6 +642,35 @@
               (unsigned long)&get_cpu_info()->guest_cpu_user_regs.error_code);
 }
 
+void vmx_vmcs_switch_current(struct vcpu *v,
+                             struct vmcs_struct *from,
+                             struct vmcs_struct *to)
+{
+    /* no foreign access */
+    if ( unlikely(v != current) )
+        return;
+
+    if ( unlikely(current->arch.hvm_vmx.vmcs != from) )
+        return;
+
+    spin_lock(&v->arch.hvm_vmx.vmcs_lock);
+
+    __vmpclear(virt_to_maddr(from));
+    __vmptrld(virt_to_maddr(to));
+
+    v->arch.hvm_vmx.vmcs = to;
+    v->arch.hvm_vmx.launched = 0;
+    this_cpu(current_vmcs) = to;
+
+    if ( v->arch.hvm_vmx.vmcs_host_updated )
+    {
+        v->arch.hvm_vmx.vmcs_host_updated = 0;
+        vmx_set_host_env(v);
+    }
+
+    spin_unlock(&v->arch.hvm_vmx.vmcs_lock);
+}
+
 void vmx_disable_intercept_for_msr(struct vcpu *v, u32 msr)
 {
     unsigned long *msr_bitmap = v->arch.hvm_vmx.msr_bitmap;
@@ -1080,6 +1109,12 @@
         hvm_migrate_pirqs(v);
         vmx_set_host_env(v);
         hvm_asid_flush_vcpu(v);
+
+        /*
+         * nesting: we need to do an additional host env sync if we have
+         * other VMCSes. Currently this works with only one active sVMCS.
+         */
+        v->arch.hvm_vmx.vmcs_host_updated = 1;
     }
 
     debug_state = v->domain->debugger_attached;
diff -r e638812d8f46 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:30:02 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:42:10 2010 +0800
@@ -102,6 +102,7 @@
 
     /* nested virtualization */
     struct vmx_nest_struct nest;
+    int                  vmcs_host_updated;
 
 #ifdef __x86_64__
     struct vmx_msr_state msr_state;
@@ -389,6 +390,9 @@
 int vmx_write_guest_msr(u32 msr, u64 val);
 int vmx_add_guest_msr(u32 msr);
 int vmx_add_host_load_msr(u32 msr);
+void vmx_vmcs_switch_current(struct vcpu *v,
+                             struct vmcs_struct *from,
+                             struct vmcs_struct *to);
 
 #endif /* ASM_X86_HVM_VMX_VMCS_H__ */


* [PATCH 08/16] vmx: nest: vmresume/vmlaunch
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

Implement the vmresume and vmlaunch instructions and the
transitional states.
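
In outline, the intended sequence pieced together from the flags
introduced below (a sketch, not code from the patch):

    /* 1. a VMRESUME exit from L1: the handler sets
     *      nest->vmresume_pending = 1;
     * 2. before the next vmentry, the mode switch (patch 10) sees
     *    the pending flag, loads the shadow VMCS and sets
     *      nest->vmresume_in_progress = 1;
     * 3. vmx_vmexit_handler() clears vmresume_in_progress once the
     *    physical vmentry of the shadow VMCS has really happened. */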

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r e828d55c10bb xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 21:42:10 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:04:16 2010 +0800
@@ -633,3 +633,33 @@
     hvm_inject_exception(TRAP_invalid_op, 0, 0);
     return X86EMUL_EXCEPTION;
 }
+
+int vmx_nest_handle_vmresume(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    int rc;
+
+    if ( unlikely(!nest->guest_vmxon_pa) )
+        goto invalid_op;
+
+    rc = vmx_inst_check_privilege(regs);
+    if ( rc != X86EMUL_OKAY )
+        return rc;
+
+    if ( nest->vmcs_valid == 1 )
+        nest->vmresume_pending = 1;
+    else
+        vmreturn(regs, VMFAIL_INVALID);
+
+    return X86EMUL_OKAY;
+
+invalid_op:
+    hvm_inject_exception(TRAP_invalid_op, 0, 0);
+    return X86EMUL_EXCEPTION;
+}
+
+int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs)
+{
+    return vmx_nest_handle_vmresume(regs);
+}
diff -r e828d55c10bb xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 21:42:10 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:04:16 2010 +0800
@@ -2321,6 +2321,11 @@
     /* Now enable interrupts so it's safe to take locks. */
     local_irq_enable();
 
+    /* XXX: This looks ugly, but we need a mechanism to ensure
+     * any pending vmresume has really happened
+     */
+    v->arch.hvm_vmx.nest.vmresume_in_progress = 0;
+
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
         return vmx_failed_vmentry(exit_reason, regs);
 
@@ -2592,6 +2597,11 @@
         if ( vmx_nest_handle_vmclear(regs) == X86EMUL_OKAY )
             __update_guest_eip(inst_len);
         break;
+    case EXIT_REASON_VMLAUNCH:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmlaunch(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
     case EXIT_REASON_VMPTRLD:
         inst_len = __get_instruction_length();
         if ( vmx_nest_handle_vmptrld(regs) == X86EMUL_OKAY )
@@ -2607,6 +2617,11 @@
         if ( vmx_nest_handle_vmread(regs) == X86EMUL_OKAY )
             __update_guest_eip(inst_len);
         break;
+    case EXIT_REASON_VMRESUME:
+        inst_len = __get_instruction_length();
+        if ( vmx_nest_handle_vmresume(regs) == X86EMUL_OKAY )
+            __update_guest_eip(inst_len);
+        break;
     case EXIT_REASON_VMWRITE:
         inst_len = __get_instruction_length();
         if ( vmx_nest_handle_vmwrite(regs) == X86EMUL_OKAY )
@@ -2625,8 +2640,6 @@
 
     case EXIT_REASON_MWAIT_INSTRUCTION:
     case EXIT_REASON_MONITOR_INSTRUCTION:
-    case EXIT_REASON_VMLAUNCH:
-    case EXIT_REASON_VMRESUME:
         vmx_inject_hw_exception(TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
         break;
 
diff -r e828d55c10bb xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:42:10 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:04:16 2010 +0800
@@ -40,6 +40,20 @@
     void                *vvmcs;
     struct vmcs_struct  *svmcs;
     int                  vmcs_valid;
+
+    /*
+     * vmexit_pending and vmresume_pending mark pending switches;
+     * they are cleared when the physical vmcs is changed.
+     */
+    int                  vmexit_pending;
+    int                  vmresume_pending;
+
+    /*
+     * Upon L1->L2, there is a window between the context switch and
+     * the physical vmentry of the shadow vmcs; protect against it
+     * with vmresume_in_progress.
+     */
+    int                  vmresume_in_progress;
 };
 
 int vmx_nest_handle_vmxon(struct cpu_user_regs *regs);
@@ -52,4 +66,7 @@
 int vmx_nest_handle_vmread(struct cpu_user_regs *regs);
 int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs);
 
+int vmx_nest_handle_vmresume(struct cpu_user_regs *regs);
+int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */


* [PATCH 09/16] vmx: nest: shadow controls
From: Qing He @ 2010-09-08 15:22 UTC
  To: xen-devel; +Cc: Qing He

Automatically compute the controls according to the current mode.
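
The merge policy is a simple OR of what L1 requests in the vvmcs
with what L0 itself requires, as in this sketch of the
set_shadow_control() helper below:

    /* If L1 asks for HLT exiting while L0 needs MOV-DR exiting,
     * the shadow VMCS ends up with both: */
    value = (u32) __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL)
            | v->arch.hvm_vmx.exec_control;
    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, value);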

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 625f74a9bb36 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:04:16 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:07:15 2010 +0800
@@ -663,3 +663,35 @@
 {
     return vmx_nest_handle_vmresume(regs);
 }
+
+static void set_shadow_control(struct vmx_nest_struct *nest,
+                               unsigned int field,
+                               u32 host_value)
+{
+    u32 value;
+
+    value = (u32) __get_vvmcs(nest->vvmcs, field) | host_value;
+    __vmwrite(field, value);
+}
+
+void vmx_nest_update_exec_control(struct vcpu *v, unsigned long value)
+{
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    set_shadow_control(nest, CPU_BASED_VM_EXEC_CONTROL, value);
+}
+
+void vmx_nest_update_secondary_exec_control(struct vcpu *v,
+                                            unsigned long value)
+{
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    set_shadow_control(nest, SECONDARY_VM_EXEC_CONTROL, value);
+}
+
+void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value)
+{
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    set_shadow_control(nest, EXCEPTION_BITMAP, value);
+}
diff -r 625f74a9bb36 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:04:16 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:07:15 2010 +0800
@@ -387,18 +387,28 @@
 
 void vmx_update_cpu_exec_control(struct vcpu *v)
 {
-    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_update_exec_control(v, v->arch.hvm_vmx.exec_control);
+    else
+        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
 }
 
 void vmx_update_secondary_exec_control(struct vcpu *v)
 {
-    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
-              v->arch.hvm_vmx.secondary_exec_control);
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_update_secondary_exec_control(v,
+            v->arch.hvm_vmx.secondary_exec_control);
+    else
+        __vmwrite(SECONDARY_VM_EXEC_CONTROL,
+                  v->arch.hvm_vmx.secondary_exec_control);
 }
 
 void vmx_update_exception_bitmap(struct vcpu *v)
 {
-    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_update_exception_bitmap(v, v->arch.hvm_vmx.exception_bitmap);
+    else
+        __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
 }
 
 static int vmx_guest_x86_mode(struct vcpu *v)
diff -r 625f74a9bb36 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:04:16 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:07:15 2010 +0800
@@ -69,4 +69,9 @@
 int vmx_nest_handle_vmresume(struct cpu_user_regs *regs);
 int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs);
 
+void vmx_nest_update_exec_control(struct vcpu *v, unsigned long value);
+void vmx_nest_update_secondary_exec_control(struct vcpu *v,
+                                            unsigned long value);
+void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 10/16] vmx: nest: L1 <-> L2 context switch
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (8 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 09/16] vmx: nest: shadow controls Qing He
@ 2010-09-08 15:22 ` Qing He
  2010-09-08 15:22 ` [PATCH 11/16] vmx: nest: interrupt handling Qing He
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Qing He @ 2010-09-08 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

This patch adds mode switch between L1 and L2
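
Most of each switch is a bulk copy of VMCS fields between the virtual
VMCS (vvmcs) and the hardware shadow VMCS. A self-contained sketch of
the copy pattern used by load_vvmcs_guest_state() below, with the field
list abbreviated here for illustration:

    static const unsigned long gstate_fields[] = {
        GUEST_RIP, GUEST_RSP, GUEST_RFLAGS, /* ...full list in the diff */
    };

    /* vvmcs -> shadow vmcs: one __vmwrite() per field */
    static void gstate_to_shadow(void *vvmcs)
    {
        unsigned int i;

        for ( i = 0; i < ARRAY_SIZE(gstate_fields); i++ )
            __vmwrite(gstate_fields[i],
                      __get_vvmcs(vvmcs, gstate_fields[i]));
    }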

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 86c36f5c38f2 xen/arch/x86/hvm/vmx/entry.S
--- a/xen/arch/x86/hvm/vmx/entry.S	Wed Sep 08 22:07:15 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/entry.S	Wed Sep 08 22:11:52 2010 +0800
@@ -119,6 +119,7 @@
 .globl vmx_asm_do_vmentry
 vmx_asm_do_vmentry:
         call vmx_intr_assist
+        call vmx_nest_switch_mode
 
         get_current(bx)
         cli
diff -r 86c36f5c38f2 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:07:15 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:11:52 2010 +0800
@@ -22,6 +22,8 @@
 #include <xen/config.h>
 #include <asm/types.h>
 #include <asm/p2m.h>
+#include <asm/paging.h>
+#include <asm/hvm/support.h>
 #include <asm/hvm/vmx/vmx.h>
 #include <asm/hvm/vmx/vvmcs.h>
 #include <asm/hvm/vmx/nest.h>
@@ -695,3 +697,349 @@
 
     set_shadow_control(nest, EXCEPTION_BITMAP, value);
 }
+
+/*
+ * Nested VMX context switch
+ */
+
+static unsigned long vmcs_gstate_field[] = {
+    /* 16 BITS */
+    GUEST_ES_SELECTOR,
+    GUEST_CS_SELECTOR,
+    GUEST_SS_SELECTOR,
+    GUEST_DS_SELECTOR,
+    GUEST_FS_SELECTOR,
+    GUEST_GS_SELECTOR,
+    GUEST_LDTR_SELECTOR,
+    GUEST_TR_SELECTOR,
+    /* 64 BITS */
+    VMCS_LINK_POINTER,
+    GUEST_IA32_DEBUGCTL,
+#ifndef CONFIG_X86_64
+    VMCS_LINK_POINTER_HIGH,
+    GUEST_IA32_DEBUGCTL_HIGH,
+#endif
+    /* 32 BITS */
+    GUEST_ES_LIMIT,
+    GUEST_CS_LIMIT,
+    GUEST_SS_LIMIT,
+    GUEST_DS_LIMIT,
+    GUEST_FS_LIMIT,
+    GUEST_GS_LIMIT,
+    GUEST_LDTR_LIMIT,
+    GUEST_TR_LIMIT,
+    GUEST_GDTR_LIMIT,
+    GUEST_IDTR_LIMIT,
+    GUEST_ES_AR_BYTES,
+    GUEST_CS_AR_BYTES,
+    GUEST_SS_AR_BYTES,
+    GUEST_DS_AR_BYTES,
+    GUEST_FS_AR_BYTES,
+    GUEST_GS_AR_BYTES,
+    GUEST_LDTR_AR_BYTES,
+    GUEST_TR_AR_BYTES,
+    GUEST_INTERRUPTIBILITY_INFO,
+    GUEST_ACTIVITY_STATE,
+    GUEST_SYSENTER_CS,
+    /* natural */
+    GUEST_ES_BASE,
+    GUEST_CS_BASE,
+    GUEST_SS_BASE,
+    GUEST_DS_BASE,
+    GUEST_FS_BASE,
+    GUEST_GS_BASE,
+    GUEST_LDTR_BASE,
+    GUEST_TR_BASE,
+    GUEST_GDTR_BASE,
+    GUEST_IDTR_BASE,
+    GUEST_DR7,
+    GUEST_RSP,
+    GUEST_RIP,
+    GUEST_RFLAGS,
+    GUEST_PENDING_DBG_EXCEPTIONS,
+    GUEST_SYSENTER_ESP,
+    GUEST_SYSENTER_EIP,
+};
+
+static unsigned long vmcs_ro_field[] = {
+    GUEST_PHYSICAL_ADDRESS,
+    VM_INSTRUCTION_ERROR,
+    VM_EXIT_REASON,
+    VM_EXIT_INTR_INFO,
+    VM_EXIT_INTR_ERROR_CODE,
+    IDT_VECTORING_INFO,
+    IDT_VECTORING_ERROR_CODE,
+    VM_EXIT_INSTRUCTION_LEN,
+    VMX_INSTRUCTION_INFO,
+    EXIT_QUALIFICATION,
+    GUEST_LINEAR_ADDRESS
+};
+
+static struct vmcs_host_to_guest {
+    unsigned long host_field;
+    unsigned long guest_field;
+} vmcs_h2g_field[] = {
+    {HOST_ES_SELECTOR, GUEST_ES_SELECTOR},
+    {HOST_CS_SELECTOR, GUEST_CS_SELECTOR},
+    {HOST_SS_SELECTOR, GUEST_SS_SELECTOR},
+    {HOST_DS_SELECTOR, GUEST_DS_SELECTOR},
+    {HOST_FS_SELECTOR, GUEST_FS_SELECTOR},
+    {HOST_GS_SELECTOR, GUEST_GS_SELECTOR},
+    {HOST_TR_SELECTOR, GUEST_TR_SELECTOR},
+    {HOST_SYSENTER_CS, GUEST_SYSENTER_CS},
+    {HOST_FS_BASE, GUEST_FS_BASE},
+    {HOST_GS_BASE, GUEST_GS_BASE},
+    {HOST_TR_BASE, GUEST_TR_BASE},
+    {HOST_GDTR_BASE, GUEST_GDTR_BASE},
+    {HOST_IDTR_BASE, GUEST_IDTR_BASE},
+    {HOST_SYSENTER_ESP, GUEST_SYSENTER_ESP},
+    {HOST_SYSENTER_EIP, GUEST_SYSENTER_EIP},
+};
+
+static void vvmcs_to_shadow(void *vvmcs, unsigned int field)
+{
+    u64 value;
+
+    value = __get_vvmcs(vvmcs, field);
+    __vmwrite(field, value);
+}
+
+static void vvmcs_from_shadow(void *vvmcs, unsigned int field)
+{
+    u64 value;
+    int rc;
+
+    value = __vmread_safe(field, &rc);
+    if ( !rc )
+        __set_vvmcs(vvmcs, field, value);
+}
+
+static void load_vvmcs_control(struct vmx_nest_struct *nest)
+{
+    u32 exit_control;
+    struct vcpu *v = current;
+
+    /* PIN_BASED, CPU_BASED controls: the union of L0 & L1 */
+    set_shadow_control(nest, PIN_BASED_VM_EXEC_CONTROL,
+                       vmx_pin_based_exec_control);
+    vmx_update_cpu_exec_control(v);
+
+    /* VM_EXIT_CONTROLS: owned by L0 except bits below */
+#define EXIT_CONTROL_GUEST_BITS    ((1<<2) | (1<<18) | (1<<20) | (1<<22))
+    exit_control = __get_vvmcs(nest->vvmcs, VM_EXIT_CONTROLS) &
+                   EXIT_CONTROL_GUEST_BITS;
+    exit_control |= (vmx_vmexit_control & ~EXIT_CONTROL_GUEST_BITS);
+    __vmwrite(VM_EXIT_CONTROLS, exit_control);
+
+    /* VM_ENTRY_CONTROLS: owned by L1 */
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_CONTROLS);
+
+    vmx_update_exception_bitmap(v);
+}
+
+static void load_vvmcs_guest_state(struct vmx_nest_struct *nest)
+{
+    int i;
+
+    /* vvmcs.gstate to svmcs.gstate */
+    for ( i = 0; i < ARRAY_SIZE(vmcs_gstate_field); i++ )
+        vvmcs_to_shadow(nest->vvmcs, vmcs_gstate_field[i]);
+
+    hvm_set_cr0(__get_vvmcs(nest->vvmcs, GUEST_CR0));
+    hvm_set_cr4(__get_vvmcs(nest->vvmcs, GUEST_CR4));
+    hvm_set_cr3(__get_vvmcs(nest->vvmcs, GUEST_CR3));
+
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INTR_INFO);
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_EXCEPTION_ERROR_CODE);
+    vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INSTRUCTION_LEN);
+
+    /* XXX: should refer to GUEST_HOST_MASK of both L0 and L1 */
+    vvmcs_to_shadow(nest->vvmcs, CR0_READ_SHADOW);
+    vvmcs_to_shadow(nest->vvmcs, CR4_READ_SHADOW);
+    vvmcs_to_shadow(nest->vvmcs, CR0_GUEST_HOST_MASK);
+    vvmcs_to_shadow(nest->vvmcs, CR4_GUEST_HOST_MASK);
+
+    /* TODO: PDPTRs for nested ept */
+    /* TODO: CR3 target control */
+}
+
+static void virtual_vmentry(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+#ifdef __x86_64__
+    unsigned long lm_l1, lm_l2;
+#endif
+
+    vmx_vmcs_switch_current(v, v->arch.hvm_vmx.vmcs, nest->svmcs);
+
+    v->arch.hvm_vcpu.in_nesting = 1;
+    nest->vmresume_pending = 0;
+    nest->vmresume_in_progress = 1;
+
+#ifdef __x86_64__
+    /*
+     * EFER handling:
+     * hvm_set_efer won't work if CR0.PG = 1, so we change the value
+     * directly to make hvm_long_mode_enabled(v) work in L2.
+     * An additional update_paging_modes is also needed if there
+     * is a 32/64 switch. v->arch.hvm_vcpu.guest_efer doesn't
+     * need to be saved, since its value on vmexit is determined by
+     * L1 exit_controls
+     */
+    lm_l1 = !!hvm_long_mode_enabled(v);
+    lm_l2 = !!(__get_vvmcs(nest->vvmcs, VM_ENTRY_CONTROLS) &
+                           VM_ENTRY_IA32E_MODE);
+
+    if ( lm_l2 )
+        v->arch.hvm_vcpu.guest_efer |= EFER_LMA | EFER_LME;
+    else
+        v->arch.hvm_vcpu.guest_efer &= ~(EFER_LMA | EFER_LME);
+#endif
+
+    load_vvmcs_control(nest);
+    load_vvmcs_guest_state(nest);
+
+#ifdef __x86_64__
+    if ( lm_l1 != lm_l2 )
+    {
+        paging_update_paging_modes(v);
+    }
+#endif
+
+    regs->rip = __get_vvmcs(nest->vvmcs, GUEST_RIP);
+    regs->rsp = __get_vvmcs(nest->vvmcs, GUEST_RSP);
+    regs->rflags = __get_vvmcs(nest->vvmcs, GUEST_RFLAGS);
+
+    /* TODO: EPT_POINTER */
+}
+
+static void sync_vvmcs_guest_state(struct vmx_nest_struct *nest)
+{
+    int i;
+    unsigned long mask;
+    unsigned long cr;
+
+    /* copy svmcs.gstate back to vvmcs.gstate */
+    for ( i = 0; i < ARRAY_SIZE(vmcs_gstate_field); i++ )
+        vvmcs_from_shadow(nest->vvmcs, vmcs_gstate_field[i]);
+
+    /* SDM 20.6.6: L2 guest execution may change GUEST CR0/CR4 */
+    mask = __get_vvmcs(nest->vvmcs, CR0_GUEST_HOST_MASK);
+    if ( ~mask )
+    {
+        cr = __get_vvmcs(nest->vvmcs, GUEST_CR0);
+        cr = (cr & mask) | (__vmread(GUEST_CR0) & ~mask);
+        __set_vvmcs(nest->vvmcs, GUEST_CR0, cr);
+    }
+
+    mask = __get_vvmcs(nest->vvmcs, CR4_GUEST_HOST_MASK);
+    if ( ~mask )
+    {
+        cr = __get_vvmcs(nest->vvmcs, GUEST_CR4);
+        cr = (cr & mask) | (__vmread(GUEST_CR4) & ~mask);
+        __set_vvmcs(nest->vvmcs, GUEST_CR4, cr);
+    }
+
+    /* CR3 sync if exec doesn't want cr3 load exiting: i.e. nested EPT */
+    if ( !(__get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
+           CPU_BASED_CR3_LOAD_EXITING) )
+        vvmcs_from_shadow(nest->vvmcs, GUEST_CR3);
+}
+
+static void sync_vvmcs_ro(struct vmx_nest_struct *nest)
+{
+    int i;
+
+    for ( i = 0; i < ARRAY_SIZE(vmcs_ro_field); i++ )
+        vvmcs_from_shadow(nest->vvmcs, vmcs_ro_field[i]);
+}
+
+static void load_vvmcs_host_state(struct vmx_nest_struct *nest)
+{
+    int i;
+    u64 r;
+
+    for ( i = 0; i < ARRAY_SIZE(vmcs_h2g_field); i++ )
+    {
+        r = __get_vvmcs(nest->vvmcs, vmcs_h2g_field[i].host_field);
+        __vmwrite(vmcs_h2g_field[i].guest_field, r);
+    }
+
+    hvm_set_cr0(__get_vvmcs(nest->vvmcs, HOST_CR0));
+    hvm_set_cr4(__get_vvmcs(nest->vvmcs, HOST_CR4));
+    hvm_set_cr3(__get_vvmcs(nest->vvmcs, HOST_CR3));
+
+    __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
+}
+
+static void virtual_vmexit(struct cpu_user_regs *regs)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+#ifdef __x86_64__
+    unsigned long lm_l1, lm_l2;
+#endif
+
+    sync_vvmcs_ro(nest);
+    sync_vvmcs_guest_state(nest);
+
+    vmx_vmcs_switch_current(v, v->arch.hvm_vmx.vmcs, nest->hvmcs);
+
+    v->arch.hvm_vcpu.in_nesting = 0;
+    nest->vmexit_pending = 0;
+
+#ifdef __x86_64__
+    lm_l2 = !!hvm_long_mode_enabled(v);
+    lm_l1 = !!(__get_vvmcs(nest->vvmcs, VM_EXIT_CONTROLS) &
+                           VM_EXIT_IA32E_MODE);
+
+    if ( lm_l1 )
+        v->arch.hvm_vcpu.guest_efer |= EFER_LMA | EFER_LME;
+    else
+        v->arch.hvm_vcpu.guest_efer &= ~(EFER_LMA | EFER_LME);
+#endif
+
+    vmx_update_cpu_exec_control(v);
+    vmx_update_exception_bitmap(v);
+
+    load_vvmcs_host_state(nest);
+
+#ifdef __x86_64__
+    if ( lm_l1 != lm_l2 )
+        paging_update_paging_modes(v);
+#endif
+
+    regs->rip = __get_vvmcs(nest->vvmcs, HOST_RIP);
+    regs->rsp = __get_vvmcs(nest->vvmcs, HOST_RSP);
+    regs->rflags = __vmread(GUEST_RFLAGS);
+
+    vmreturn(regs, VMSUCCEED);
+}
+
+asmlinkage void vmx_nest_switch_mode(void)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    struct cpu_user_regs *regs = guest_cpu_user_regs();
+
+    /*
+     * A softirq may interrupt us between the handling of a virtual
+     * vmentry and the true vmentry. If, during this window, an L1
+     * virtual interrupt causes another virtual vmexit,
+     * VM_ENTRY_INTR_INFO would be lost, so we cannot let that happen.
+     */
+    if ( unlikely(nest->vmresume_in_progress) )
+        return;
+
+    if ( v->arch.hvm_vcpu.in_nesting && nest->vmexit_pending )
+    {
+        local_irq_enable();
+        virtual_vmexit(regs);
+    }
+    else if ( !v->arch.hvm_vcpu.in_nesting && nest->vmresume_pending )
+    {
+        local_irq_enable();
+        virtual_vmentry(regs);
+    }
+}
diff -r 86c36f5c38f2 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:07:15 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:11:52 2010 +0800
@@ -56,6 +56,8 @@
     int                  vmresume_in_progress;
 };
 
+asmlinkage void vmx_nest_switch_mode(void);
+
 int vmx_nest_handle_vmxon(struct cpu_user_regs *regs);
 int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs);

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 11/16] vmx: nest: interrupt handling
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (9 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 10/16] vmx: nest: L1 <-> L2 context switch Qing He
@ 2010-09-08 15:22 ` Qing He
  2010-09-08 15:22 ` [PATCH 12/16] vmx: nest: VMExit handler in L2 Qing He
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Qing He @ 2010-09-08 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

This patch adds interrupt handling for nested mode, mainly:
  - virtual interrupt injection when running in nested mode
  - IDT vectoring (idtv) handling in L2
  - interrupt blocking handling in L2
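
As a rough sketch, the injection decision added here looks like the
following (nest_intr_blocked() and enable_intr_window() are from the
diff below; inject_to_l1_as_vmexit() is a hypothetical name standing
in for the virtual-vmexit path in vmx_nest_intr_intercept()):

    if ( !v->arch.hvm_vcpu.in_nesting )
        /* normal path: gate on RFLAGS.IF, inject via the VMCS */ ;
    else if ( nest_intr_blocked(v, intack) )
        enable_intr_window(v, intack);      /* retry once unblocked */
    else
        inject_to_l1_as_vmexit(v, intack);  /* virtual VMExit to L1,
                                               ack per intr_ack_on_exit */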

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 4934d8db96bf xen/arch/x86/hvm/vmx/intr.c
--- a/xen/arch/x86/hvm/vmx/intr.c	Wed Sep 08 22:11:52 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/intr.c	Wed Sep 08 22:14:26 2010 +0800
@@ -33,6 +33,7 @@
 #include <asm/hvm/support.h>
 #include <asm/hvm/vmx/vmx.h>
 #include <asm/hvm/vmx/vmcs.h>
+#include <asm/hvm/vmx/vvmcs.h>
 #include <asm/hvm/vpic.h>
 #include <asm/hvm/vlapic.h>
 #include <public/hvm/ioreq.h>
@@ -110,6 +111,103 @@
     }
 }
 
+/*
+ * Injecting interrupts for nested virtualization
+ *
+ *  When injecting virtual interrupts (originated from L0), there are
+ *  two major possibilities, within L1 context and within L2 context
+ *   1. L1 context (in_nesting == 0)
+ *     Everything is the same as in the non-nested case: check RFLAGS.IF
+ *     to see whether the injection can be done, and use the VMCS to
+ *     inject the interrupt
+ *
+ *   2. L2 context (in_nesting == 1)
+ *     Causes a virtual VMExit; RFLAGS.IF is ignored, and whether to
+ *     ack the irq depends on intr_ack_on_exit. It shouldn't block
+ *     normally, except for:
+ *    a. context transition
+ *     interrupt needs to be blocked at virtual VMEntry time
+ *    b. L2 idtv reinjection
+ *     if L2 idtv is handled within L0 (e.g. L0 shadow page fault),
+ *     it needs to be reinjected without exiting to L1, interrupt
+ *     injection should be blocked as well at this point.
+ *
+ *  Unfortunately, interrupt blocking in L2 won't work with simple
+ *  intr_window_open (which depends on L2's IF). To solve this,
+ *  the following algorithm can be used:
+ *   v->arch.hvm_vmx.exec_control.VIRTUAL_INTR_PENDING now denotes
+ *   only L0 control, physical control may be different from it.
+ *       - if in L1, it behaves normally, intr window is written
+ *         to physical control as it is
+ *       - if in L2, replace it to MTF (or NMI window) if possible
+ *       - if MTF/NMI window is not used, intr window can still be
+ *         used but may have negative impact on interrupt performance.
+ */
+
+static int nest_intr_blocked(struct vcpu *v, struct hvm_intack intack)
+{
+    int r = 0;
+
+    if ( !v->arch.hvm_vcpu.in_nesting &&
+         v->arch.hvm_vmx.nest.vmresume_pending )
+        r = 1;
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+    {
+        if ( v->arch.hvm_vmx.nest.vmexit_pending ||
+             v->arch.hvm_vmx.nest.vmresume_in_progress ||
+             (__vmread(VM_ENTRY_INTR_INFO) & INTR_INFO_VALID_MASK) )
+            r = 1;
+    }
+
+    return r;
+}
+
+static int vmx_nest_intr_intercept(struct vcpu *v, struct hvm_intack intack)
+{
+    u32 exit_ctrl;
+
+    /*
+     * TODO:
+     *   - if L1 intr-window exiting == 0
+     *   - vNMI
+     */
+
+    if ( nest_intr_blocked(v, intack) )
+    {
+        enable_intr_window(v, intack);
+        return 1;
+    }
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+    {
+        if ( intack.source == hvm_intsrc_pic ||
+                 intack.source == hvm_intsrc_lapic )
+        {
+            vmx_inject_extint(intack.vector);
+
+            exit_ctrl = __get_vvmcs(v->arch.hvm_vmx.nest.vvmcs,
+                            VM_EXIT_CONTROLS);
+            if ( exit_ctrl & VM_EXIT_ACK_INTR_ON_EXIT )
+            {
+                /* for now, duplicate the ack path in vmx_intr_assist */
+                hvm_vcpu_ack_pending_irq(v, intack);
+                pt_intr_post(v, intack);
+
+                intack = hvm_vcpu_has_pending_irq(v);
+                if ( unlikely(intack.source != hvm_intsrc_none) )
+                    enable_intr_window(v, intack);
+            }
+            else
+                enable_intr_window(v, intack);
+
+            return 1;
+        }
+    }
+
+    return 0;
+}
+
 asmlinkage void vmx_intr_assist(void)
 {
     struct hvm_intack intack;
@@ -133,6 +231,9 @@
         if ( likely(intack.source == hvm_intsrc_none) )
             goto out;
 
+        if ( unlikely(vmx_nest_intr_intercept(v, intack)) )
+            goto out;
+
         intblk = hvm_interrupt_blocked(v, intack);
         if ( intblk == hvm_intblk_tpr )
         {
diff -r 4934d8db96bf xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:11:52 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:14:26 2010 +0800
@@ -680,6 +680,7 @@
 {
     struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
 
+    /* TODO: change L0 intr window to MTF or NMI window */
     set_shadow_control(nest, CPU_BASED_VM_EXEC_CONTROL, value);
 }
 
@@ -973,6 +974,33 @@
     __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
 }
 
+static void vmx_nest_intr_exit(struct vmx_nest_struct *nest)
+{
+    if ( !(nest->intr_info & INTR_INFO_VALID_MASK) )
+        return;
+
+    switch ( nest->intr_info & INTR_INFO_INTR_TYPE_MASK )
+    {
+    case X86_EVENTTYPE_EXT_INTR:
+        /* rename exit_reason to EXTERNAL_INTERRUPT */
+        __set_vvmcs(nest->vvmcs, VM_EXIT_REASON, EXIT_REASON_EXTERNAL_INTERRUPT);
+        __set_vvmcs(nest->vvmcs, EXIT_QUALIFICATION, 0);
+        __set_vvmcs(nest->vvmcs, VM_EXIT_INTR_INFO, nest->intr_info);
+        break;
+
+    case X86_EVENTTYPE_HW_EXCEPTION:
+    case X86_EVENTTYPE_SW_INTERRUPT:
+    case X86_EVENTTYPE_SW_EXCEPTION:
+        /* throw to L1 */
+        __set_vvmcs(nest->vvmcs, VM_EXIT_INTR_INFO, nest->intr_info);
+        __set_vvmcs(nest->vvmcs, VM_EXIT_INTR_ERROR_CODE, nest->error_code);
+        break;
+    case X86_EVENTTYPE_NMI:
+    default:
+        break;
+    }
+}
+
 static void virtual_vmexit(struct cpu_user_regs *regs)
 {
     struct vcpu *v = current;
@@ -982,6 +1010,8 @@
 #endif
 
     sync_vvmcs_ro(nest);
+    vmx_nest_intr_exit(nest);
+
     sync_vvmcs_guest_state(nest);
 
     vmx_vmcs_switch_current(v, v->arch.hvm_vmx.vmcs, nest->hvmcs);
@@ -1043,3 +1073,39 @@
         virtual_vmentry(regs);
     }
 }
+
+void vmx_nest_idtv_handling(void)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    unsigned int idtv_info = __vmread(IDT_VECTORING_INFO);
+
+    if ( likely(!(idtv_info & INTR_INFO_VALID_MASK)) )
+        return;
+
+    /*
+     * If L0 can solve the fault that causes idt vectoring, it should
+     * be reinjected, otherwise, pass to L1.
+     */
+    if ( (__vmread(VM_EXIT_REASON) != EXIT_REASON_EPT_VIOLATION &&
+          !(nest->intr_info & INTR_INFO_VALID_MASK)) ||
+         (__vmread(VM_EXIT_REASON) == EXIT_REASON_EPT_VIOLATION &&
+          !nest->vmexit_pending) )
+    {
+        __vmwrite(VM_ENTRY_INTR_INFO, idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
+        if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
+            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
+                        __vmread(IDT_VECTORING_ERROR_CODE));
+        /*
+         * SDM 23.2.4, if L1 tries to inject a software interrupt
+         * and the delivery fails, VM_EXIT_INSTRUCTION_LEN receives
+         * the value of previous VM_ENTRY_INSTRUCTION_LEN.
+         *
+         * This means VM_EXIT_INSTRUCTION_LEN is always valid here, for
+         * software interrupts both injected by L1 and generated in L2.
+         */
+        __vmwrite(VM_ENTRY_INSTRUCTION_LEN, __vmread(VM_EXIT_INSTRUCTION_LEN));
+    }
+
+    /* TODO: NMI */
+}
diff -r 4934d8db96bf xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:11:52 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:14:26 2010 +0800
@@ -1270,6 +1270,7 @@
 {
     unsigned long intr_fields;
     struct vcpu *curr = current;
+    struct vmx_nest_struct *nest = &curr->arch.hvm_vmx.nest;
 
     /*
      * NB. Callers do not need to worry about clearing STI/MOV-SS blocking:
@@ -1281,11 +1282,21 @@
 
     intr_fields = (INTR_INFO_VALID_MASK | (type<<8) | trap);
     if ( error_code != HVM_DELIVER_NO_ERROR_CODE ) {
-        __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
         intr_fields |= INTR_INFO_DELIVER_CODE_MASK;
+        if ( curr->arch.hvm_vcpu.in_nesting )
+            nest->error_code = error_code;
+        else
+            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
     }
 
-    __vmwrite(VM_ENTRY_INTR_INFO, intr_fields);
+    if ( curr->arch.hvm_vcpu.in_nesting )
+    {
+        nest->intr_info = intr_fields;
+        nest->vmexit_pending = 1;
+        return;
+    }
+    else
+        __vmwrite(VM_ENTRY_INTR_INFO, intr_fields);
 
     /* Can't inject exceptions in virtual 8086 mode because they would 
      * use the protected-mode IDT.  Emulate at the next vmenter instead. */
@@ -1295,9 +1306,14 @@
 
 void vmx_inject_hw_exception(int trap, int error_code)
 {
-    unsigned long intr_info = __vmread(VM_ENTRY_INTR_INFO);
+    unsigned long intr_info;
     struct vcpu *curr = current;
 
+    if ( curr->arch.hvm_vcpu.in_nesting )
+        intr_info = curr->arch.hvm_vmx.nest.intr_info;
+    else
+        intr_info = __vmread(VM_ENTRY_INTR_INFO);
+
     switch ( trap )
     {
     case TRAP_debug:
@@ -2287,9 +2303,31 @@
     return -1;
 }
 
+static void vmx_idtv_reinject(unsigned long idtv_info)
+{
+    if ( hvm_event_needs_reinjection((idtv_info>>8)&7, idtv_info&0xff) )
+    {
+        /* See SDM 3B 25.7.1.1 and .2 for info about masking resvd bits. */
+        __vmwrite(VM_ENTRY_INTR_INFO,
+                  idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
+        if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
+            __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
+                      __vmread(IDT_VECTORING_ERROR_CODE));
+    }
+
+    /*
+     * Clear NMI-blocking interruptibility info if an NMI delivery faulted.
+     * Re-delivery will re-set it (see SDM 3B 25.7.1.2).
+     */
+    if ( (idtv_info & INTR_INFO_INTR_TYPE_MASK) == (X86_EVENTTYPE_NMI<<8) )
+        __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
+                  __vmread(GUEST_INTERRUPTIBILITY_INFO) &
+                  ~VMX_INTR_SHADOW_NMI);
+}
+
 asmlinkage void vmx_vmexit_handler(struct cpu_user_regs *regs)
 {
-    unsigned int exit_reason, idtv_info, intr_info = 0, vector = 0;
+    unsigned int exit_reason, idtv_info = 0, intr_info = 0, vector = 0;
     unsigned long exit_qualification, inst_len = 0;
     struct vcpu *v = current;
 
@@ -2374,29 +2412,14 @@
 
     hvm_maybe_deassert_evtchn_irq();
 
-    /* Event delivery caused this intercept? Queue for redelivery. */
-    idtv_info = __vmread(IDT_VECTORING_INFO);
-    if ( unlikely(idtv_info & INTR_INFO_VALID_MASK) &&
-         (exit_reason != EXIT_REASON_TASK_SWITCH) )
+    /* TODO: consolidate nested idtv handling with ordinary one */
+    if ( !v->arch.hvm_vcpu.in_nesting )
     {
-        if ( hvm_event_needs_reinjection((idtv_info>>8)&7, idtv_info&0xff) )
-        {
-            /* See SDM 3B 25.7.1.1 and .2 for info about masking resvd bits. */
-            __vmwrite(VM_ENTRY_INTR_INFO,
-                      idtv_info & ~INTR_INFO_RESVD_BITS_MASK);
-            if ( idtv_info & INTR_INFO_DELIVER_CODE_MASK )
-                __vmwrite(VM_ENTRY_EXCEPTION_ERROR_CODE,
-                          __vmread(IDT_VECTORING_ERROR_CODE));
-        }
-
-        /*
-         * Clear NMI-blocking interruptibility info if an NMI delivery faulted.
-         * Re-delivery will re-set it (see SDM 3B 25.7.1.2).
-         */
-        if ( (idtv_info & INTR_INFO_INTR_TYPE_MASK) == (X86_EVENTTYPE_NMI<<8) )
-            __vmwrite(GUEST_INTERRUPTIBILITY_INFO,
-                      __vmread(GUEST_INTERRUPTIBILITY_INFO) &
-                      ~VMX_INTR_SHADOW_NMI);
+        /* Event delivery caused this intercept? Queue for redelivery. */
+        idtv_info = __vmread(IDT_VECTORING_INFO);
+        if ( unlikely(idtv_info & INTR_INFO_VALID_MASK) &&
+             (exit_reason != EXIT_REASON_TASK_SWITCH) )
+            vmx_idtv_reinject(idtv_info);
     }
 
     switch ( exit_reason )
@@ -2721,6 +2744,9 @@
         domain_crash(v->domain);
         break;
     }
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+        vmx_nest_idtv_handling();
 }
 
 asmlinkage void vmx_vmenter_helper(void)
diff -r 4934d8db96bf xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:11:52 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:14:26 2010 +0800
@@ -54,6 +54,9 @@
      * with vmresume_in_progress
      */
     int                  vmresume_in_progress;
+
+    unsigned long        intr_info;
+    unsigned long        error_code;
 };
 
 asmlinkage void vmx_nest_switch_mode(void);
@@ -76,4 +79,6 @@
                                             unsigned long value);
 void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
 
+void vmx_nest_idtv_handling(void);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 12/16] vmx: nest: VMExit handler in L2
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (10 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 11/16] vmx: nest: interrupt handling Qing He
@ 2010-09-08 15:22 ` Qing He
  2010-09-08 15:22 ` [PATCH 13/16] vmx: nest: L2 tsc Qing He
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Qing He @ 2010-09-08 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

Handle VMExits that happen in L2.
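
Most cases in the new handler reduce to one pattern: forward the exit
to L1 if and only if L1 enabled the corresponding execution control in
its virtual VMCS. A hypothetical helper capturing that pattern (the
diff below open-codes it per exit reason):

    static void maybe_forward_to_l1(struct vmx_nest_struct *nest, u32 bit)
    {
        if ( __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) & bit )
            nest->vmexit_pending = 1;  /* L1 asked for this exit */
    }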

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 7a9edf7654ad xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:14:26 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:15:00 2010 +0800
@@ -1109,3 +1109,224 @@
 
     /* TODO: NMI */
 }
+
+/*
+ * L2 VMExit handling
+ */
+
+int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
+                               unsigned int exit_reason)
+{
+    struct vcpu *v = current;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+    u32 ctrl;
+    int bypass_l0 = 0;
+
+    nest->vmexit_pending = 0;
+    nest->intr_info = 0;
+    nest->error_code = 0;
+
+    switch (exit_reason) {
+    case EXIT_REASON_EXCEPTION_NMI:
+    {
+        u32 intr_info = __vmread(VM_EXIT_INTR_INFO);
+        u32 valid_mask = (X86_EVENTTYPE_HW_EXCEPTION << 8) |
+                         INTR_INFO_VALID_MASK;
+        u64 exec_bitmap;
+        int vector = intr_info & INTR_INFO_VECTOR_MASK;
+
+        /*
+         * decided by the L0 and L1 exception bitmaps: if the vector is set
+         * by both, L0 has priority on #PF, L1 has priority on others
+         */
+        if ( vector == TRAP_page_fault )
+        {
+            if ( paging_mode_hap(v->domain) )
+                nest->vmexit_pending = 1;
+        }
+        else if ( (intr_info & valid_mask) == valid_mask )
+        {
+            exec_bitmap = __get_vvmcs(nest->vvmcs, EXCEPTION_BITMAP);
+
+            if ( exec_bitmap & (1 << vector) )
+                nest->vmexit_pending = 1;
+        }
+        break;
+    }
+
+    case EXIT_REASON_WBINVD:
+    case EXIT_REASON_EPT_VIOLATION:
+    case EXIT_REASON_EPT_MISCONFIG:
+    case EXIT_REASON_EXTERNAL_INTERRUPT:
+        /* pass to L0 handler */
+        break;
+
+    case VMX_EXIT_REASONS_FAILED_VMENTRY:
+    case EXIT_REASON_TRIPLE_FAULT:
+    case EXIT_REASON_TASK_SWITCH:
+    case EXIT_REASON_IO_INSTRUCTION:
+    case EXIT_REASON_CPUID:
+    case EXIT_REASON_MSR_READ:
+    case EXIT_REASON_MSR_WRITE:
+    case EXIT_REASON_VMCALL:
+    case EXIT_REASON_VMCLEAR:
+    case EXIT_REASON_VMLAUNCH:
+    case EXIT_REASON_VMPTRLD:
+    case EXIT_REASON_VMPTRST:
+    case EXIT_REASON_VMREAD:
+    case EXIT_REASON_VMRESUME:
+    case EXIT_REASON_VMWRITE:
+    case EXIT_REASON_VMXOFF:
+    case EXIT_REASON_VMXON:
+    case EXIT_REASON_INVEPT:
+        /* inject to L1 */
+        nest->vmexit_pending = 1;
+        break;
+
+    case EXIT_REASON_PENDING_VIRT_INTR:
+    {
+        ctrl = v->arch.hvm_vmx.exec_control;
+
+        /*
+         * if both open intr/nmi window, L0 has priority.
+         *
+         * Note that this is not strictly correct: in L2 context,
+         * L0's intr/nmi window flag should be replaced by MTF,
+         * causing an immediate VMExit, but MTF may not be available
+         * on all hardware.
+         */
+        if ( !(ctrl & CPU_BASED_VIRTUAL_INTR_PENDING) )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+    case EXIT_REASON_PENDING_VIRT_NMI:
+    {
+        ctrl = v->arch.hvm_vmx.exec_control;
+
+        if ( !(ctrl & CPU_BASED_VIRTUAL_NMI_PENDING) )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    /* L1 has priority handling several other types of exits */
+    case EXIT_REASON_HLT:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_HLT_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_RDTSC:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_RDTSC_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_RDPMC:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_RDPMC_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_MWAIT_INSTRUCTION:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_MWAIT_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_PAUSE_INSTRUCTION:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_PAUSE_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_MONITOR_INSTRUCTION:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_MONITOR_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_DR_ACCESS:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_MOV_DR_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_INVLPG:
+    {
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( ctrl & CPU_BASED_INVLPG_EXITING )
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+
+    case EXIT_REASON_CR_ACCESS:
+    {
+        u64 exit_qualification = __vmread(EXIT_QUALIFICATION);
+        int cr = exit_qualification & 15;
+        int write = (exit_qualification >> 4) & 3;
+        u32 mask = 0;
+
+        /* also according to guest exec_control */
+        ctrl = __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL);
+
+        if ( cr == 3 )
+        {
+            mask = write ? CPU_BASED_CR3_STORE_EXITING :
+                          CPU_BASED_CR3_LOAD_EXITING;
+            if ( ctrl & mask )
+                nest->vmexit_pending = 1;
+        }
+        else if ( cr == 8 )
+        {
+            mask = write ? CPU_BASED_CR8_STORE_EXITING :
+                          CPU_BASED_CR8_LOAD_EXITING;
+            if ( ctrl & mask )
+                nest->vmexit_pending = 1;
+        }
+        else  /* CR0, CR4, CLTS, LMSW */
+            nest->vmexit_pending = 1;
+
+        break;
+    }
+    default:
+        gdprintk(XENLOG_WARNING, "Unknown nested vmexit reason %x.\n",
+                 exit_reason);
+    }
+
+    if ( nest->vmexit_pending )
+        bypass_l0 = 1;
+
+    return bypass_l0;
+}
diff -r 7a9edf7654ad xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:14:26 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:15:00 2010 +0800
@@ -2373,6 +2373,11 @@
      * any pending vmresume has really happened
      */
     v->arch.hvm_vmx.nest.vmresume_in_progress = 0;
+    if ( v->arch.hvm_vcpu.in_nesting )
+    {
+        if ( vmx_nest_l2_vmexit_handler(regs, exit_reason) )
+            goto out;
+    }
 
     if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
         return vmx_failed_vmentry(exit_reason, regs);
@@ -2745,6 +2750,7 @@
         break;
     }
 
+out:
     if ( v->arch.hvm_vcpu.in_nesting )
         vmx_nest_idtv_handling();
 }
diff -r 7a9edf7654ad xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:14:26 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:15:00 2010 +0800
@@ -81,4 +81,7 @@
 
 void vmx_nest_idtv_handling(void);
 
+int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
+                               unsigned int exit_reason);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
diff -r 7a9edf7654ad xen/include/asm-x86/hvm/vmx/vmx.h
--- a/xen/include/asm-x86/hvm/vmx/vmx.h	Wed Sep 08 22:14:26 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h	Wed Sep 08 22:15:00 2010 +0800
@@ -112,6 +112,7 @@
 #define EXIT_REASON_APIC_ACCESS         44
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT              50
 #define EXIT_REASON_RDTSCP              51
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 13/16] vmx: nest: L2 tsc
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (11 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 12/16] vmx: nest: VMExit handler in L2 Qing He
@ 2010-09-08 15:22 ` Qing He
  2010-09-08 15:22 ` [PATCH 14/16] vmx: nest: CR0.TS and #NM Qing He
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Qing He @ 2010-09-08 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

L2 TSC needs special handling, whether rdtsc exiting is
turned on or off.
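
The offsets compose additively: the TSC value L2 observes is the host
TSC plus L0's offset (applied for L1) plus L1's TSC_OFFSET (applied for
L2). A one-line sketch of the arithmetic implemented by
vmx_set_tsc_offset() and the rdtsc fast path in the diff below:

    /* TSC seen by L2 = host TSC + L0's offset + L1's TSC_OFFSET */
    static u64 l2_tsc(u64 host_tsc, u64 l0_offset, u64 l1_offset)
    {
        return host_tsc + l0_offset + l1_offset;
    }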

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 0f6400481299 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 18:43:13 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 18:52:00 2010 +0800
@@ -647,6 +647,18 @@
  * Nested VMX context switch
  */
 
+u64 vmx_nest_get_tsc_offset(struct vcpu *v)
+{
+    u64 offset = 0;
+    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
+
+    if ( __get_vvmcs(nest->vvmcs, CPU_BASED_VM_EXEC_CONTROL) &
+         CPU_BASED_USE_TSC_OFFSETING )
+        offset = __get_vvmcs(nest->vvmcs, TSC_OFFSET);
+
+    return offset;
+}
+
 static unsigned long vmcs_gstate_field[] = {
     /* 16 BITS */
     GUEST_ES_SELECTOR,
@@ -818,6 +830,7 @@
 
 static void load_vvmcs_guest_state(struct vmx_nest_struct *nest)
 {
+    struct vcpu *v = current;
     int i;
 
     /* vvmcs.gstate to svmcs.gstate */
@@ -828,6 +841,8 @@
     hvm_set_cr4(__get_vvmcs(nest->vvmcs, GUEST_CR4));
     hvm_set_cr3(__get_vvmcs(nest->vvmcs, GUEST_CR3));
 
+    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
+
     vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INTR_INFO);
     vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_EXCEPTION_ERROR_CODE);
     vvmcs_to_shadow(nest->vvmcs, VM_ENTRY_INSTRUCTION_LEN);
@@ -936,6 +951,7 @@
 
 static void load_vvmcs_host_state(struct vmx_nest_struct *nest)
 {
+    struct vcpu *v = current;
     int i;
     u64 r;
 
@@ -949,6 +965,8 @@
     hvm_set_cr4(__get_vvmcs(nest->vvmcs, HOST_CR4));
     hvm_set_cr3(__get_vvmcs(nest->vvmcs, HOST_CR3));
 
+    hvm_funcs.set_tsc_offset(v, v->arch.hvm_vcpu.cache_tsc_offset);
+
     __set_vvmcs(nest->vvmcs, VM_ENTRY_INTR_INFO, 0);
 }
 
@@ -1205,6 +1223,21 @@
 
         if ( ctrl & CPU_BASED_RDTSC_EXITING )
             nest->vmexit_pending = 1;
+        else
+        {
+            uint64_t tsc;
+
+            /*
+             * special handling is needed if L1 doesn't intercept rdtsc,
+             * to avoid changing guest_tsc and messing up timekeeping in L1
+             */
+            tsc = hvm_get_guest_tsc(v);
+            tsc += __get_vvmcs(nest->vvmcs, TSC_OFFSET);
+            regs->eax = (uint32_t)tsc;
+            regs->edx = (uint32_t)(tsc >> 32);
+
+            bypass_l0 = 1;
+        }
 
         break;
     }
diff -r 0f6400481299 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 18:43:13 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 18:52:00 2010 +0800
@@ -969,6 +969,10 @@
 static void vmx_set_tsc_offset(struct vcpu *v, u64 offset)
 {
     vmx_vmcs_enter(v);
+
+    if ( v->arch.hvm_vcpu.in_nesting )
+        offset += vmx_nest_get_tsc_offset(v);
+
     __vmwrite(TSC_OFFSET, offset);
 #if defined (__i386__)
     __vmwrite(TSC_OFFSET_HIGH, offset >> 32);
diff -r 0f6400481299 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 18:43:13 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 18:52:00 2010 +0800
@@ -69,6 +69,8 @@
                                             unsigned long value);
 void vmx_nest_update_exception_bitmap(struct vcpu *v, unsigned long value);
 
+u64 vmx_nest_get_tsc_offset(struct vcpu *v);
+
 void vmx_nest_idtv_handling(void);
 
 int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 14/16] vmx: nest: CR0.TS and #NM
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (12 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 13/16] vmx: nest: L2 tsc Qing He
@ 2010-09-08 15:22 ` Qing He
  2010-09-08 15:22 ` [PATCH 15/16] vmx: nest: capability reporting MSRs Qing He
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 68+ messages in thread
From: Qing He @ 2010-09-08 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

1. #NM exits from L2 should be handled by L0 if L0 wants them.
2. HOST_CR0.TS may need to be updated after an L1<->L2 switch.
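
For point 1, the rule applied in the L2 exception dispatch (see the
hunk below) is that #NM is forwarded to L1 only when L0 no longer needs
it for its own lazy-FPU restore, i.e. the guest FPU state is already
loaded:

    /* #NM from L2: L0 handles lazy FPU restore first, then L1 */
    if ( vector == TRAP_no_device && v->fpu_dirtied )
        nest->vmexit_pending = 1;  /* L0 already restored FPU: L1's turn */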

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r a5f6653f4c5a xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 21:47:57 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Thu Apr 22 21:49:08 2010 +0800
@@ -791,6 +791,9 @@
     regs->rsp = __get_vvmcs(nest->vvmcs, GUEST_RSP);
     regs->rflags = __get_vvmcs(nest->vvmcs, GUEST_RFLAGS);
 
+    /* updating host cr0 to sync TS bit */
+    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
+
     /* TODO: EPT_POINTER */
 }
 
@@ -927,6 +930,9 @@
     regs->rsp = __get_vvmcs(nest->vvmcs, HOST_RSP);
     regs->rflags = __vmread(GUEST_RFLAGS);
 
+    /* updating host cr0 to sync TS bit */
+    __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
+
     vmreturn(regs, VMSUCCEED);
 }
 
@@ -1036,13 +1042,18 @@
 
         /*
          * decided by the L0 and L1 exception bitmaps: if the vector is set
-         * by both, L0 has priority on #PF, L1 has priority on others
+         * by both, L0 has priority on #PF and #NM, L1 has priority on others
          */
         if ( vector == TRAP_page_fault )
         {
             if ( paging_mode_hap(v->domain) )
                 nest->vmexit_pending = 1;
         }
+        else if ( vector == TRAP_no_device )
+        {
+            if ( v->fpu_dirtied )
+                nest->vmexit_pending = 1;
+        }
         else if ( (intr_info & valid_mask) == valid_mask )
         {
             exec_bitmap =__get_vvmcs(nest->vvmcs, EXCEPTION_BITMAP);

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 15/16] vmx: nest: capability reporting MSRs
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (13 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 14/16] vmx: nest: CR0.TS and #NM Qing He
@ 2010-09-08 15:22 ` Qing He
  2010-09-13 12:45   ` Tim Deegan
  2010-09-15 10:05   ` Christoph Egger
  2010-09-08 15:22 ` [PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE Qing He
  2010-09-13 13:10 ` [PATCH 00/16] Nested virtualization for VMX Tim Deegan
  16 siblings, 2 replies; 68+ messages in thread
From: Qing He @ 2010-09-08 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

Handle the VMX capability reporting MSRs.
Some features are masked so that L1 sees a rather
simple configuration.
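
All the masking cases follow one shape: the allowed-1 settings live in
the high half (EDX) of the raw MSR, so hiding a feature means clearing
its bit there before recombining with the allowed-0 half in EAX. A
minimal sketch of that recombination, assuming the same REMOVED_*_CAP
masks as the diff (the helper name is made up for illustration):

    static u64 mask_vmx_cap(u32 eax, u32 edx, u32 removed)
    {
        return ((u64)(edx & ~removed) << 32) | eax;
    }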

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 694dcf6c3f06 xen/arch/x86/hvm/vmx/nest.c
--- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 19:47:14 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 19:47:39 2010 +0800
@@ -1352,3 +1352,91 @@
 
     return bypass_l0;
 }
+
+/*
+ * Capability reporting
+ */
+int vmx_nest_msr_read_intercept(unsigned int msr, u64 *msr_content)
+{
+    u32 eax, edx;
+    u64 data = 0;
+    int r = 1;
+    u32 mask = 0;
+
+    if ( !is_nested_avail(current->domain) )
+        return 0;
+
+    switch (msr) {
+    case MSR_IA32_VMX_BASIC:
+        rdmsr(msr, eax, edx);
+        data = edx;
+        data = (data & ~0x1fff) | 0x1000;     /* request 4KB for guest VMCS */
+        data &= ~(1 << 23);                   /* disable TRUE_xxx_CTLS */
+        data = (data << 32) | VVMCS_REVISION; /* VVMCS revision */
+        break;
+    case MSR_IA32_VMX_PINBASED_CTLS:
+#define REMOVED_PIN_CONTROL_CAP (PIN_BASED_PREEMPT_TIMER)
+        rdmsr(msr, eax, edx);
+        data = edx & ~REMOVED_PIN_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_PROCBASED_CTLS:
+        rdmsr(msr, eax, edx);
+#define REMOVED_EXEC_CONTROL_CAP (CPU_BASED_TPR_SHADOW \
+            | CPU_BASED_ACTIVATE_MSR_BITMAP            \
+            | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
+        data = edx & ~REMOVED_EXEC_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_EXIT_CTLS:
+        rdmsr(msr, eax, edx);
+#define REMOVED_EXIT_CONTROL_CAP (VM_EXIT_SAVE_GUEST_PAT \
+            | VM_EXIT_LOAD_HOST_PAT                      \
+            | VM_EXIT_SAVE_GUEST_EFER                    \
+            | VM_EXIT_LOAD_HOST_EFER                     \
+            | VM_EXIT_SAVE_PREEMPT_TIMER)
+        data = edx & ~REMOVED_EXIT_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_ENTRY_CTLS:
+        rdmsr(msr, eax, edx);
+#define REMOVED_ENTRY_CONTROL_CAP (VM_ENTRY_LOAD_GUEST_PAT \
+            | VM_ENTRY_LOAD_GUEST_EFER)
+        data = edx & ~REMOVED_ENTRY_CONTROL_CAP;
+        data = (data << 32) | eax;
+        break;
+    case MSR_IA32_VMX_PROCBASED_CTLS2:
+        mask = 0;
+
+        rdmsr(msr, eax, edx);
+        data = edx & mask;
+        data = (data << 32) | eax;
+        break;
+
+    /* pass through MSRs */
+    case IA32_FEATURE_CONTROL_MSR:
+    case MSR_IA32_VMX_MISC:
+    case MSR_IA32_VMX_CR0_FIXED0:
+    case MSR_IA32_VMX_CR0_FIXED1:
+    case MSR_IA32_VMX_CR4_FIXED0:
+    case MSR_IA32_VMX_CR4_FIXED1:
+    case MSR_IA32_VMX_VMCS_ENUM:
+        rdmsr(msr, eax, edx);
+        data = edx;
+        data = (data << 32) | eax;
+        break;
+
+    default:
+        r = 0;
+        break;
+    }
+
+    *msr_content = data;
+    return r;
+}
+
+int vmx_nest_msr_write_intercept(unsigned int msr, u64 msr_content)
+{
+    /* silently ignore for now */
+    return 1;
+}
diff -r 694dcf6c3f06 xen/arch/x86/hvm/vmx/vmx.c
--- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 19:47:14 2010 +0800
+++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 19:47:39 2010 +0800
@@ -1877,8 +1877,11 @@
         *msr_content |= (u64)__vmread(GUEST_IA32_DEBUGCTL_HIGH) << 32;
 #endif
         break;
-    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_PROCBASED_CTLS2:
-        goto gp_fault;
+    case IA32_FEATURE_CONTROL_MSR:
+    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+        if ( !vmx_nest_msr_read_intercept(msr, msr_content) )
+            goto gp_fault;
+        break;
     case MSR_IA32_MISC_ENABLE:
         rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
         /* Debug Trace Store is not supported. */
@@ -2043,8 +2046,11 @@
 
         break;
     }
-    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_PROCBASED_CTLS2:
-        goto gp_fault;
+    case IA32_FEATURE_CONTROL_MSR:
+    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+        if ( !vmx_nest_msr_write_intercept(msr, msr_content) )
+            goto gp_fault;
+        break;
     default:
         if ( vpmu_do_wrmsr(msr, msr_content) )
             return X86EMUL_OKAY;
diff -r 694dcf6c3f06 xen/include/asm-x86/hvm/vmx/nest.h
--- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 19:47:14 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 19:47:39 2010 +0800
@@ -76,4 +76,9 @@
 int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
                                unsigned int exit_reason);
 
+int vmx_nest_msr_read_intercept(unsigned int msr,
+                                u64 *msr_content);
+int vmx_nest_msr_write_intercept(unsigned int msr,
+                                 u64 msr_content);
+
 #endif /* __ASM_X86_HVM_NEST_H__ */
diff -r 694dcf6c3f06 xen/include/asm-x86/hvm/vmx/vmcs.h
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 19:47:14 2010 +0800
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 19:47:39 2010 +0800
@@ -161,18 +161,23 @@
 #define PIN_BASED_EXT_INTR_MASK         0x00000001
 #define PIN_BASED_NMI_EXITING           0x00000008
 #define PIN_BASED_VIRTUAL_NMIS          0x00000020
+#define PIN_BASED_PREEMPT_TIMER         0x00000040
 extern u32 vmx_pin_based_exec_control;
 
 #define VM_EXIT_IA32E_MODE              0x00000200
 #define VM_EXIT_ACK_INTR_ON_EXIT        0x00008000
 #define VM_EXIT_SAVE_GUEST_PAT          0x00040000
 #define VM_EXIT_LOAD_HOST_PAT           0x00080000
+#define VM_EXIT_SAVE_GUEST_EFER         0x00100000
+#define VM_EXIT_LOAD_HOST_EFER          0x00200000
+#define VM_EXIT_SAVE_PREEMPT_TIMER      0x00400000
 extern u32 vmx_vmexit_control;
 
 #define VM_ENTRY_IA32E_MODE             0x00000200
 #define VM_ENTRY_SMM                    0x00000400
 #define VM_ENTRY_DEACT_DUAL_MONITOR     0x00000800
 #define VM_ENTRY_LOAD_GUEST_PAT         0x00004000
+#define VM_ENTRY_LOAD_GUEST_EFER        0x00008000
 extern u32 vmx_vmentry_control;
 
 #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
diff -r 694dcf6c3f06 xen/include/asm-x86/msr-index.h
--- a/xen/include/asm-x86/msr-index.h	Wed Sep 08 19:47:14 2010 +0800
+++ b/xen/include/asm-x86/msr-index.h	Wed Sep 08 19:47:39 2010 +0800
@@ -172,6 +172,7 @@
 #define MSR_IA32_VMX_CR0_FIXED1                 0x487
 #define MSR_IA32_VMX_CR4_FIXED0                 0x488
 #define MSR_IA32_VMX_CR4_FIXED1                 0x489
+#define MSR_IA32_VMX_VMCS_ENUM                  0x48a
 #define MSR_IA32_VMX_PROCBASED_CTLS2            0x48b
 #define MSR_IA32_VMX_EPT_VPID_CAP               0x48c
 #define MSR_IA32_VMX_TRUE_PINBASED_CTLS         0x48d

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (14 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 15/16] vmx: nest: capability reporting MSRs Qing He
@ 2010-09-08 15:22 ` Qing He
  2010-09-15  9:43   ` Christoph Egger
  2010-09-13 13:10 ` [PATCH 00/16] Nested virtualization for VMX Tim Deegan
  16 siblings, 1 reply; 68+ messages in thread
From: Qing He @ 2010-09-08 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

Expose the VMX CPUID bit and allow the guest to enable VMX (CR4.VMXE).
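
The guest-visible side is just CPUID leaf 1: ECX bit 5 advertises VMX,
so the toolstack sets it only when nested HVM is enabled for the
domain. A sketch of that check (bit position per the SDM; the constant
name here is hypothetical, the diff below uses the literal 0x20):

    #define CPUID1_ECX_VMX (1u << 5)   /* hypothetical name */

    if ( nested_hvm_enabled )          /* HVM_PARAM_NESTEDHVM != 0 */
        regs[2] |= CPUID1_ECX_VMX;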

Signed-off-by: Qing He <qing.he@intel.com>
Signed-off-by: Eddie Dong <eddie.dong@intel.com>

---

diff -r 3f40a1f79cf8 tools/libxc/xc_cpuid_x86.c
--- a/tools/libxc/xc_cpuid_x86.c	Wed Sep 08 19:47:39 2010 +0800
+++ b/tools/libxc/xc_cpuid_x86.c	Wed Sep 08 19:49:06 2010 +0800
@@ -128,8 +128,17 @@
     const unsigned int *input, unsigned int *regs,
     int is_pae)
 {
+    unsigned long nest;
+
     switch ( input[0] )
     {
+    case 0x00000001:
+        /* ECX[5] is availability of VMX */
+        xc_get_hvm_param(xch, domid, HVM_PARAM_NESTEDHVM, &nest);
+        if (nest)
+            regs[2] |= 0x20;
+        break;
+
     case 0x00000004:
         /*
          * EAX[31:26] is Maximum Cores Per Package (minus one).
diff -r 3f40a1f79cf8 xen/include/asm-x86/hvm/hvm.h
--- a/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 19:47:39 2010 +0800
+++ b/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 19:49:06 2010 +0800
@@ -295,7 +295,8 @@
         X86_CR4_DE  | X86_CR4_PSE | X86_CR4_PAE |       \
         X86_CR4_MCE | X86_CR4_PGE | X86_CR4_PCE |       \
         X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT |           \
-        (cpu_has_xsave ? X86_CR4_OSXSAVE : 0))))
+        (cpu_has_xsave ? X86_CR4_OSXSAVE : 0)   |       \
+        X86_CR4_VMXE)))
 
 /* These exceptions must always be intercepted. */
 #define HVM_TRAP_MASK ((1U << TRAP_machine_check) | (1U << TRAP_invalid_op))

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 04/16] vmx: nest: nested control structure
  2010-09-08 15:22 ` [PATCH 04/16] vmx: nest: nested control structure Qing He
@ 2010-09-09  6:13   ` Dong, Eddie
  2010-09-15 11:27   ` Christoph Egger
  1 sibling, 0 replies; 68+ messages in thread
From: Dong, Eddie @ 2010-09-09  6:13 UTC (permalink / raw)
  To: xen-devel; +Cc: Dong, Eddie, He, Qing

Qing He wrote:
> v->arch.hvm_vmx.nest as control structure
> 
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> 
> ---
> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/nest.h
> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:03:41 2010
> +0800 @@ -0,0 +1,45 @@
> +/*
> + * nest.h: nested virtualization for VMX.
> + *
> + * Copyright (c) 2010, Intel Corporation.
> + * Author: Qing He <qing.he@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or
> modify it + * under the terms and conditions of the GNU General
> Public License, + * version 2, as published by the Free Software
> Foundation. + *
> + * This program is distributed in the hope it will be useful, but
> WITHOUT + * ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> General Public License for + * more details.
> + *
> + * You should have received a copy of the GNU General Public License
> along with + * this program; if not, write to the Free Software
> Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA
> 02111-1307 USA. + *
> + */
> +#ifndef __ASM_X86_HVM_NEST_H__
> +#define __ASM_X86_HVM_NEST_H__
> +
> +struct vmcs_struct;
> +
> +struct vmx_nest_struct {
> +    paddr_t              guest_vmxon_pa;
> +
> +    /* Saved host vmcs for vcpu itself */
> +    struct vmcs_struct  *hvmcs;
> +
> +    /*
> +     * Guest's `current vmcs' of vcpu
> +     *  - gvmcs_pa: guest VMCS region physical address
> +     *  - vvmcs:    (guest) virtual vmcs
> +     *  - svmcs:    effective vmcs for the guest of this vcpu
> +     *  - valid:    launch state: invalid on clear, valid on ld
> +     */
> +    paddr_t              gvmcs_pa;
> +    void                *vvmcs;
> +    struct vmcs_struct  *svmcs;
> +    int                  vmcs_valid;
> +};
> +
> +#endif /* __ASM_X86_HVM_NEST_H__ */
> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/vmcs.h
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:00:00 2010
> +0800 +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:03:41
> 2010 +0800 @@ -22,6 +22,7 @@
>  #include <asm/config.h>
>  #include <asm/hvm/io.h>
>  #include <asm/hvm/vpmu.h>
> +#include <asm/hvm/vmx/nest.h>
> 
>  extern void vmcs_dump_vcpu(struct vcpu *v);
>  extern void setup_vmcs_dump(void);
> @@ -99,6 +100,9 @@
>      u32                  secondary_exec_control;
>      u32                  exception_bitmap;
> 
> +    /* nested virtualization */
> +    struct vmx_nest_struct nest;
> +
>  #ifdef __x86_64__
>      struct vmx_msr_state msr_state;
>      unsigned long        shadow_gs;
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

Hi, 

This one reminds me of something from when I was reviewing the other VMM-side patch.
The terminology here is pretty hard to follow. One naming solution, from the nested virtualization layering point of view, is to use subscripts (the IBM folks use this in their forthcoming OSDI paper as well). Here is an example:

L2 guest
L1 guest (VMM)
L0 VMM

vmcs01 is the VMCS that the L0 VMM uses for the L1 guest, like the vmcs in single-layer virtualization. (hvmcs in the current patch)
vmcs12 is the VMCS that the L1 VMM uses for the L2 guest; this vmcs12 is visible to the L0 VMM. (vvmcs in the current patch)
vmcs02 is the VMCS that the L0 VMM uses for the L2 guest while the L2 guest is executing; it is a kind of shadow of vmcs12. (svmcs in the current patch)
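
In terms of the structure from patch 04, the proposal would amount to a
field rename along these lines (hypothetical, just to visualize the
mapping):

    struct vmx_nest_struct {
        struct vmcs_struct  *vmcs01;  /* was hvmcs: L0's VMCS for L1 */
        void                *vmcs12;  /* was vvmcs: L1's VMCS for L2 */
        struct vmcs_struct  *vmcs02;  /* was svmcs: L0's VMCS for L2 */
        /* ... */
    };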

Would this be clearer? Comments appreciated (from anyone except Qing and me).

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-08 15:22 ` [PATCH 06/16] vmx: nest: handling VMX instruction exits Qing He
@ 2010-09-10  7:05   ` Dong, Eddie
  2010-09-13 11:11     ` Tim Deegan
  2010-09-13 11:10   ` Tim Deegan
  1 sibling, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-10  7:05 UTC (permalink / raw)
  To: xen-devel; +Cc: Dong, Eddie, He, Qing

Qing He wrote:
> add a VMX instruction decoder and handle simple VMX instructions
> except vmlaunch/vmresume and invept
> 
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> 
> ---
> 

> +static int __clear_current_vvmcs(struct vmx_nest_struct *nest)
> +{
> +    int rc;
> +
> +    if ( nest->svmcs )
> +        __vmpclear(virt_to_maddr(nest->svmcs));
> +
> +#if !CONFIG_VVMCS_MAPPING
> +    rc = hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs,


Qing:
	Why might this fail? The only possible reason seems to be nest->gvmcs_pa, but I guess we have already verified that address.

Thx, Eddie

> PAGE_SIZE); +    if ( rc != HVMCOPY_okay )
> +        return X86EMUL_EXCEPTION;
> +#endif
> +
> +    nest->vmcs_valid = 0;
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +/*

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 01/16] vmx: nest: rename host_vmcs
  2010-09-08 15:22 ` [PATCH 01/16] vmx: nest: rename host_vmcs Qing He
@ 2010-09-10 13:27   ` Christoph Egger
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Egger @ 2010-09-10 13:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 08 September 2010 17:22:09 Qing He wrote:
> the VMCS region used for vmxon is named host_vmcs, which is
> somewhat misleading in nested virtualization context, rename it
> to vmxon_vmcs.
>
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>

Acked-by: Christoph Egger <Christoph.Egger@amd.com>

> ---
>
> diff -r d6a8d49f3526 xen/arch/x86/hvm/vmx/vmcs.c
> --- a/xen/arch/x86/hvm/vmx/vmcs.c	Mon Jul 26 14:42:21 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c	Wed Aug 04 16:30:40 2010 +0800
> @@ -67,7 +67,7 @@
>  u64 vmx_ept_vpid_cap __read_mostly;
>  bool_t cpu_has_vmx_ins_outs_instr_info __read_mostly;
>
> -static DEFINE_PER_CPU_READ_MOSTLY(struct vmcs_struct *, host_vmcs);
> +static DEFINE_PER_CPU_READ_MOSTLY(struct vmcs_struct *, vmxon_vmcs);
>  static DEFINE_PER_CPU(struct vmcs_struct *, current_vmcs);
>  static DEFINE_PER_CPU(struct list_head, active_vmcs_list);
>
> @@ -427,11 +427,11 @@
>
>  int vmx_cpu_up_prepare(unsigned int cpu)
>  {
> -    if ( per_cpu(host_vmcs, cpu) != NULL )
> +    if ( per_cpu(vmxon_vmcs, cpu) != NULL )
>          return 0;
>
> -    per_cpu(host_vmcs, cpu) = vmx_alloc_vmcs();
> -    if ( per_cpu(host_vmcs, cpu) != NULL )
> +    per_cpu(vmxon_vmcs, cpu) = vmx_alloc_vmcs();
> +    if ( per_cpu(vmxon_vmcs, cpu) != NULL )
>          return 0;
>
>      printk("CPU%d: Could not allocate host VMCS\n", cpu);
> @@ -440,8 +440,8 @@
>
>  void vmx_cpu_dead(unsigned int cpu)
>  {
> -    vmx_free_vmcs(per_cpu(host_vmcs, cpu));
> -    per_cpu(host_vmcs, cpu) = NULL;
> +    vmx_free_vmcs(per_cpu(vmxon_vmcs, cpu));
> +    per_cpu(vmxon_vmcs, cpu) = NULL;
>  }
>
>  int vmx_cpu_up(void)
> @@ -498,7 +498,7 @@
>      if ( (rc = vmx_cpu_up_prepare(cpu)) != 0 )
>          return rc;
>
> -    switch ( __vmxon(virt_to_maddr(this_cpu(host_vmcs))) )
> +    switch ( __vmxon(virt_to_maddr(this_cpu(vmxon_vmcs))) )
>      {
>      case -2: /* #UD or #GP */
>          if ( bios_locked &&
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 02/16] vmx: nest: wrapper for control update
  2010-09-08 15:22 ` [PATCH 02/16] vmx: nest: wrapper for control update Qing He
@ 2010-09-10 13:29   ` Christoph Egger
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Egger @ 2010-09-10 13:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 08 September 2010 17:22:10 Qing He wrote:
> In nested virtualization, the L0 controls may not be the same
> as the controls in the physical VMCS.
> Explicitly maintain the guest controls in variables and use wrappers
> for control updates; do not rely on the physical control values.
>
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>

Acked-By: Christoph Egger <Christoph.Egger@amd.com>

> ---
>
> diff -r 905ca9cc0596 xen/arch/x86/hvm/vmx/intr.c
> --- a/xen/arch/x86/hvm/vmx/intr.c       Wed Aug 04 16:30:40 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/intr.c       Thu Aug 05 15:32:24 2010 +0800
> @@ -106,7 +106,7 @@
>      if ( !(*cpu_exec_control & ctl) )
>      {
>          *cpu_exec_control |= ctl;
> -        __vmwrite(CPU_BASED_VM_EXEC_CONTROL, *cpu_exec_control);
> +        vmx_update_cpu_exec_control(v);
>      }
>  }
>
> @@ -121,7 +121,7 @@
>      if ( unlikely(v->arch.hvm_vcpu.single_step) )
>      {
>          v->arch.hvm_vmx.exec_control |= CPU_BASED_MONITOR_TRAP_FLAG;
> -        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
> v->arch.hvm_vmx.exec_control); +        vmx_update_cpu_exec_control(v);
>          return;
>      }
>
> diff -r 905ca9cc0596 xen/arch/x86/hvm/vmx/vmcs.c
> --- a/xen/arch/x86/hvm/vmx/vmcs.c       Wed Aug 04 16:30:40 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c       Thu Aug 05 15:32:24 2010 +0800
> @@ -839,10 +839,10 @@
>      __vmwrite(VMCS_LINK_POINTER_HIGH, ~0UL);
>  #endif
>
> -    __vmwrite(EXCEPTION_BITMAP,
> -              HVM_TRAP_MASK
> +    v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
>
>                | (paging_mode_hap(d) ? 0 : (1U << TRAP_page_fault))
>
> -              | (1U << TRAP_no_device));
> +              | (1U << TRAP_no_device);
> +    vmx_update_exception_bitmap(v);
>
>      v->arch.hvm_vcpu.guest_cr[0] = X86_CR0_PE | X86_CR0_ET;
>      hvm_update_guest_cr(v, 0);
> diff -r 905ca9cc0596 xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c        Wed Aug 04 16:30:40 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmx.c        Thu Aug 05 15:32:24 2010 +0800
> @@ -385,6 +385,22 @@
>
>  #endif /* __i386__ */
>
> +void vmx_update_cpu_exec_control(struct vcpu *v)
> +{
> +    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
> +}
> +
> +void vmx_update_secondary_exec_control(struct vcpu *v)
> +{
> +    __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> +              v->arch.hvm_vmx.secondary_exec_control);
> +}
> +
> +void vmx_update_exception_bitmap(struct vcpu *v)
> +{
> +    __vmwrite(EXCEPTION_BITMAP, v->arch.hvm_vmx.exception_bitmap);
> +}
> +
>  static int vmx_guest_x86_mode(struct vcpu *v)
>  {
>      unsigned int cs_ar_bytes;
> @@ -408,7 +424,7 @@
>      /* Clear the DR dirty flag and re-enable intercepts for DR accesses.
> */ v->arch.hvm_vcpu.flag_dr_dirty = 0;
>      v->arch.hvm_vmx.exec_control |= CPU_BASED_MOV_DR_EXITING;
> -    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
> +    vmx_update_cpu_exec_control(v);
>
>      v->arch.guest_context.debugreg[0] = read_debugreg(0);
>      v->arch.guest_context.debugreg[1] = read_debugreg(1);
> @@ -622,7 +638,8 @@
>  static void vmx_fpu_enter(struct vcpu *v)
>  {
>      setup_fpu(v);
> -    __vm_clear_bit(EXCEPTION_BITMAP, TRAP_no_device);
> +    v->arch.hvm_vmx.exception_bitmap &= ~(1u << TRAP_no_device);
> +    vmx_update_exception_bitmap(v);
>      v->arch.hvm_vmx.host_cr0 &= ~X86_CR0_TS;
>      __vmwrite(HOST_CR0, v->arch.hvm_vmx.host_cr0);
>  }
> @@ -648,7 +665,8 @@
>      {
>          v->arch.hvm_vcpu.hw_cr[0] |= X86_CR0_TS;
>          __vmwrite(GUEST_CR0, v->arch.hvm_vcpu.hw_cr[0]);
> -        __vm_set_bit(EXCEPTION_BITMAP, TRAP_no_device);
> +        v->arch.hvm_vmx.exception_bitmap |= (1u << TRAP_no_device);
> +        vmx_update_exception_bitmap(v);
>      }
>  }
>
> @@ -954,7 +972,7 @@
>      v->arch.hvm_vmx.exec_control &= ~CPU_BASED_RDTSC_EXITING;
>      if ( enable )
>          v->arch.hvm_vmx.exec_control |= CPU_BASED_RDTSC_EXITING;
> -    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
> +    vmx_update_cpu_exec_control(v);
>      vmx_vmcs_exit(v);
>  }
>
> @@ -1047,7 +1065,7 @@
>
>  void vmx_update_debug_state(struct vcpu *v)
>  {
> -    unsigned long intercepts, mask;
> +    unsigned long mask;
>
>      ASSERT(v == current);
>
> @@ -1055,12 +1073,11 @@
>      if ( !cpu_has_monitor_trap_flag )
>          mask |= 1u << TRAP_debug;
>
> -    intercepts = __vmread(EXCEPTION_BITMAP);
>      if ( v->arch.hvm_vcpu.debug_state_latch )
> -        intercepts |= mask;
> +        v->arch.hvm_vmx.exception_bitmap |= mask;
>      else
> -        intercepts &= ~mask;
> -    __vmwrite(EXCEPTION_BITMAP, intercepts);
> +        v->arch.hvm_vmx.exception_bitmap &= ~mask;
> +    vmx_update_exception_bitmap(v);
>  }
>
>  static void vmx_update_guest_cr(struct vcpu *v, unsigned int cr)
> @@ -1087,7 +1104,7 @@
>              v->arch.hvm_vmx.exec_control &= ~cr3_ctls;
>              if ( !hvm_paging_enabled(v) )
>                  v->arch.hvm_vmx.exec_control |= cr3_ctls;
> -            __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
> v->arch.hvm_vmx.exec_control); +            vmx_update_cpu_exec_control(v);
>
>              /* Changing CR0.PE can change some bits in real CR4. */
>              vmx_update_guest_cr(v, 4);
> @@ -1122,7 +1139,8 @@
>                      vmx_set_segment_register(v, s, &reg[s]);
>                  v->arch.hvm_vcpu.hw_cr[4] |= X86_CR4_VME;
>                  __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]);
> -                __vmwrite(EXCEPTION_BITMAP, 0xffffffff);
> +                v->arch.hvm_vmx.exception_bitmap = 0xffffffff;
> +                vmx_update_exception_bitmap(v);
>              }
>              else
>              {
> @@ -1134,11 +1152,11 @@
>                      ((v->arch.hvm_vcpu.hw_cr[4] & ~X86_CR4_VME)
>
>                       |(v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_VME));
>
>                  __vmwrite(GUEST_CR4, v->arch.hvm_vcpu.hw_cr[4]);
> -                __vmwrite(EXCEPTION_BITMAP,
> -                          HVM_TRAP_MASK
> +                v->arch.hvm_vmx.exception_bitmap = HVM_TRAP_MASK
>
>                            | (paging_mode_hap(v->domain) ?
>
>                               0 : (1U << TRAP_page_fault))
> -                          | (1U << TRAP_no_device));
> +                          | (1U << TRAP_no_device);
> +                vmx_update_exception_bitmap(v);
>                  vmx_update_debug_state(v);
>              }
>          }
> @@ -1544,7 +1562,7 @@
>
>      /* Allow guest direct access to DR registers */
>      v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MOV_DR_EXITING;
> -    __vmwrite(CPU_BASED_VM_EXEC_CONTROL, v->arch.hvm_vmx.exec_control);
> +    vmx_update_cpu_exec_control(v);
>  }
>
>  static void vmx_invlpg_intercept(unsigned long vaddr)
> @@ -1928,18 +1946,18 @@
>  void vmx_vlapic_msr_changed(struct vcpu *v)
>  {
>      struct vlapic *vlapic = vcpu_vlapic(v);
> -    uint32_t ctl;
>
>      if ( !cpu_has_vmx_virtualize_apic_accesses )
>          return;
>
>      vmx_vmcs_enter(v);
> -    ctl  = __vmread(SECONDARY_VM_EXEC_CONTROL);
> -    ctl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +    v->arch.hvm_vmx.secondary_exec_control
> +        &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
>      if ( !vlapic_hw_disabled(vlapic) &&
>           (vlapic_base_address(vlapic) == APIC_DEFAULT_PHYS_BASE) )
> -        ctl |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> -    __vmwrite(SECONDARY_VM_EXEC_CONTROL, ctl);
> +        v->arch.hvm_vmx.secondary_exec_control
> +            |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> +    vmx_update_secondary_exec_control(v);
>      vmx_vmcs_exit(v);
>  }
>
> @@ -2469,14 +2487,12 @@
>      case EXIT_REASON_PENDING_VIRT_INTR:
>          /* Disable the interrupt window. */
>          v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
> -        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
> -                  v->arch.hvm_vmx.exec_control);
> +        vmx_update_cpu_exec_control(v);
>          break;
>      case EXIT_REASON_PENDING_VIRT_NMI:
>          /* Disable the NMI window. */
>          v->arch.hvm_vmx.exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
> -        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
> -                  v->arch.hvm_vmx.exec_control);
> +        vmx_update_cpu_exec_control(v);
>          break;
>      case EXIT_REASON_TASK_SWITCH: {
>          const enum hvm_task_switch_reason reasons[] = {
> @@ -2627,7 +2643,7 @@
>
>      case EXIT_REASON_MONITOR_TRAP_FLAG:
>          v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
> -        __vmwrite(CPU_BASED_VM_EXEC_CONTROL,
> v->arch.hvm_vmx.exec_control); +        vmx_update_cpu_exec_control(v);
>          if ( v->domain->debugger_attached && v->arch.hvm_vcpu.single_step
> ) domain_pause_for_debugger();
>          break;
> @@ -2677,16 +2693,14 @@
>              /* VPID was disabled: now enabled. */
>              curr->arch.hvm_vmx.secondary_exec_control |=
>                  SECONDARY_EXEC_ENABLE_VPID;
> -            __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> -                      curr->arch.hvm_vmx.secondary_exec_control);
> +            vmx_update_secondary_exec_control(curr);
>          }
>          else if ( old_asid && !new_asid )
>          {
>              /* VPID was enabled: now disabled. */
>              curr->arch.hvm_vmx.secondary_exec_control &=
>                  ~SECONDARY_EXEC_ENABLE_VPID;
> -            __vmwrite(SECONDARY_VM_EXEC_CONTROL,
> -                      curr->arch.hvm_vmx.secondary_exec_control);
> +            vmx_update_secondary_exec_control(curr);
>          }
>      }
>
> diff -r 905ca9cc0596 xen/include/asm-x86/hvm/vmx/vmcs.h
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h        Wed Aug 04 16:30:40 2010
> +0800 +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h        Thu Aug 05 15:32:24
> 2010 +0800 @@ -97,6 +97,7 @@
>      /* Cache of cpu execution control. */
>      u32                  exec_control;
>      u32                  secondary_exec_control;
> +    u32                  exception_bitmap;
>
>  #ifdef __x86_64__
>      struct vmx_msr_state msr_state;
> diff -r 905ca9cc0596 xen/include/asm-x86/hvm/vmx/vmx.h
> --- a/xen/include/asm-x86/hvm/vmx/vmx.h Wed Aug 04 16:30:40 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vmx/vmx.h Thu Aug 05 15:32:24 2010 +0800
> @@ -60,6 +60,9 @@
>  void vmx_vlapic_msr_changed(struct vcpu *v);
>  void vmx_realmode(struct cpu_user_regs *regs);
>  void vmx_update_debug_state(struct vcpu *v);
> +void vmx_update_cpu_exec_control(struct vcpu *v);
> +void vmx_update_secondary_exec_control(struct vcpu *v);
> +void vmx_update_exception_bitmap(struct vcpu *v);
>
>  /*
>   * Exit Reasons
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 05/16] vmx: nest: virtual vmcs layout
  2010-09-08 15:22 ` [PATCH 05/16] vmx: nest: virtual vmcs layout Qing He
@ 2010-09-13 10:29   ` Tim Deegan
  0 siblings, 0 replies; 68+ messages in thread
From: Tim Deegan @ 2010-09-13 10:29 UTC (permalink / raw)
  To: Qing He; +Cc: Eddie, xen-devel, Dong

At 16:22 +0100 on 08 Sep (1283962933), Qing He wrote:
> + * Since the physical VMCS layout is unknown, a custom layout is used
> + * for the virtual VMCS seen by the guest. It occupies a 4k page, and
> + * each field is addressed by a 9-bit offset into u64[]. The offset
> + * layout is as follows, which means every <width, type> pair has at
> + * most 32 fields available.

A question - is this layout likely to have to change often in future?
(I guess the answer's no but you'll know best.)  We'll have to carry
compatibility code indefinitely so that VMs can be migrated to newer
Xens without breaking.

> + *             9       7      5               0
> + *             --------------------------------
> + *     offset: | width | type |     index     |
> + *             --------------------------------
> + *
> + * Also, since the lower range <width=0, type={0,1}> has only one
> + * field (VPID), that field is moved to a higher offset (63), leaving
> + * the lower range for non-indexed fields like the VMCS revision.
> + *
> + */
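
As an aside, the encoding described above amounts to something like the
following (the macro name is illustrative, not from the patch):

    /* index: bits 0-4, type: bits 5-6, width: bits 7-8 */
    #define VVMCS_OFFSET(width, type, index) \
        ((((width) & 0x3) << 7) | (((type) & 0x3) << 5) | ((index) & 0x1f))

i.e. a 9-bit index into the u64[512] that fills the 4k page, giving 32
slots per <width, type> pair.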
> +
> +#define VVMCS_REVISION 0x40000001u
> +
> +struct vvmcs_header {
> +    u32 revision;
> +    u32 abort;
> +};

This structure isn't used anywhere in the rest of the patch series.
Oversight?

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-08 15:22 ` [PATCH 06/16] vmx: nest: handling VMX instruction exits Qing He
  2010-09-10  7:05   ` Dong, Eddie
@ 2010-09-13 11:10   ` Tim Deegan
  2010-09-15  4:55     ` Dong, Eddie
  1 sibling, 1 reply; 68+ messages in thread
From: Tim Deegan @ 2010-09-13 11:10 UTC (permalink / raw)
  To: Qing He; +Cc: Eddie, xen-devel, Dong

At 16:22 +0100 on 08 Sep (1283962934), Qing He wrote:
> diff -r f1c1d3077337 xen/arch/x86/hvm/vmx/nest.c
> +/*
> + * VMX instructions support functions
> + */
> +
> +enum vmx_regs_enc {
> +    VMX_REG_RAX,
> +    VMX_REG_RCX,
> +    VMX_REG_RDX,
> +    VMX_REG_RBX,
> +    VMX_REG_RSP,
> +    VMX_REG_RBP,
> +    VMX_REG_RSI,
> +    VMX_REG_RDI,
> +#ifdef CONFIG_X86_64
> +    VMX_REG_R8,
> +    VMX_REG_R9,
> +    VMX_REG_R10,
> +    VMX_REG_R11,
> +    VMX_REG_R12,
> +    VMX_REG_R13,
> +    VMX_REG_R14,
> +    VMX_REG_R15,
> +#endif
> +};
> +
> +enum vmx_sregs_enc {
> +    VMX_SREG_ES,
> +    VMX_SREG_CS,
> +    VMX_SREG_SS,
> +    VMX_SREG_DS,
> +    VMX_SREG_FS,
> +    VMX_SREG_GS,
> +};
> +
> +enum x86_segment sreg_to_index[] = {
> +    [VMX_SREG_ES] = x86_seg_es,
> +    [VMX_SREG_CS] = x86_seg_cs,
> +    [VMX_SREG_SS] = x86_seg_ss,
> +    [VMX_SREG_DS] = x86_seg_ds,
> +    [VMX_SREG_FS] = x86_seg_fs,
> +    [VMX_SREG_GS] = x86_seg_gs,
> +};

Since you dislike adding new namespaces and translations, I'm sure you
can get rid of these. :)  It might even simplify some of the macros
below. 

> +union vmx_inst_info {
> +    struct {
> +        unsigned int scaling           :2; /* bit 0-1 */
> +        unsigned int __rsvd0           :1; /* bit 2 */
> +        unsigned int reg1              :4; /* bit 3-6 */
> +        unsigned int addr_size         :3; /* bit 7-9 */
> +        unsigned int memreg            :1; /* bit 10 */
> +        unsigned int __rsvd1           :4; /* bit 11-14 */
> +        unsigned int segment           :3; /* bit 15-17 */
> +        unsigned int index_reg         :4; /* bit 18-21 */
> +        unsigned int index_reg_invalid :1; /* bit 22 */
> +        unsigned int base_reg          :4; /* bit 23-26 */
> +        unsigned int base_reg_invalid  :1; /* bit 27 */
> +        unsigned int reg2              :4; /* bit 28-31 */
> +    } fields;
> +    u32 word;
> +};
> +
> +struct vmx_inst_decoded {
> +#define VMX_INST_MEMREG_TYPE_MEMORY 0
> +#define VMX_INST_MEMREG_TYPE_REG    1
> +    int type;
> +    union {
> +        struct {
> +            unsigned long mem;
> +            unsigned int  len;
> +        };
> +        enum vmx_regs_enc reg1;
> +    };
> +
> +    enum vmx_regs_enc reg2;
> +};
> +
> +enum vmx_ops_result {
> +    VMSUCCEED,
> +    VMFAIL_VALID,
> +    VMFAIL_INVALID,
> +};
> +
> +#define CASE_SET_REG(REG, reg)      \
> +    case VMX_REG_ ## REG: regs->reg = value; break
> +#define CASE_GET_REG(REG, reg)      \
> +    case VMX_REG_ ## REG: value = regs->reg; break
> +
> +#define CASE_EXTEND_SET_REG         \
> +    CASE_EXTEND_REG(S)
> +#define CASE_EXTEND_GET_REG         \
> +    CASE_EXTEND_REG(G)
> +
> +#ifdef __i386__
> +#define CASE_EXTEND_REG(T)
> +#else
> +#define CASE_EXTEND_REG(T)          \
> +    CASE_ ## T ## ET_REG(R8, r8);   \
> +    CASE_ ## T ## ET_REG(R9, r9);   \
> +    CASE_ ## T ## ET_REG(R10, r10); \
> +    CASE_ ## T ## ET_REG(R11, r11); \
> +    CASE_ ## T ## ET_REG(R12, r12); \
> +    CASE_ ## T ## ET_REG(R13, r13); \
> +    CASE_ ## T ## ET_REG(R14, r14); \
> +    CASE_ ## T ## ET_REG(R15, r15)
> +#endif
> +
> +static unsigned long reg_read(struct cpu_user_regs *regs,
> +                              enum vmx_regs_enc index)
> +{
> +    unsigned long value = 0;
> +
> +    switch ( index ) {
> +    CASE_GET_REG(RAX, eax);
> +    CASE_GET_REG(RCX, ecx);
> +    CASE_GET_REG(RDX, edx);
> +    CASE_GET_REG(RBX, ebx);
> +    CASE_GET_REG(RBP, ebp);
> +    CASE_GET_REG(RSI, esi);
> +    CASE_GET_REG(RDI, edi);
> +    CASE_GET_REG(RSP, esp);
> +    CASE_EXTEND_GET_REG;
> +    default:
> +        break;
> +    }
> +
> +    return value;
> +}
> +
> +static void reg_write(struct cpu_user_regs *regs,
> +                      enum vmx_regs_enc index,
> +                      unsigned long value)
> +{
> +    switch ( index ) {
> +    CASE_SET_REG(RAX, eax);
> +    CASE_SET_REG(RCX, ecx);
> +    CASE_SET_REG(RDX, edx);
> +    CASE_SET_REG(RBX, ebx);
> +    CASE_SET_REG(RBP, ebp);
> +    CASE_SET_REG(RSI, esi);
> +    CASE_SET_REG(RDI, edi);
> +    CASE_SET_REG(RSP, esp);
> +    CASE_EXTEND_SET_REG;
> +    default:
> +        break;
> +    }
> +}
> +
> +static int decode_vmx_inst(struct cpu_user_regs *regs,
> +                           struct vmx_inst_decoded *decode)
> +{
> +    struct vcpu *v = current;
> +    union vmx_inst_info info;
> +    struct segment_register seg;
> +    unsigned long base, index, seg_base, disp, offset;
> +    int scale;
> +
> +    info.word = __vmread(VMX_INSTRUCTION_INFO);
> +
> +    if ( info.fields.memreg ) {
> +        decode->type = VMX_INST_MEMREG_TYPE_REG;
> +        decode->reg1 = info.fields.reg1;
> +    }
> +    else
> +    {
> +        decode->type = VMX_INST_MEMREG_TYPE_MEMORY;
> +        hvm_get_segment_register(v, sreg_to_index[info.fields.segment], &seg);
> +        /* TODO: segment type check */

Indeed. :)

> +        seg_base = seg.base;
> +
> +        base = info.fields.base_reg_invalid ? 0 :
> +            reg_read(regs, info.fields.base_reg);
> +
> +        index = info.fields.index_reg_invalid ? 0 :
> +            reg_read(regs, info.fields.index_reg);
> +
> +        scale = 1 << info.fields.scaling;
> +
> +        disp = __vmread(EXIT_QUALIFICATION);
> +
> +        offset = base + index * scale + disp;
> +        if ( offset > seg.limit )

DYM if ( offset + len > limit )?

Would it be worth trying to fold this into the existing x86_emulate
code, which already has careful memory access checks?
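
Something like the following, checking the last byte of the access as
well as the first (a sketch, untested; decode->len would have to be
computed before the check rather than after it):

    unsigned int len = 1 << (info.fields.addr_size + 1);

    if ( offset + len - 1 > seg.limit )
        goto gp_fault;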

> +            goto gp_fault;
> +
> +        decode->mem = seg_base + base + index * scale + disp;
> +        decode->len = 1 << (info.fields.addr_size + 1);
> +    }
> +
> +    decode->reg2 = info.fields.reg2;
> +
> +    return X86EMUL_OKAY;
> +
> +gp_fault:
> +    hvm_inject_exception(TRAP_gp_fault, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +static int vmx_inst_check_privilege(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct segment_register cs;
> +
> +    hvm_get_segment_register(v, x86_seg_cs, &cs);
> +
> +    if ( !(v->arch.hvm_vcpu.guest_cr[0] & X86_CR0_PE) ||
> +         !(v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_VMXE) ||
> +         (regs->eflags & X86_EFLAGS_VM) ||
> +         (hvm_long_mode_enabled(v) && cs.attr.fields.l == 0) )
> +        goto invalid_op;
> +
> +    if ( (cs.sel & 3) > 0 )
> +        goto gp_fault;
> +
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +
> +gp_fault:
> +    hvm_inject_exception(TRAP_gp_fault, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +static void vmreturn(struct cpu_user_regs *regs, enum vmx_ops_result res)
> +{
> +    unsigned long eflags = regs->eflags;
> +    unsigned long mask = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
> +                         X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF;
> +
> +    eflags &= ~mask;
> +
> +    switch ( res ) {
> +    case VMSUCCEED:
> +        break;
> +    case VMFAIL_VALID:
> +        /* TODO: error number, useful for guest VMM debugging */
> +        eflags |= X86_EFLAGS_ZF;
> +        break;
> +    case VMFAIL_INVALID:
> +    default:
> +        eflags |= X86_EFLAGS_CF;
> +        break;
> +    }
> +
> +    regs->eflags = eflags;
> +}
> +
> +static int __clear_current_vvmcs(struct vmx_nest_struct *nest)
> +{
> +    int rc;
> +
> +    if ( nest->svmcs )
> +        __vmpclear(virt_to_maddr(nest->svmcs));
> +
> +#if !CONFIG_VVMCS_MAPPING
> +    rc = hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs, PAGE_SIZE);
> +    if ( rc != HVMCOPY_okay )
> +        return X86EMUL_EXCEPTION;
> +#endif
> +
> +    nest->vmcs_valid = 0;
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +/*
> + * VMX instructions handling
> + */
> +
> +int vmx_nest_handle_vmxon(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    struct vmx_inst_decoded decode;
> +    unsigned long gpa = 0;
> +    int rc;
> +
> +    if ( !is_nested_avail(v->domain) )
> +        goto invalid_op;
> + 
> +    rc = vmx_inst_check_privilege(regs);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;

I think you could fold these checks and the goto target into
decode_vmx_inst(), just to avoid repeating them in every
vmx_nest_handle_* function. 
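
E.g. a common prologue along these lines (a sketch; the function name is
made up):

    static int vmx_inst_prologue(struct cpu_user_regs *regs,
                                 struct vmx_inst_decoded *decode)
    {
        int rc = vmx_inst_check_privilege(regs);

        if ( rc == X86EMUL_OKAY )
            rc = decode_vmx_inst(regs, decode);

        return rc;
    }

so that each vmx_nest_handle_* function makes a single call and keeps a
single error path.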

> +    rc = decode_vmx_inst(regs, &decode);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> +    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
> +    if ( rc != HVMCOPY_okay )
> +        return X86EMUL_EXCEPTION;
> +
> +    nest->guest_vmxon_pa = gpa;
> +    nest->gvmcs_pa = 0;
> +    nest->vmcs_valid = 0;
> +#if !CONFIG_VVMCS_MAPPING
> +    nest->vvmcs = alloc_xenheap_page();
> +    if ( !nest->vvmcs )
> +    {
> +        gdprintk(XENLOG_ERR, "nest: allocation for virtual vmcs failed\n");
> +        vmreturn(regs, VMFAIL_INVALID);
> +        goto out;
> +    }
> +#endif
> +    nest->svmcs = alloc_xenheap_page();
> +    if ( !nest->svmcs )
> +    {
> +        gdprintk(XENLOG_ERR, "nest: allocation for shadow vmcs failed\n");
> +        free_xenheap_page(nest->vvmcs);
> +        vmreturn(regs, VMFAIL_INVALID);
> +        goto out;
> +    }
> +
> +    /*
> +     * `fork' the host vmcs to shadow_vmcs
> +     * vmcs_lock is not needed since we are on current
> +     */
> +    nest->hvmcs = v->arch.hvm_vmx.vmcs;
> +    __vmpclear(virt_to_maddr(nest->hvmcs));
> +    memcpy(nest->svmcs, nest->hvmcs, PAGE_SIZE);
> +    __vmptrld(virt_to_maddr(nest->hvmcs));
> +    v->arch.hvm_vmx.launched = 0;
> +
> +    vmreturn(regs, VMSUCCEED);
> +
> +out:
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    int rc;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    rc = vmx_inst_check_privilege(regs);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    nest->guest_vmxon_pa = 0;
> +    __vmpclear(virt_to_maddr(nest->svmcs));
> +
> +#if !CONFIG_VVMCS_MAPPING
> +    free_xenheap_page(nest->vvmcs);
> +#endif
> +    free_xenheap_page(nest->svmcs);

These also need to be freed on domain teardown.
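
I.e. something like the following, called from the vcpu destruction path
(a sketch; the function name is made up, and free_xenheap_page() is safe
to call with NULL):

    void vmx_nest_vcpu_destroy(struct vcpu *v)
    {
        struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;

    #if !CONFIG_VVMCS_MAPPING
        free_xenheap_page(nest->vvmcs);
    #endif
        /* (the CONFIG_VVMCS_MAPPING case would want
         * unmap_domain_page_global() here instead) */
        free_xenheap_page(nest->svmcs);
        nest->vvmcs = NULL;
        nest->svmcs = NULL;
    }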

> +    vmreturn(regs, VMSUCCEED);
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_handle_vmptrld(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_inst_decoded decode;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    unsigned long gpa = 0;
> +    int rc;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    rc = vmx_inst_check_privilege(regs);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    rc = decode_vmx_inst(regs, &decode);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> +    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
> +    if ( rc != HVMCOPY_okay )
> +        return X86EMUL_EXCEPTION;
> +
> +    if ( gpa == nest->guest_vmxon_pa || gpa & 0xfff )
> +    {
> +        vmreturn(regs, VMFAIL_INVALID);
> +        goto out;
> +    }
> +
> +    if ( nest->gvmcs_pa != gpa )
> +    {
> +        if ( nest->vmcs_valid )
> +        {
> +            rc = __clear_current_vvmcs(nest);
> +            if ( rc != X86EMUL_OKAY )
> +                return rc;
> +        }
> +        nest->gvmcs_pa = gpa;
> +        ASSERT(nest->vmcs_valid == 0);
> +    }
> +
> +
> +    if ( !nest->vmcs_valid )
> +    {
> +#if CONFIG_VVMCS_MAPPING
> +        unsigned long mfn;
> +        p2m_type_t p2mt;
> +
> +        mfn = mfn_x(gfn_to_mfn(p2m_get_hostp2m(v->domain),
> +                               nest->gvmcs_pa >> PAGE_SHIFT, &p2mt));
> +        nest->vvmcs = map_domain_page_global(mfn);
> +#else
> +        rc = hvm_copy_from_guest_phys(nest->vvmcs, nest->gvmcs_pa, PAGE_SIZE);
> +        if ( rc != HVMCOPY_okay )
> +            return X86EMUL_EXCEPTION;
> +#endif
> +        nest->vmcs_valid = 1;
> +    }
> +
> +    vmreturn(regs, VMSUCCEED);
> +
> +out:
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_handle_vmptrst(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_inst_decoded decode;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    unsigned long gpa = 0;
> +    int rc;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    rc = vmx_inst_check_privilege(regs);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    rc = decode_vmx_inst(regs, &decode);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> +
> +    gpa = nest->gvmcs_pa;
> +
> +    rc = hvm_copy_to_guest_virt(decode.mem, &gpa, decode.len, 0);
> +    if ( rc != HVMCOPY_okay )
> +        return X86EMUL_EXCEPTION;
> +
> +    vmreturn(regs, VMSUCCEED);
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_handle_vmclear(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_inst_decoded decode;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    unsigned long gpa = 0;
> +    int rc;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    rc = vmx_inst_check_privilege(regs);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    rc = decode_vmx_inst(regs, &decode);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
> +    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);

Is it guaranteed that decode.len is always <= sizeof gpa here, even with
a malicious guest?
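
A defensive clamp before the copy would be something like this sketch
(whether to clamp silently or fail the instruction instead is a separate
question):

    if ( decode.len > sizeof(gpa) )
        decode.len = sizeof(gpa);

addr_size is taken straight from the instruction-information field, so a
value above 2 would otherwise make decode.len larger than the 8-byte gpa
variable on the stack.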

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-10  7:05   ` Dong, Eddie
@ 2010-09-13 11:11     ` Tim Deegan
  2010-09-13 14:29       ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Tim Deegan @ 2010-09-13 11:11 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: xen-devel, He, Qing

At 08:05 +0100 on 10 Sep (1284105901), Dong, Eddie wrote:
> Qing He wrote:
> > +static int __clear_current_vvmcs(struct vmx_nest_struct *nest)
> > +{
> > +    int rc;
> > +
> > +    if ( nest->svmcs )
> > +        __vmpclear(virt_to_maddr(nest->svmcs));
> > +
> > +#if !CONFIG_VVMCS_MAPPING
> > +    rc = hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs,
> 
> 
> Qing:
> Why might this fail? The only possible reason would be nest->gvmcs_pa, but I think we already verified that address.
> 

It was verified at load time, but the guest could have ballooned it out
in the meantime. 

Cheers,

Tim

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 15/16] vmx: nest: capability reporting MSRs
  2010-09-08 15:22 ` [PATCH 15/16] vmx: nest: capability reporting MSRs Qing He
@ 2010-09-13 12:45   ` Tim Deegan
  2010-09-15 10:05   ` Christoph Egger
  1 sibling, 0 replies; 68+ messages in thread
From: Tim Deegan @ 2010-09-13 12:45 UTC (permalink / raw)
  To: Qing He; +Cc: Eddie, xen-devel, Dong

At 16:22 +0100 on 08 Sep (1283962943), Qing He wrote:
> handles VMX capability reporting MSRs.
> Some features are masked so L1 would see a rather
> simple configuration

As I said last time, would it be better to whitelist features that we
know are safely virtualized?

> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> 
> ---
> 
> diff -r 694dcf6c3f06 xen/arch/x86/hvm/vmx/nest.c
> --- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 19:47:14 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 19:47:39 2010 +0800
> @@ -1352,3 +1352,91 @@
>  
>      return bypass_l0;
>  }
> +
> +/*
> + * Capability reporting
> + */
> +int vmx_nest_msr_read_intercept(unsigned int msr, u64 *msr_content)
> +{
> +    u32 eax, edx;
> +    u64 data = 0;
> +    int r = 1;
> +    u32 mask = 0;
> +
> +    if ( !is_nested_avail(current->domain) )
> +        return 0;
> +
> +    switch (msr) {
> +    case MSR_IA32_VMX_BASIC:
> +        rdmsr(msr, eax, edx);
> +        data = edx;
> +        data = (data & ~0x1fff) | 0x1000;     /* request 4KB for guest VMCS */
> +        data &= ~(1 << 23);                   /* disable TRUE_xxx_CTLS */

Magic number - please use a macro to define it. 
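
E.g. (the macro name is illustrative):

    #define VMX_BASIC_TRUE_CTLS (1u << 23)  /* bit 55 of the MSR */

    data &= ~VMX_BASIC_TRUE_CTLS;           /* disable TRUE_xxx_CTLS */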

> +        data = (data << 32) | VVMCS_REVISION; /* VVMCS revision */
> +        break;
> +    case MSR_IA32_VMX_PINBASED_CTLS:
> +#define REMOVED_PIN_CONTROL_CAP (PIN_BASED_PREEMPT_TIMER)

You define this mask but don't actually mask anything with it.
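
Presumably the intent is to clear the corresponding allowed-1 bits in the
high dword of the reported value, something like:

    rdmsrl(msr, data);
    data &= ~((u64)REMOVED_PIN_CONTROL_CAP << 32);

since for the VMX capability MSRs the high 32 bits are the allowed-1
settings.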

Cheers, 

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 00/16] Nested virtualization for VMX
  2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
                   ` (15 preceding siblings ...)
  2010-09-08 15:22 ` [PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE Qing He
@ 2010-09-13 13:10 ` Tim Deegan
  16 siblings, 0 replies; 68+ messages in thread
From: Tim Deegan @ 2010-09-13 13:10 UTC (permalink / raw)
  To: Qing He; +Cc: xen-devel, eddie.dong

Hi, 

Thanks for this updated patchset, and for addressing a lot of what I
mentioned last time.  I've made a few comments on individual patches;
since the nested EPT changes aren't in the series this time I don't
think any of it needs an explicit ack from me with my x86-mm-maintainer
hat on.

Apart from that I'll say again that some common code should be found
between these and the SVM patches, if that's possible.  Not necessarily
the interface that Christoph is proposing, but since he's gone to the
effort it seems like a good starting point.

I would certainly like to see common interrupt-routing changes next time
these patches are posted, even if we're still discussing the vmexit
path.

Cheers,

Tim.

At 16:22 +0100 on 08 Sep (1283962928), Qing He wrote:
> This patch set is the upgraded version of nested
> virtualization for VMX, that allows a VMX guest (L1) to
> run other VMX guests (L2).
> 
> The nested virtualization for vmx is built on homogeneous
> L1 and L2 for better performance and minimal emulation,
> the common code involved is small and contained in patch
> 03/16, for two flags, one for feature availability, the
> other for indicating current mode.
> 
> The userspace components (xend/xm/xl) is not included since
> Christopher's userspace patch has similar coverage. vEPT is
> not included as well because it's still WIP.
> 
> Major changes to last version:
>  - address Tim's comments on error handling and others
>  - split context switch into smaller pieces with certain
>    restructure for better readability
>  - update interrupt handling, rewrite comments
>  - move cpuid into userspace
>  - etc.
> 
> The patch set includes the following patches.
> 
> [PATCH 01/16] vmx: nest: rename host_vmcs
> [PATCH 02/16] vmx: nest: wrapper for control update
> [PATCH 03/16] vmx: nest: nested availability and status flags
> [PATCH 04/16] vmx: nest: nested control structure
> [PATCH 05/16] vmx: nest: virtual vmcs layout
> [PATCH 06/16] vmx: nest: handling VMX instruction exits
> [PATCH 07/16] vmx: nest: switch current vmcs
> [PATCH 08/16] vmx: nest: vmresume/vmlaunch
> [PATCH 09/16] vmx: nest: shadow controls
> [PATCH 10/16] vmx: nest: L1 <-> L2 context switch
> [PATCH 11/16] vmx: nest: interrupt handling
> [PATCH 12/16] vmx: nest: VMExit handler in L2
> [PATCH 13/16] vmx: nest: L2 tsc
> [PATCH 14/16] vmx: nest: CR0.TS and #NM
> [PATCH 15/16] vmx: nest: capability reporting MSRs
> [PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE
> 
> Thanks,
> Qing He
> 

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-13 11:11     ` Tim Deegan
@ 2010-09-13 14:29       ` Dong, Eddie
  2010-09-13 14:46         ` Tim Deegan
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-13 14:29 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Dong, Eddie, He, Qing

Tim Deegan wrote:
> At 08:05 +0100 on 10 Sep (1284105901), Dong, Eddie wrote:
>> Qing He wrote:
>>> +static int __clear_current_vvmcs(struct vmx_nest_struct *nest) +{
>>> +    int rc;
>>> +
>>> +    if ( nest->svmcs )
>>> +        __vmpclear(virt_to_maddr(nest->svmcs));
>>> +
>>> +#if !CONFIG_VVMCS_MAPPING
>>> +    rc = hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs,
>> 
>> 
>> Qing:
>> Why might this fail? The only possible reason would be nest->gvmcs_pa, but
>> I think we already verified that address.
>> 
> 
> It was verified at load time, but the guest could have ballooned it
> out in the meantime.

If the L1 guest allocated that GPA as VMCS memory, it can't balloon them out.
If L1 is a mallicious guest and ballooned the VMCS memory out, it is worthy to do. Not?

> 
> Cheers,
> 

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-13 14:29       ` Dong, Eddie
@ 2010-09-13 14:46         ` Tim Deegan
  0 siblings, 0 replies; 68+ messages in thread
From: Tim Deegan @ 2010-09-13 14:46 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: xen-devel, He, Qing

At 15:29 +0100 on 13 Sep (1284391777), Dong, Eddie wrote:
> Tim Deegan wrote:
> > At 08:05 +0100 on 10 Sep (1284105901), Dong, Eddie wrote:
> >> Qing He wrote:
> >>> +static int __clear_current_vvmcs(struct vmx_nest_struct *nest) +{
> >>> +    int rc;
> >>> +
> >>> +    if ( nest->svmcs )
> >>> +        __vmpclear(virt_to_maddr(nest->svmcs));
> >>> +
> >>> +#if !CONFIG_VVMCS_MAPPING
> >>> +    rc = hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs,
> >> 
> >> 
> >> Qing:
> >> Why might this fail? The only possible reason would be nest->gvmcs_pa, but
> >> I think we already verified that address.
> >> 
> > 
> > It was verified at load time, but the guest could have ballooned it
> > out in the meantime.
> 
> If the L1 guest allocated that GPA as VMCS memory, it can't balloon it out.
> If L1 is a malicious guest and ballooned the VMCS memory out, is it even worth handling? No?
> 

Yes, in this case it looks like there's probably no harm in ignoring a
failure, but it seems reasonable to handle it.

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-13 11:10   ` Tim Deegan
@ 2010-09-15  4:55     ` Dong, Eddie
  2010-09-15  6:40       ` Keir Fraser
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15  4:55 UTC (permalink / raw)
  To: Tim Deegan, He, Qing; +Cc: xen-devel, Dong, Eddie

Tim Deegan wrote:
> At 16:22 +0100 on 08 Sep (1283962934), Qing He wrote:
>> diff -r f1c1d3077337 xen/arch/x86/hvm/vmx/nest.c
>> +/*
>> + * VMX instructions support functions
>> + */
>> +
>> +enum vmx_regs_enc {
>> +    VMX_REG_RAX,
>> +    VMX_REG_RCX,
>> +    VMX_REG_RDX,
>> +    VMX_REG_RBX,
>> +    VMX_REG_RSP,
>> +    VMX_REG_RBP,
>> +    VMX_REG_RSI,
>> +    VMX_REG_RDI,
>> +#ifdef CONFIG_X86_64
>> +    VMX_REG_R8,
>> +    VMX_REG_R9,
>> +    VMX_REG_R10,
>> +    VMX_REG_R11,
>> +    VMX_REG_R12,
>> +    VMX_REG_R13,
>> +    VMX_REG_R14,
>> +    VMX_REG_R15,
>> +#endif
>> +};
>> +
>> +enum vmx_sregs_enc {
>> +    VMX_SREG_ES,
>> +    VMX_SREG_CS,
>> +    VMX_SREG_SS,
>> +    VMX_SREG_DS,
>> +    VMX_SREG_FS,
>> +    VMX_SREG_GS,
>> +};
>> +
>> +enum x86_segment sreg_to_index[] = {
>> +    [VMX_SREG_ES] = x86_seg_es,
>> +    [VMX_SREG_CS] = x86_seg_cs,
>> +    [VMX_SREG_SS] = x86_seg_ss,
>> +    [VMX_SREG_DS] = x86_seg_ds,
>> +    [VMX_SREG_FS] = x86_seg_fs,
>> +    [VMX_SREG_GS] = x86_seg_gs,
>> +};
>
> Since you dislike adding new namespaces and translations, I'm sure you
> can get rid of these. :)  It might even simplify some of the macros
> below.

True, there is some duplication here. We can reuse the following definition from x86_emulate.c.


static enum x86_segment
decode_segment(uint8_t modrm_reg)
{
    switch ( modrm_reg )
    {
    case 0: return x86_seg_es;
    case 1: return x86_seg_cs;
    case 2: return x86_seg_ss;
    case 3: return x86_seg_ds;
    case 4: return x86_seg_fs;
    case 5: return x86_seg_gs;
    default: break;
    }
    return decode_segment_failed;
}

BTW, should we use macros rather than magic numbers here? And can we modify the x86_segment enum to use the same naming space?

>
>> +union vmx_inst_info {
>> +    struct {
>> +        unsigned int scaling           :2; /* bit 0-1 */
>> +        unsigned int __rsvd0           :1; /* bit 2 */
>> +        unsigned int reg1              :4; /* bit 3-6 */
>> +        unsigned int addr_size         :3; /* bit 7-9 */
>> +        unsigned int memreg            :1; /* bit 10 */
>> +        unsigned int __rsvd1           :4; /* bit 11-14 */
>> +        unsigned int segment           :3; /* bit 15-17 */
>> +        unsigned int index_reg         :4; /* bit 18-21 */
>> +        unsigned int index_reg_invalid :1; /* bit 22 */
>> +        unsigned int base_reg          :4; /* bit 23-26 */
>> +        unsigned int base_reg_invalid  :1; /* bit 27 */
>> +        unsigned int reg2              :4; /* bit 28-31 */ +    }
>> fields; +    u32 word;
>> +};
>> +
>> +struct vmx_inst_decoded {
>> +#define VMX_INST_MEMREG_TYPE_MEMORY 0
>> +#define VMX_INST_MEMREG_TYPE_REG    1
>> +    int type;
>> +    union {
>> +        struct {
>> +            unsigned long mem;
>> +            unsigned int  len;
>> +        };
>> +        enum vmx_regs_enc reg1;
>> +    };
>> +
>> +    enum vmx_regs_enc reg2;
>> +};
>> +
>> +enum vmx_ops_result {
>> +    VMSUCCEED,
>> +    VMFAIL_VALID,
>> +    VMFAIL_INVALID,
>> +};
>> +
>> +#define CASE_SET_REG(REG, reg)      \
>> +    case VMX_REG_ ## REG: regs->reg = value; break
>> +#define CASE_GET_REG(REG, reg)      \
>> +    case VMX_REG_ ## REG: value = regs->reg; break +
>> +#define CASE_EXTEND_SET_REG         \
>> +    CASE_EXTEND_REG(S)
>> +#define CASE_EXTEND_GET_REG         \
>> +    CASE_EXTEND_REG(G)
>> +
>> +#ifdef __i386__
>> +#define CASE_EXTEND_REG(T)
>> +#else
>> +#define CASE_EXTEND_REG(T)          \
>> +    CASE_ ## T ## ET_REG(R8, r8);   \
>> +    CASE_ ## T ## ET_REG(R9, r9);   \
>> +    CASE_ ## T ## ET_REG(R10, r10); \
>> +    CASE_ ## T ## ET_REG(R11, r11); \
>> +    CASE_ ## T ## ET_REG(R12, r12); \
>> +    CASE_ ## T ## ET_REG(R13, r13); \
>> +    CASE_ ## T ## ET_REG(R14, r14); \
>> +    CASE_ ## T ## ET_REG(R15, r15)
>> +#endif
>> +
>> +static unsigned long reg_read(struct cpu_user_regs *regs,
>> +                              enum vmx_regs_enc index) +{
>> +    unsigned long value = 0;
>> +
>> +    switch ( index ) {
>> +    CASE_GET_REG(RAX, eax);
>> +    CASE_GET_REG(RCX, ecx);
>> +    CASE_GET_REG(RDX, edx);
>> +    CASE_GET_REG(RBX, ebx);
>> +    CASE_GET_REG(RBP, ebp);
>> +    CASE_GET_REG(RSI, esi);
>> +    CASE_GET_REG(RDI, edi);
>> +    CASE_GET_REG(RSP, esp);
>> +    CASE_EXTEND_GET_REG;
>> +    default:
>> +        break;
>> +    }
>> +
>> +    return value;
>> +}
>> +
>> +static void reg_write(struct cpu_user_regs *regs,
>> +                      enum vmx_regs_enc index,
>> +                      unsigned long value)
>> +{
>> +    switch ( index ) {
>> +    CASE_SET_REG(RAX, eax);
>> +    CASE_SET_REG(RCX, ecx);
>> +    CASE_SET_REG(RDX, edx);
>> +    CASE_SET_REG(RBX, ebx);
>> +    CASE_SET_REG(RBP, ebp);
>> +    CASE_SET_REG(RSI, esi);
>> +    CASE_SET_REG(RDI, edi);
>> +    CASE_SET_REG(RSP, esp);
>> +    CASE_EXTEND_SET_REG;
>> +    default:
>> +        break;
>> +    }
>> +}
>> +
>> +static int decode_vmx_inst(struct cpu_user_regs *regs,
>> +                           struct vmx_inst_decoded *decode) +{
>> +    struct vcpu *v = current;
>> +    union vmx_inst_info info;
>> +    struct segment_register seg;
>> +    unsigned long base, index, seg_base, disp, offset; +    int
>> scale; +
>> +    info.word = __vmread(VMX_INSTRUCTION_INFO);
>> +
>> +    if ( info.fields.memreg ) {
>> +        decode->type = VMX_INST_MEMREG_TYPE_REG;
>> +        decode->reg1 = info.fields.reg1;
>> +    }
>> +    else
>> +    {
>> +        decode->type = VMX_INST_MEMREG_TYPE_MEMORY;
>> +        hvm_get_segment_register(v,
>> sreg_to_index[info.fields.segment], &seg); +        /* TODO: segment
>> type check */
>
> Indeed. :)
>
>> +        seg_base = seg.base;
>> +
>> +        base = info.fields.base_reg_invalid ? 0 :
>> +            reg_read(regs, info.fields.base_reg);
>> +
>> +        index = info.fields.index_reg_invalid ? 0 :
>> +            reg_read(regs, info.fields.index_reg); +
>> +        scale = 1 << info.fields.scaling;
>> +
>> +        disp = __vmread(EXIT_QUALIFICATION);
>> +
>> +        offset = base + index * scale + disp;
>> +        if ( offset > seg.limit )
>
> DYM if ( offset + len > limit )?
>
> Would it be worth trying to fold this into the existing x86_emulate
> code, which already has careful memory access checks?


Can you give more details? Do you mean reconstructing hvm_virtual_to_linear_addr?


>
>> +            goto gp_fault;
>> +
>> +        decode->mem = seg_base + base + index * scale + disp;
>> +        decode->len = 1 << (info.fields.addr_size + 1); +    }
>> +
>> +    decode->reg2 = info.fields.reg2;
>> +
>> +    return X86EMUL_OKAY;
>> +
>> +gp_fault:
>> +    hvm_inject_exception(TRAP_gp_fault, 0, 0);
>> +    return X86EMUL_EXCEPTION;
>> +}
>> +
>> +static int vmx_inst_check_privilege(struct cpu_user_regs *regs) +{
>> +    struct vcpu *v = current;
>> +    struct segment_register cs;
>> +
>> +    hvm_get_segment_register(v, x86_seg_cs, &cs);
>> +
>> +    if ( !(v->arch.hvm_vcpu.guest_cr[0] & X86_CR0_PE) ||
>> +         !(v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_VMXE) ||
>> +         (regs->eflags & X86_EFLAGS_VM) ||
>> +         (hvm_long_mode_enabled(v) && cs.attr.fields.l == 0) ) +
>> goto invalid_op; +
>> +    if ( (cs.sel & 3) > 0 )
>> +        goto gp_fault;
>> +
>> +    return X86EMUL_OKAY;
>> +
>> +invalid_op:
>> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
>> +    return X86EMUL_EXCEPTION;
>> +
>> +gp_fault:
>> +    hvm_inject_exception(TRAP_gp_fault, 0, 0);
>> +    return X86EMUL_EXCEPTION;
>> +}
>> +
>> +static void vmreturn(struct cpu_user_regs *regs, enum
>> vmx_ops_result res) +{ +    unsigned long eflags = regs->eflags;
>> +    unsigned long mask = X86_EFLAGS_CF | X86_EFLAGS_PF |
>> X86_EFLAGS_AF | +                         X86_EFLAGS_ZF |
>> X86_EFLAGS_SF | X86_EFLAGS_OF; + +    eflags &= ~mask;
>> +
>> +    switch ( res ) {
>> +    case VMSUCCEED:
>> +        break;
>> +    case VMFAIL_VALID:
>> +        /* TODO: error number, useful for guest VMM debugging */
>> +        eflags |= X86_EFLAGS_ZF;
>> +        break;
>> +    case VMFAIL_INVALID:
>> +    default:
>> +        eflags |= X86_EFLAGS_CF;
>> +        break;
>> +    }
>> +
>> +    regs->eflags = eflags;
>> +}
>> +
>> +static int __clear_current_vvmcs(struct vmx_nest_struct *nest) +{
>> +    int rc;
>> +
>> +    if ( nest->svmcs )
>> +        __vmpclear(virt_to_maddr(nest->svmcs));
>> +
>> +#if !CONFIG_VVMCS_MAPPING
>> +    rc = hvm_copy_to_guest_phys(nest->gvmcs_pa, nest->vvmcs,
>> PAGE_SIZE); +    if ( rc != HVMCOPY_okay )
>> +        return X86EMUL_EXCEPTION;
>> +#endif
>> +
>> +    nest->vmcs_valid = 0;
>> +
>> +    return X86EMUL_OKAY;
>> +}
>> +
>> +/*
>> + * VMX instructions handling
>> + */
>> +
>> +int vmx_nest_handle_vmxon(struct cpu_user_regs *regs) +{
>> +    struct vcpu *v = current;
>> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
>> +    struct vmx_inst_decoded decode;
>> +    unsigned long gpa = 0;
>> +    int rc;
>> +
>> +    if ( !is_nested_avail(v->domain) )
>> +        goto invalid_op;
>> +
>> +    rc = vmx_inst_check_privilege(regs);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>
> I think you could fold these checks and the goto target into
> decode_vmx_inst(), just to avoid repeating them in every
> vmx_nest_handle_* function.
>
>> +    rc = decode_vmx_inst(regs, &decode);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
>> +    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
>> +    if ( rc != HVMCOPY_okay )
>> +        return X86EMUL_EXCEPTION;
>> +
>> +    nest->guest_vmxon_pa = gpa;
>> +    nest->gvmcs_pa = 0;
>> +    nest->vmcs_valid = 0;
>> +#if !CONFIG_VVMCS_MAPPING
>> +    nest->vvmcs = alloc_xenheap_page();
>> +    if ( !nest->vvmcs )
>> +    {
>> +        gdprintk(XENLOG_ERR, "nest: allocation for virtual vmcs
>> failed\n"); +        vmreturn(regs, VMFAIL_INVALID);
>> +        goto out;
>> +    }
>> +#endif
>> +    nest->svmcs = alloc_xenheap_page();
>> +    if ( !nest->svmcs )
>> +    {
>> +        gdprintk(XENLOG_ERR, "nest: allocation for shadow vmcs
>> failed\n"); +        free_xenheap_page(nest->vvmcs);
>> +        vmreturn(regs, VMFAIL_INVALID);
>> +        goto out;
>> +    }
>> +
>> +    /*
>> +     * `fork' the host vmcs to shadow_vmcs
>> +     * vmcs_lock is not needed since we are on current +     */
>> +    nest->hvmcs = v->arch.hvm_vmx.vmcs;
>> +    __vmpclear(virt_to_maddr(nest->hvmcs));
>> +    memcpy(nest->svmcs, nest->hvmcs, PAGE_SIZE);
>> +    __vmptrld(virt_to_maddr(nest->hvmcs));
>> +    v->arch.hvm_vmx.launched = 0;
>> +
>> +    vmreturn(regs, VMSUCCEED);
>> +
>> +out:
>> +    return X86EMUL_OKAY;
>> +
>> +invalid_op:
>> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
>> +    return X86EMUL_EXCEPTION;
>> +}
>> +
>> +int vmx_nest_handle_vmxoff(struct cpu_user_regs *regs) +{
>> +    struct vcpu *v = current;
>> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest; +    int
>> rc; +
>> +    if ( unlikely(!nest->guest_vmxon_pa) )
>> +        goto invalid_op;
>> +
>> +    rc = vmx_inst_check_privilege(regs);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    nest->guest_vmxon_pa = 0;
>> +    __vmpclear(virt_to_maddr(nest->svmcs));
>> +
>> +#if !CONFIG_VVMCS_MAPPING
>> +    free_xenheap_page(nest->vvmcs);
>> +#endif
>> +    free_xenheap_page(nest->svmcs);
>
> These also need to be freed on domain teardown.
>
>> +    vmreturn(regs, VMSUCCEED);
>> +    return X86EMUL_OKAY;
>> +
>> +invalid_op:
>> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
>> +    return X86EMUL_EXCEPTION;
>> +}
>> +
>> +int vmx_nest_handle_vmptrld(struct cpu_user_regs *regs) +{
>> +    struct vcpu *v = current;
>> +    struct vmx_inst_decoded decode;
>> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest; +
>> unsigned long gpa = 0; +    int rc;
>> +
>> +    if ( unlikely(!nest->guest_vmxon_pa) )
>> +        goto invalid_op;
>> +
>> +    rc = vmx_inst_check_privilege(regs);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    rc = decode_vmx_inst(regs, &decode);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
>> +    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
>> +    if ( rc != HVMCOPY_okay )
>> +        return X86EMUL_EXCEPTION;
>> +
>> +    if ( gpa == nest->guest_vmxon_pa || gpa & 0xfff ) +    {
>> +        vmreturn(regs, VMFAIL_INVALID);
>> +        goto out;
>> +    }
>> +
>> +    if ( nest->gvmcs_pa != gpa )
>> +    {
>> +        if ( nest->vmcs_valid )
>> +        {
>> +            rc = __clear_current_vvmcs(nest);
>> +            if ( rc != X86EMUL_OKAY )
>> +                return rc;
>> +        }
>> +        nest->gvmcs_pa = gpa;
>> +        ASSERT(nest->vmcs_valid == 0);
>> +    }
>> +
>> +
>> +    if ( !nest->vmcs_valid )
>> +    {
>> +#if CONFIG_VVMCS_MAPPING
>> +        unsigned long mfn;
>> +        p2m_type_t p2mt;
>> +
>> +        mfn = mfn_x(gfn_to_mfn(p2m_get_hostp2m(v->domain),
>> +                               nest->gvmcs_pa >> PAGE_SHIFT,
>> &p2mt)); +        nest->vvmcs = map_domain_page_global(mfn); +#else
>> +        rc = hvm_copy_from_guest_phys(nest->vvmcs, nest->gvmcs_pa,
>> PAGE_SIZE); +        if ( rc != HVMCOPY_okay )
>> +            return X86EMUL_EXCEPTION;
>> +#endif
>> +        nest->vmcs_valid = 1;
>> +    }
>> +
>> +    vmreturn(regs, VMSUCCEED);
>> +
>> +out:
>> +    return X86EMUL_OKAY;
>> +
>> +invalid_op:
>> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
>> +    return X86EMUL_EXCEPTION;
>> +}
>> +
>> +int vmx_nest_handle_vmptrst(struct cpu_user_regs *regs) +{
>> +    struct vcpu *v = current;
>> +    struct vmx_inst_decoded decode;
>> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest; +
>> unsigned long gpa = 0; +    int rc;
>> +
>> +    if ( unlikely(!nest->guest_vmxon_pa) )
>> +        goto invalid_op;
>> +
>> +    rc = vmx_inst_check_privilege(regs);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    rc = decode_vmx_inst(regs, &decode);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY); +
>> +    gpa = nest->gvmcs_pa;
>> +
>> +    rc = hvm_copy_to_guest_virt(decode.mem, &gpa, decode.len, 0);
>> +    if ( rc != HVMCOPY_okay )
>> +        return X86EMUL_EXCEPTION;
>> +
>> +    vmreturn(regs, VMSUCCEED);
>> +    return X86EMUL_OKAY;
>> +
>> +invalid_op:
>> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
>> +    return X86EMUL_EXCEPTION;
>> +}
>> +
>> +int vmx_nest_handle_vmclear(struct cpu_user_regs *regs) +{
>> +    struct vcpu *v = current;
>> +    struct vmx_inst_decoded decode;
>> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest; +
>> unsigned long gpa = 0; +    int rc;
>> +
>> +    if ( unlikely(!nest->guest_vmxon_pa) )
>> +        goto invalid_op;
>> +
>> +    rc = vmx_inst_check_privilege(regs);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    rc = decode_vmx_inst(regs, &decode);
>> +    if ( rc != X86EMUL_OKAY )
>> +        return rc;
>> +
>> +    ASSERT(decode.type == VMX_INST_MEMREG_TYPE_MEMORY);
>> +    rc = hvm_copy_from_guest_virt(&gpa, decode.mem, decode.len, 0);
>
> Is it guaranteed that decode.len is always <= sizeof gpa here, even
> with a malicious guest?

Reusing hvm_virtual_to_linear_addr, with the last byte of the access taken into consideration, may be the best choice :)

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  4:55     ` Dong, Eddie
@ 2010-09-15  6:40       ` Keir Fraser
  2010-09-15  6:49         ` Dong, Eddie
  2010-09-15  7:17         ` Qing He
  0 siblings, 2 replies; 68+ messages in thread
From: Keir Fraser @ 2010-09-15  6:40 UTC (permalink / raw)
  To: Dong, Eddie, Tim Deegan, He, Qing; +Cc: xen-devel

On 15/09/2010 05:55, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>>> +enum x86_segment sreg_to_index[] = {
>>> +    [VMX_SREG_ES] = x86_seg_es,
>>> +    [VMX_SREG_CS] = x86_seg_cs,
>>> +    [VMX_SREG_SS] = x86_seg_ss,
>>> +    [VMX_SREG_DS] = x86_seg_ds,
>>> +    [VMX_SREG_FS] = x86_seg_fs,
>>> +    [VMX_SREG_GS] = x86_seg_gs,
>>> +};
>> 
>> Since you dislike adding new namespaces and translations, I'm sure you
>> can get rid of these. :)  It might even simplify some of the macros
>> below.
> 
> True, there is some duplication here. We can reuse the following
> definition from x86_emulate.c.

AFAICS if you must have your own extra instruction decoder, a few register
translation definitions and arrays is the least of it really. I'd rather
keep x86_emulate clean and separate rather than become intertwined with
another emulator.

What is wrong with simply extending x86_emulate to handle these VMX-related
instructions? We've dealt with emulators provided by Intel guys in the past
and frankly they were full of holes.

 -- Keir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  6:40       ` Keir Fraser
@ 2010-09-15  6:49         ` Dong, Eddie
  2010-09-15  7:31           ` Keir Fraser
  2010-09-15  7:17         ` Qing He
  1 sibling, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15  6:49 UTC (permalink / raw)
  To: Keir Fraser, Tim Deegan, He, Qing; +Cc: xen-devel, Dong, Eddie

Keir Fraser wrote:
> On 15/09/2010 05:55, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> 
>>>> +enum x86_segment sreg_to_index[] = {
>>>> +    [VMX_SREG_ES] = x86_seg_es,
>>>> +    [VMX_SREG_CS] = x86_seg_cs,
>>>> +    [VMX_SREG_SS] = x86_seg_ss,
>>>> +    [VMX_SREG_DS] = x86_seg_ds,
>>>> +    [VMX_SREG_FS] = x86_seg_fs,
>>>> +    [VMX_SREG_GS] = x86_seg_gs,
>>>> +};
>>> 
>>> Since you dislike adding new namespaces and translations, I'm sure
>>> you can get rid of these. :)  It might even simplify some of the
>>> macros below.
>> 
>> True, some duplication here. Regarding the following definition in
>> x86_emulate.c, we can reuse it.
> 
> AFAICS if you must have your own extra instruction decoder, a few
> register translation definitions and arrays is the least of it
> really. I'd rather keep x86_emulate clean and separate rather than
> become intertwined with another emulator.
> 
> What is wrong with simply extending x86_emulate to handle these
> VMX-related instructions? We've dealt with emulators provided by
> Intel guys in the past and frankly they were full of holes.
> 
Certainly fine to move the VMX instruction emulation to hvm/emulate.c if you don't think it is VMX specific :)

Will do.

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  6:40       ` Keir Fraser
  2010-09-15  6:49         ` Dong, Eddie
@ 2010-09-15  7:17         ` Qing He
  2010-09-15  7:38           ` Keir Fraser
  1 sibling, 1 reply; 68+ messages in thread
From: Qing He @ 2010-09-15  7:17 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tim Deegan, xen-devel, Dong, Eddie

On Wed, 2010-09-15 at 14:40 +0800, Keir Fraser wrote:
> On 15/09/2010 05:55, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> 
> >>> +enum x86_segment sreg_to_index[] = {
> >>> +    [VMX_SREG_ES] = x86_seg_es,
> >>> +    [VMX_SREG_CS] = x86_seg_cs,
> >>> +    [VMX_SREG_SS] = x86_seg_ss,
> >>> +    [VMX_SREG_DS] = x86_seg_ds,
> >>> +    [VMX_SREG_FS] = x86_seg_fs,
> >>> +    [VMX_SREG_GS] = x86_seg_gs,
> >>> +};
> >> 
> >> Since you dislike adding new namespaces and translations, I'm sure you
> >> can get rid of these. :)  It might even simplify some of the macros
> >> below.
> > 
> > True, some duplication here. Regarding the following definition in
> > x86_emulate.c, we can reuse it.
> 
> AFAICS if you must have your own extra instruction decoder, a few register
> translation definitions and arrays is the least of it really. I'd rather
> keep x86_emulate clean and separate rather than become intertwined with
> another emulator.
> 
> What is wrong with simply extending x86_emulate to handle these VMX-related
> instructions? We've dealt with emulators provided by Intel guys in the past
> and frankly they were full of holes.

That needs additional callbacks for handling the VMCS and state changes,
doesn't it? I'm a little worried that it's too vmx-specific to get
into x86_emulate, and that's why we used a separate decoder in the
first place (I know it's ugly, though).

And if we are to use x86_emulate, is it possible to avoid redecoding the
opcode, because exit reason is already there?

Thanks,
Qing

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  6:49         ` Dong, Eddie
@ 2010-09-15  7:31           ` Keir Fraser
  2010-09-15  8:15             ` Christoph Egger
  0 siblings, 1 reply; 68+ messages in thread
From: Keir Fraser @ 2010-09-15  7:31 UTC (permalink / raw)
  To: Dong, Eddie, Tim Deegan, He, Qing; +Cc: xen-devel

On 15/09/2010 07:49, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>> What is wrong with simply extending x86_emulate to handle these
>> VMX-related instructions? We've dealt with emulators provided by
>> Intel guys in the past and frankly they were full of holes.
>> 
> Certainly fine to move the VMX instruction emulation to hvm/emulate.c if
> you don't think it is VMX specific :)

It's the right place to put all instruction emulation, if at all possible.
You will then presumably require at least one or two call-back hooks to
caller context, at least to read/write VMCS, and that would be the place to
determine whether these VMX instructions are executable. For example, SVM
and PV emulation contexts would either leave the VMX callback hooks as NULL,
and/or there will be checks for is-nested-VMX-guest in the VMX callback
hooks, injecting #UD otherwise.

The main trick with x86_emulate extensions is determining the correct neat
small set of callback hooks to add, which is somewhat driven by deciding
what should be emulated within x86_emulate and what should be left without
for implementation in the caller's context.
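
A rough sketch of the shape I mean (the hook names here are hypothetical,
not an existing interface):

    struct x86_emulate_ctxt;

    struct x86_emulate_ops {
        /* ... the existing read/write/insn_fetch/cmpxchg hooks ... */

        /*
         * VMX-specific additions: SVM and PV callers leave these NULL,
         * and the emulator injects #UD whenever they are absent or the
         * vcpu is not in VMX operation.
         */
        int (*vmcs_read)(unsigned long field, unsigned long *val,
                         struct x86_emulate_ctxt *ctxt);
        int (*vmcs_write)(unsigned long field, unsigned long val,
                          struct x86_emulate_ctxt *ctxt);
    };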

 -- Keir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  7:17         ` Qing He
@ 2010-09-15  7:38           ` Keir Fraser
  2010-09-15  7:56             ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Keir Fraser @ 2010-09-15  7:38 UTC (permalink / raw)
  To: Qing He; +Cc: Tim Deegan, xen-devel, Dong, Eddie

On 15/09/2010 08:17, "Qing He" <qing.he@intel.com> wrote:

>> What is wrong with simply extending x86_emulate to handle these VMX-related
>> instructions? We've dealt with emulators provided by Intel guys in the past
>> and frankly they were full of holes.
> 
> That needs additional callbacks for handling the VMCS and state changes,
> doesn't it? I'm a little worried that it's too vmx-specific to get
> into x86_emulate, and that's why we used a separate decoder in the
> first place (I know it's ugly, though).

A few VMX-specific callbacks would be fine. Extra callbacks are cheap. Just
focus on making the callback interface clean and tidy. I'd *much* rather
have VMX-specific callbacks than an extra emulator.

> And if we are to use x86_emulate, is it possible to avoid redecoding the
> opcode, because exit reason is already there?

If vmexit reason fully decodes the instruction for you then I would agree
that skipping x86_emulate could make sense. And then your instruction
emulator would be really simple and fast -- vmx_io_instruction() is a good
example of this. If you still need to parse the instruction to decode ModRM
and the like, then I don't see that the partial decode from vmexit reason
saves you much at all, and you might as well go the whole hog and do full
decode. I don't see much saving from a hacky middle-ground.

 -- Keir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  7:38           ` Keir Fraser
@ 2010-09-15  7:56             ` Dong, Eddie
  2010-09-15  8:15               ` Keir Fraser
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15  7:56 UTC (permalink / raw)
  To: Keir Fraser, He, Qing; +Cc: Tim Deegan, xen-devel, Dong, Eddie

Keir Fraser wrote:
> On 15/09/2010 08:17, "Qing He" <qing.he@intel.com> wrote:
> 
>>> What is wrong with simply extending x86_emulate to handle these
>>> VMX-related instructions? We've dealt with emulators provided by
>>> Intel guys in the past and frankly they were full of holes.
>> 
>> That needs additional callbacks for handling the VMCS and state changes,
>> doesn't it? I'm a little worried that it's too vmx-specific to get
>> into x86_emulate, and that's why we used a separate decoder in the
>> first place (I know it's ugly, though).
> 
> A few VMX-specific callbacks would be fine. Extra callbacks are
> cheap. Just focus on making the callback interface clean and tidy.
> I'd *much* rather have VMX-specific callbacks than an extra emulator.
> 
>> And if we are to use x86_emulate, is it possible to avoid redecoding
>> the opcode, because exit reason is already there?
> 
> If vmexit reason fully decodes the instruction for you then I would

Yes, VM exit reason + VMX_INSTRUCTION_INFO includes everything we need :)
It is mainly because the VMX instructions are simple.
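
For reference, extracting the operand description from that field looks
roughly like this (bit offsets as documented for the VM-instruction
information field; the variable names are made up):

    unsigned long info = __vmread(VMX_INSTRUCTION_INFO);
    int scaling   = info & 3;            /* bits 1:0 */
    int addr_size = (info >> 7) & 7;     /* bits 9:7 */
    int mem_reg   = (info >> 10) & 1;    /* bit 10: 0 = memory operand */
    int segment   = (info >> 15) & 7;    /* bits 17:15 */
    int index_reg = (info >> 18) & 0xf;  /* bits 21:18 */
    int base_reg  = (info >> 23) & 0xf;  /* bits 26:23 */
    int reg2      = (info >> 28) & 0xf;  /* bits 31:28 */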

> agree that skipping x86_emulate could make sense. And then your
> instruction emulator would be really simple and fast --
> vmx_io_instruction() is a good example of this. If you still need to

Sure, the emulation is very simple and fast; I can put all the privilege checks into one function as well.

> parse the instruction to decode ModRM and the like, then I don't see

No need to decode ModRM.

> that the partial decode from vmexit reason saves you much at all, and
> you might as well go the whole hog and do full decode. I don't see
> much saving from a hacky middle-ground. 

So how about we reuse some functions in x86_emulate like this one?

static enum x86_segment
decode_segment(uint8_t modrm_reg)
{
    switch ( modrm_reg )
    {
    case 0: return x86_seg_es;
    case 1: return x86_seg_cs;
    case 2: return x86_seg_ss;
    case 3: return x86_seg_ds;
    case 4: return x86_seg_fs;
    case 5: return x86_seg_gs;
    default: break;
    }
    return decode_segment_failed;
}

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  7:56             ` Dong, Eddie
@ 2010-09-15  8:15               ` Keir Fraser
  2010-09-15  9:26                 ` Tim Deegan
  0 siblings, 1 reply; 68+ messages in thread
From: Keir Fraser @ 2010-09-15  8:15 UTC (permalink / raw)
  To: Dong, Eddie, He, Qing; +Cc: Tim Deegan, xen-devel

On 15/09/2010 08:56, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>> that the partial decode from vmexit reason saves you much at all, and
>> you might as well go the whole hog and do full decode. I don't see
>> much saving from a hacky middle-ground.
> 
> So how about we reuse some functions in x86_emulate like this one?

Ah, well, now I look at your patch 06/16 properly, I think it's clear and
self-contained as it is. Your private enumerations within nest.c simply
serve to document the format of the decoded instruction provided to you via
fields in the VMCS. I wouldn't be inclined to change it at all, unless Tim
really has strong objections about it. It's not like you're defining
namespaces for new abstractions you have conjured from thin air -- they
correspond directly to a hardware-defined decode format. Defining
enumerations on top of that is *good*, imo. I would take 06/16 as it stands.

 -- Keir

> static enum x86_segment
> decode_segment(uint8_t modrm_reg)
> {
>     switch ( modrm_reg )
>     {
>     case 0: return x86_seg_es;
>     case 1: return x86_seg_cs;
>     case 2: return x86_seg_ss;
>     case 3: return x86_seg_ds;
>     case 4: return x86_seg_fs;
>     case 5: return x86_seg_gs;
>     default: break;
>     }
>     return decode_segment_failed;
> }
> 
> Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  7:31           ` Keir Fraser
@ 2010-09-15  8:15             ` Christoph Egger
  2010-09-15  8:23               ` Keir Fraser
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15  8:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Tim Deegan, Dong, Eddie, Keir Fraser, He, Qing

On Wednesday 15 September 2010 09:31:13 Keir Fraser wrote:
> On 15/09/2010 07:49, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> >> What is wrong with simply extending x86_emulate to handle these
> >> VMX-related instructions? We've dealt with emulators provided by
> >> Intel guys in the past and frankly they were full of holes.
> >
> > Certainly fine to move the VMX instruction emulation to hvm/emulate.c
> > if you don't think it is VMX specific :)
>
> It's the right place to put all instruction emulation, if at all possible.
> You will then presumably require at least one or two call-back hooks to
> caller context, at least to read/write VMCS, and that would be the place to
> determine whether these VMX instructions are executable. For example, SVM
> and PV emulation contexts would either leave the VMX callback hooks as
> NULL, and/or there will be checks for is-nested-VMX-guest in the VMX
> callback hooks, injecting #UD otherwise.
>
> The main trick with x86_emulate extensions is determining the correct neat
> small set of callback hooks to add, which is somewhat driven by deciding
> what should be emulated within x86_emulate and what should be left without
> for implementation in the caller's context.

There is a case where the host must emulate an instruction of the l2 guest
when the l1 guest doesn't intercept it.

When the vcpu is in guest mode, the fields in struct hvm_vcpu and
guest_cpu_user_regs() represent the l2 guest state in my patch series.

That way the instruction emulator works out of the box.
You need to add instructions to the emulator that are missing there.

Christoph


-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  8:15             ` Christoph Egger
@ 2010-09-15  8:23               ` Keir Fraser
  2010-09-15  9:08                 ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Keir Fraser @ 2010-09-15  8:23 UTC (permalink / raw)
  To: Christoph Egger, xen-devel; +Cc: Tim Deegan, Dong, Eddie, He, Qing

On 15/09/2010 09:15, "Christoph Egger" <Christoph.Egger@amd.com> wrote:

>> The main trick with x86_emulate extensions is determining the correct neat
>> small set of callback hooks to add, which is somewhat driven by deciding
>> what should be emulated within x86_emulate and what should be left without
>> for implementation in the caller's context.
> 
> There is a case where the host must emulate an instruction of the l2 guest
> when the l1 guest doesn't intercept it.
> 
> When the vcpu is in guest mode, the fields in struct hvm_vcpu and
> guest_cpu_user_regs() represent the l2 guest state in my patch series.
> 
> That way the instruction emulator works out of the box.

Well in this specific case, all VMX-related instructions executed by an L2
guest would properly cause vmexit to the L1 guest for emulation there. We
wouldn't want to emulate in Xen.

But yes I can see that emulation of L2 guest instructions is needed in some
other cases. Like instructions performing I/O in areas which L1 thinks it
has given L2 direct unmediated access to, but which Xen is actually
filtering or emulating.

 -- Keir

> You need to add instructions to the emulator that are missing there.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  8:23               ` Keir Fraser
@ 2010-09-15  9:08                 ` Dong, Eddie
  2010-09-15 11:39                   ` Keir Fraser
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15  9:08 UTC (permalink / raw)
  To: Keir Fraser, Christoph Egger, xen-devel; +Cc: Tim Deegan, Dong, Eddie, He, Qing

Keir Fraser wrote:
> On 15/09/2010 09:15, "Christoph Egger" <Christoph.Egger@amd.com>
> wrote: 
> 
>>> The main trick with x86_emulate extensions is determining the
>>> correct neat small set of callback hooks to add, which is somewhat
>>> driven by deciding what should be emulated within x86_emulate and
>>> what should be left without for implementation in the caller's
>>> context. 
>> 
>> There is a case where the host must emulate an instruction of the l2
>> guest when the l1 guest doesn't intercept it.
>> 
>> When the vcpu is in guest mode, the fields in struct hvm_vcpu and
>> guest_cpu_user_regs() represent the l2 guest state in my patch
>> series. 
>> 
>> That way the instruction emulator works out of the box.
> 
> Well in this specific case, all VMX-related instructions executed by
> an L2 guest would properly cause vmexit to the L1 guest for emulation
> there. We wouldn't want to emulate in Xen.

Yes, on the nested VMX side, the L0 VMM won't emulate L2 VMX instructions.

> 
> But yes I can see that emulation of L2 guest instructions is needed
> in some other cases. Like instructions performing I/O in areas which
> L1 thinks it has given L2 direct unmediated access to, but which Xen
> is actually filtering or emulating.

That may become an issue when we support virtual VT-d for nested I/O performance. But not now :)
I suggest we leave that to the future, at least on the nested VMX side, where the L0 VMM doesn't directly emulate L2 guest instructions.

We can see if there are other needs for that kind of case.

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  8:15               ` Keir Fraser
@ 2010-09-15  9:26                 ` Tim Deegan
  2010-09-15  9:56                   ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Tim Deegan @ 2010-09-15  9:26 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Dong, Eddie, He, Qing

At 09:15 +0100 on 15 Sep (1284542116), Keir Fraser wrote:
> On 15/09/2010 08:56, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> 
> >> that the partial decode from vmexit reason saves you much at all, and
> >> you might as well go the whole hog and do full decode. I don't see
> >> much saving from a hacky middle-ground.
> > 
> > So how about we reuse some functions in x86_emulate like this one?
> 
> Ah, well, now I look at your patch 06/16 properly, I think it's clear and
> self-contained as it is. Your private enumerations within nest.c simply
> serve to document the format of the decoded instruction provided to you via
> fields in the VMCS. I wouldn't be inclined to change it at all, unless Tim
> really has strong objections about it.

No, that's OK.

> It's not like you're defining
> namespaces for new abstractions you have conjured from thin air -- they
> correspond directly to a hardware-defined decode format. Defining
> enumerations on top of that is *good*, imo. I would take 06/16 as it stands.

Fair enough, but I'd like the memory leak fixed too (svmcs and vvmcs are
only freed if the N1 guest executes VMXOFF).

Cheers,

Tim.

> > static enum x86_segment
> > decode_segment(uint8_t modrm_reg)
> > {
> >     switch ( modrm_reg )
> >     {
> >     case 0: return x86_seg_es;
> >     case 1: return x86_seg_cs;
> >     case 2: return x86_seg_ss;
> >     case 3: return x86_seg_ds;
> >     case 4: return x86_seg_fs;
> >     case 5: return x86_seg_gs;
> >     default: break;
> >     }
> >     return decode_segment_failed;
> > }
> > 
> > Thx, Eddie
> 
> 

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, XenServer Engineering
Citrix Systems UK Ltd.  (Company #02937203, SL9 0BG)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE
  2010-09-08 15:22 ` [PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE Qing He
@ 2010-09-15  9:43   ` Christoph Egger
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Egger @ 2010-09-15  9:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 08 September 2010 17:22:24 Qing He wrote:
> expose VMX cpuid and allow guest to enable VMX.
>
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>
> ---
>
> diff -r 3f40a1f79cf8 tools/libxc/xc_cpuid_x86.c
> --- a/tools/libxc/xc_cpuid_x86.c	Wed Sep 08 19:47:39 2010 +0800
> +++ b/tools/libxc/xc_cpuid_x86.c	Wed Sep 08 19:49:06 2010 +0800
> @@ -128,8 +128,17 @@
>      const unsigned int *input, unsigned int *regs,
>      int is_pae)
>  {
> +    unsigned long nest;
> +
>      switch ( input[0] )
>      {
> +    case 0x00000001:
> +        /* ECX[5] is availability of VMX */
> +        xc_get_hvm_param(xch, domid, HVM_PARAM_NESTEDHVM, &nest);
> +        if (nest)
> +            regs[2] |= 0x20;
> +        break;
> +

I merged this part into my tools patch this way:

              /* ECX[5] is availability of VMX */
              if (is_nestedhvm)
                  set_bit(X86_FEATURE_VMXE, regs[2]);
              break;


>      case 0x00000004:
>          /*
>           * EAX[31:26] is Maximum Cores Per Package (minus one).
> diff -r 3f40a1f79cf8 xen/include/asm-x86/hvm/hvm.h
> --- a/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 19:47:39 2010 +0800
> +++ b/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 19:49:06 2010 +0800
> @@ -295,7 +295,8 @@
>          X86_CR4_DE  | X86_CR4_PSE | X86_CR4_PAE |       \
>          X86_CR4_MCE | X86_CR4_PGE | X86_CR4_PCE |       \
>          X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT |           \
> -        (cpu_has_xsave ? X86_CR4_OSXSAVE : 0))))
> +        (cpu_has_xsave ? X86_CR4_OSXSAVE : 0)   |       \
> +        X86_CR4_VMXE)))


I changed this to

         (cpu_has_vmx ? X86_CR4_VMXE : 0))))

where cpu_has_vmx is defined in <asm/cpufeature.h> as

 #define cpu_has_vmx  boot_cpu_has(X86_FEATURE_VMXE)

Christoph

>  /* These exceptions must always be intercepted. */
>  #define HVM_TRAP_MASK ((1U << TRAP_machine_check) | (1U <<
> TRAP_invalid_op))
>



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 08/16] vmx: nest: vmresume/vmlaunch
  2010-09-08 15:22 ` [PATCH 08/16] vmx: nest: vmresume/vmlaunch Qing He
@ 2010-09-15  9:52   ` Christoph Egger
  2010-09-15 11:30     ` Christoph Egger
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15  9:52 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 08 September 2010 17:22:16 Qing He wrote:
> vmresume and vmlaunch instructions and transitional states
>
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>
> ---
>
> diff -r e828d55c10bb xen/arch/x86/hvm/vmx/nest.c
> --- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 21:42:10 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:04:16 2010 +0800
> @@ -633,3 +633,33 @@
>      hvm_inject_exception(TRAP_invalid_op, 0, 0);
>      return X86EMUL_EXCEPTION;
>  }
> +
> +int vmx_nest_handle_vmresume(struct cpu_user_regs *regs)
> +{
> +    struct vcpu *v = current;
> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> +    int rc;
> +
> +    if ( unlikely(!nest->guest_vmxon_pa) )
> +        goto invalid_op;
> +
> +    rc = vmx_inst_check_privilege(regs);
> +    if ( rc != X86EMUL_OKAY )
> +        return rc;
> +
> +    if ( nest->vmcs_valid == 1 )
> +        nest->vmresume_pending = 1;
> +    else
> +        vmreturn(regs, VMFAIL_INVALID);
> +
> +    return X86EMUL_OKAY;
> +
> +invalid_op:
> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> +    return X86EMUL_EXCEPTION;
> +}
> +
> +int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs)
> +{
> +    return vmx_nest_handle_vmresume(regs);
> +}
> diff -r e828d55c10bb xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 21:42:10 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:04:16 2010 +0800
> @@ -2321,6 +2321,11 @@
>      /* Now enable interrupts so it's safe to take locks. */
>      local_irq_enable();
>
> +    /* XXX: This looks ugly, but we need a mechanism to ensure
> +     * any pending vmresume has really happened
> +     */
> +    v->arch.hvm_vmx.nest.vmresume_in_progress = 0;
> +
>      if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
>          return vmx_failed_vmentry(exit_reason, regs);
>
> @@ -2592,6 +2597,11 @@
>          if ( vmx_nest_handle_vmclear(regs) == X86EMUL_OKAY )
>              __update_guest_eip(inst_len);
>          break;
> +    case EXIT_REASON_VMLAUNCH:
> +        inst_len = __get_instruction_length();
> +        if ( vmx_nest_handle_vmlaunch(regs) == X86EMUL_OKAY )
> +            __update_guest_eip(inst_len);
> +        break;
>      case EXIT_REASON_VMPTRLD:
>          inst_len = __get_instruction_length();
>          if ( vmx_nest_handle_vmptrld(regs) == X86EMUL_OKAY )
> @@ -2607,6 +2617,11 @@
>          if ( vmx_nest_handle_vmread(regs) == X86EMUL_OKAY )
>              __update_guest_eip(inst_len);
>          break;
> +    case EXIT_REASON_VMRESUME:
> +        inst_len = __get_instruction_length();
> +        if ( vmx_nest_handle_vmresume(regs) == X86EMUL_OKAY )
> +            __update_guest_eip(inst_len);
> +        break;
>      case EXIT_REASON_VMWRITE:
>          inst_len = __get_instruction_length();
>          if ( vmx_nest_handle_vmwrite(regs) == X86EMUL_OKAY )
> @@ -2625,8 +2640,6 @@
>
>      case EXIT_REASON_MWAIT_INSTRUCTION:
>      case EXIT_REASON_MONITOR_INSTRUCTION:
> -    case EXIT_REASON_VMLAUNCH:
> -    case EXIT_REASON_VMRESUME:
>          vmx_inject_hw_exception(TRAP_invalid_op,
> HVM_DELIVER_NO_ERROR_CODE); break;
>
> diff -r e828d55c10bb xen/include/asm-x86/hvm/vmx/nest.h
> --- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:42:10 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:04:16 2010 +0800
> @@ -40,6 +40,20 @@
>      void                *vvmcs;
>      struct vmcs_struct  *svmcs;
>      int                  vmcs_valid;
> +
> +    /*
> +     * vmexit_pending and vmresume_pending is to mark pending
> +     * switches, they are cleared when physical vmcs is changed.
> +     */
> +    int                  vmexit_pending;
> +    int                  vmresume_pending;

This is functionally equal to the vmentry flag in struct nestedhvm.

> +    /*
> +     * upon L1->L2, there is a window between context switch and
> +     * the physical vmentry of the shadow vmcs, protect against it
> +     * with vmresume_in_progress
> +     */
> +    int                  vmresume_in_progress;
>  };

What is the window you describe in the comment? And what problem does
it cause?

You use this to check if you can inject interrupts into the l2 guest.
To merge the interrupt code as Tim said in
http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00747.html
you can probably use the nh_gif field for this.


Christoph


>  int vmx_nest_handle_vmxon(struct cpu_user_regs *regs);
> @@ -52,4 +66,7 @@
>  int vmx_nest_handle_vmread(struct cpu_user_regs *regs);
>  int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs);
>
> +int vmx_nest_handle_vmresume(struct cpu_user_regs *regs);
> +int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs);
> +
>  #endif /* __ASM_X86_HVM_NEST_H__ */
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  9:26                 ` Tim Deegan
@ 2010-09-15  9:56                   ` Dong, Eddie
  2010-09-15 11:46                     ` Keir Fraser
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15  9:56 UTC (permalink / raw)
  To: Tim Deegan, Keir Fraser; +Cc: xen-devel, Dong, Eddie, He, Qing


>> It's not like you're defining
>> namespaces for new abstractions you have conjured from thin air --
>> they correspond directly to a hardware-defined decode format.
>> Defining enumerations on top of that is *good*, imo. I would take
>> 06/16 as it stands. 
> 
> Fair enough, but I'd like the memory leak fixed too (svmcs and vvmcs
> are only freed if the N1 guest executes VMXOFF).
> 
Sure. Fixed it locally at vmx_destroy_vmcs.

BTW, how do you like CONFIG_VVMCS_MAPPING? I feel it makes things a little bit more complicated.

And how about renaming vvmcs to vmcs12 (the VMCS used by the L1 VMM for the L2 guest), and
renaming svmcs to vmcs02 (the VMCS used by the L0 VMM for the L2 guest)?
Of course hvmcs becomes vmcs01 then (the VMCS used by the L0 VMM for the L1 guest).

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 15/16] vmx: nest: capability reporting MSRs
  2010-09-08 15:22 ` [PATCH 15/16] vmx: nest: capability reporting MSRs Qing He
  2010-09-13 12:45   ` Tim Deegan
@ 2010-09-15 10:05   ` Christoph Egger
  2010-09-15 14:28     ` Dong, Eddie
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 10:05 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 08 September 2010 17:22:23 Qing He wrote:
> handles VMX capability reporting MSRs.
> Some features are masked so L1 would see a rather
> simple configuration
>
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>


Are there any vmx capability features that are read out via cpuid?
If yes, then that code belongs in the tools patch.

In SVM the nestedhvm_vcpu_features hook is empty, and for MSRs
there are already two msr hooks, namely msr_read_intercept and
msr_write_intercept. I assume the functions below are all called from
there directly or indirectly.

That renders the nestedhvm_vcpu_features hook useless, and I will remove it.

Christoph

> ---
>
> diff -r 694dcf6c3f06 xen/arch/x86/hvm/vmx/nest.c
> --- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 19:47:14 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 19:47:39 2010 +0800
> @@ -1352,3 +1352,91 @@
>
>      return bypass_l0;
>  }
> +
> +/*
> + * Capability reporting
> + */
> +int vmx_nest_msr_read_intercept(unsigned int msr, u64 *msr_content)
> +{
> +    u32 eax, edx;
> +    u64 data = 0;
> +    int r = 1;
> +    u32 mask = 0;
> +
> +    if ( !is_nested_avail(current->domain) )
> +        return 0;
> +
> +    switch (msr) {
> +    case MSR_IA32_VMX_BASIC:
> +        rdmsr(msr, eax, edx);
> +        data = edx;
> +        data = (data & ~0x1fff) | 0x1000;     /* request 4KB for guest
> VMCS */ +        data &= ~(1 << 23);                   /* disable
> TRUE_xxx_CTLS */ +        data = (data << 32) | VVMCS_REVISION; /* VVMCS
> revision */ +        break;
> +    case MSR_IA32_VMX_PINBASED_CTLS:
> +#define REMOVED_PIN_CONTROL_CAP (PIN_BASED_PREEMPT_TIMER)
> +        rdmsr(msr, eax, edx);
> +        data = edx;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_PROCBASED_CTLS:
> +        rdmsr(msr, eax, edx);
> +#define REMOVED_EXEC_CONTROL_CAP (CPU_BASED_TPR_SHADOW \
> +            | CPU_BASED_ACTIVATE_MSR_BITMAP            \
> +            | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
> +        data = edx & ~REMOVED_EXEC_CONTROL_CAP;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_EXIT_CTLS:
> +        rdmsr(msr, eax, edx);
> +#define REMOVED_EXIT_CONTROL_CAP (VM_EXIT_SAVE_GUEST_PAT \
> +            | VM_EXIT_LOAD_HOST_PAT                      \
> +            | VM_EXIT_SAVE_GUEST_EFER                    \
> +            | VM_EXIT_LOAD_HOST_EFER                     \
> +            | VM_EXIT_SAVE_PREEMPT_TIMER)
> +        data = edx & ~REMOVED_EXIT_CONTROL_CAP;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_ENTRY_CTLS:
> +        rdmsr(msr, eax, edx);
> +#define REMOVED_ENTRY_CONTROL_CAP (VM_ENTRY_LOAD_GUEST_PAT \
> +            | VM_ENTRY_LOAD_GUEST_EFER)
> +        data = edx & ~REMOVED_ENTRY_CONTROL_CAP;
> +        data = (data << 32) | eax;
> +        break;
> +    case MSR_IA32_VMX_PROCBASED_CTLS2:
> +        mask = 0;
> +
> +        rdmsr(msr, eax, edx);
> +        data = edx & mask;
> +        data = (data << 32) | eax;
> +        break;
> +
> +    /* pass through MSRs */
> +    case IA32_FEATURE_CONTROL_MSR:
> +    case MSR_IA32_VMX_MISC:
> +    case MSR_IA32_VMX_CR0_FIXED0:
> +    case MSR_IA32_VMX_CR0_FIXED1:
> +    case MSR_IA32_VMX_CR4_FIXED0:
> +    case MSR_IA32_VMX_CR4_FIXED1:
> +    case MSR_IA32_VMX_VMCS_ENUM:
> +        rdmsr(msr, eax, edx);
> +        data = edx;
> +        data = (data << 32) | eax;
> +        break;
> +
> +    default:
> +        r = 0;
> +        break;
> +    }
> +
> +    *msr_content = data;
> +    return r;
> +}
> +
> +int vmx_nest_msr_write_intercept(unsigned int msr, u64 msr_content)
> +{
> +    /* silently ignore for now */
> +    return 1;
> +}
> diff -r 694dcf6c3f06 xen/arch/x86/hvm/vmx/vmx.c
> --- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 19:47:14 2010 +0800
> +++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 19:47:39 2010 +0800
> @@ -1877,8 +1877,11 @@
>          *msr_content |= (u64)__vmread(GUEST_IA32_DEBUGCTL_HIGH) << 32;
>  #endif
>          break;
> -    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_PROCBASED_CTLS2:
> -        goto gp_fault;
> +    case IA32_FEATURE_CONTROL_MSR:
> +    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_TRUE_ENTRY_CTLS:
> +        if ( !vmx_nest_msr_read_intercept(msr, msr_content) )
> +            goto gp_fault;
> +        break;
>      case MSR_IA32_MISC_ENABLE:
>          rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
>          /* Debug Trace Store is not supported. */
> @@ -2043,8 +2046,11 @@
>
>          break;
>      }
> -    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_PROCBASED_CTLS2:
> -        goto gp_fault;
> +    case IA32_FEATURE_CONTROL_MSR:
> +    case MSR_IA32_VMX_BASIC...MSR_IA32_VMX_TRUE_ENTRY_CTLS:
> +        if ( !vmx_nest_msr_write_intercept(msr, msr_content) )
> +            goto gp_fault;
> +        break;
>      default:
>          if ( vpmu_do_wrmsr(msr, msr_content) )
>              return X86EMUL_OKAY;
> diff -r 694dcf6c3f06 xen/include/asm-x86/hvm/vmx/nest.h
> --- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 19:47:14 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 19:47:39 2010 +0800
> @@ -76,4 +76,9 @@
>  int vmx_nest_l2_vmexit_handler(struct cpu_user_regs *regs,
>                                 unsigned int exit_reason);
>
> +int vmx_nest_msr_read_intercept(unsigned int msr,
> +                                u64 *msr_content);
> +int vmx_nest_msr_write_intercept(unsigned int msr,
> +                                 u64 msr_content);
> +
>  #endif /* __ASM_X86_HVM_NEST_H__ */
> diff -r 694dcf6c3f06 xen/include/asm-x86/hvm/vmx/vmcs.h
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 19:47:14 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 19:47:39 2010 +0800
> @@ -161,18 +161,23 @@
>  #define PIN_BASED_EXT_INTR_MASK         0x00000001
>  #define PIN_BASED_NMI_EXITING           0x00000008
>  #define PIN_BASED_VIRTUAL_NMIS          0x00000020
> +#define PIN_BASED_PREEMPT_TIMER         0x00000040
>  extern u32 vmx_pin_based_exec_control;
>
>  #define VM_EXIT_IA32E_MODE              0x00000200
>  #define VM_EXIT_ACK_INTR_ON_EXIT        0x00008000
>  #define VM_EXIT_SAVE_GUEST_PAT          0x00040000
>  #define VM_EXIT_LOAD_HOST_PAT           0x00080000
> +#define VM_EXIT_SAVE_GUEST_EFER         0x00100000
> +#define VM_EXIT_LOAD_HOST_EFER          0x00200000
> +#define VM_EXIT_SAVE_PREEMPT_TIMER      0x00400000
>  extern u32 vmx_vmexit_control;
>
>  #define VM_ENTRY_IA32E_MODE             0x00000200
>  #define VM_ENTRY_SMM                    0x00000400
>  #define VM_ENTRY_DEACT_DUAL_MONITOR     0x00000800
>  #define VM_ENTRY_LOAD_GUEST_PAT         0x00004000
> +#define VM_ENTRY_LOAD_GUEST_EFER        0x00008000
>  extern u32 vmx_vmentry_control;
>
>  #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001
> diff -r 694dcf6c3f06 xen/include/asm-x86/msr-index.h
> --- a/xen/include/asm-x86/msr-index.h	Wed Sep 08 19:47:14 2010 +0800
> +++ b/xen/include/asm-x86/msr-index.h	Wed Sep 08 19:47:39 2010 +0800
> @@ -172,6 +172,7 @@
>  #define MSR_IA32_VMX_CR0_FIXED1                 0x487
>  #define MSR_IA32_VMX_CR4_FIXED0                 0x488
>  #define MSR_IA32_VMX_CR4_FIXED1                 0x489
> +#define MSR_IA32_VMX_VMCS_ENUM                  0x48a
>  #define MSR_IA32_VMX_PROCBASED_CTLS2            0x48b
>  #define MSR_IA32_VMX_EPT_VPID_CAP               0x48c
>  #define MSR_IA32_VMX_TRUE_PINBASED_CTLS         0x48d
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 04/16] vmx: nest: nested control structure
  2010-09-08 15:22 ` [PATCH 04/16] vmx: nest: nested control structure Qing He
  2010-09-09  6:13   ` Dong, Eddie
@ 2010-09-15 11:27   ` Christoph Egger
  2010-09-15 13:06     ` Dong, Eddie
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 11:27 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 08 September 2010 17:22:12 Qing He wrote:
> v->arch.hvm_vmx.nest as control structure
>
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>
> ---
> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/nest.h
> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:03:41 2010 +0800
> @@ -0,0 +1,45 @@
> +/*
> + * nest.h: nested virtualization for VMX.
> + *
> + * Copyright (c) 2010, Intel Corporation.
> + * Author: Qing He <qing.he@intel.com>
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> for + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> with + * this program; if not, write to the Free Software Foundation, Inc.,
> 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + *
> + */
> +#ifndef __ASM_X86_HVM_NEST_H__
> +#define __ASM_X86_HVM_NEST_H__
> +
> +struct vmcs_struct;
> +
> +struct vmx_nest_struct {

Is it OK to name it 'struct nestedvmx'?

> +    paddr_t              guest_vmxon_pa;
> +
> +    /* Saved host vmcs for vcpu itself */
> +    struct vmcs_struct  *hvmcs;
> +
> +    /*
> +     * Guest's `current vmcs' of vcpu
> +     *  - gvmcs_pa: guest VMCS region physical address
> +     *  - vvmcs:    (guest) virtual vmcs
> +     *  - svmcs:    effective vmcs for the guest of this vcpu
> +     *  - valid:    launch state: invalid on clear, valid on ld
> +     */
> +    paddr_t              gvmcs_pa;
> +    void                *vvmcs;
> +    struct vmcs_struct  *svmcs;
> +    int                  vmcs_valid;
> +};
> +
> +#endif /* __ASM_X86_HVM_NEST_H__ */
> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/vmcs.h
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:00:00 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:03:41 2010 +0800
> @@ -22,6 +22,7 @@
>  #include <asm/config.h>
>  #include <asm/hvm/io.h>
>  #include <asm/hvm/vpmu.h>
> +#include <asm/hvm/vmx/nest.h>
>
>  extern void vmcs_dump_vcpu(struct vcpu *v);
>  extern void setup_vmcs_dump(void);
> @@ -99,6 +100,9 @@
>      u32                  secondary_exec_control;
>      u32                  exception_bitmap;
>
> +    /* nested virtualization */
> +    struct vmx_nest_struct nest;
> +
>  #ifdef __x86_64__
>      struct vmx_msr_state msr_state;
>      unsigned long        shadow_gs;


I think the structure should be allocated in the nestedhvm_vcpu_initialise()
function hook and assigned to the nh_arch pointer in struct nestedhvm.

Christoph

-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 08/16] vmx: nest: vmresume/vmlaunch
  2010-09-15  9:52   ` Christoph Egger
@ 2010-09-15 11:30     ` Christoph Egger
  2010-09-20  5:19       ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 11:30 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 15 September 2010 11:52:26 Christoph Egger wrote:
> On Wednesday 08 September 2010 17:22:16 Qing He wrote:
> > vmresume and vmlaunch instructions and transitional states
> >
> > Signed-off-by: Qing He <qing.he@intel.com>
> > Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> >
> > ---
> >
> > diff -r e828d55c10bb xen/arch/x86/hvm/vmx/nest.c
> > --- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 21:42:10 2010 +0800
> > +++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:04:16 2010 +0800
> > @@ -633,3 +633,33 @@
> >      hvm_inject_exception(TRAP_invalid_op, 0, 0);
> >      return X86EMUL_EXCEPTION;
> >  }
> > +
> > +int vmx_nest_handle_vmresume(struct cpu_user_regs *regs)
> > +{
> > +    struct vcpu *v = current;
> > +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;
> > +    int rc;
> > +
> > +    if ( unlikely(!nest->guest_vmxon_pa) )
> > +        goto invalid_op;
> > +
> > +    rc = vmx_inst_check_privilege(regs);
> > +    if ( rc != X86EMUL_OKAY )
> > +        return rc;
> > +
> > +    if ( nest->vmcs_valid == 1 )
> > +        nest->vmresume_pending = 1;
> > +    else
> > +        vmreturn(regs, VMFAIL_INVALID);
> > +
> > +    return X86EMUL_OKAY;
> > +
> > +invalid_op:
> > +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
> > +    return X86EMUL_EXCEPTION;
> > +}
> > +
> > +int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs)
> > +{
> > +    return vmx_nest_handle_vmresume(regs);
> > +}
> > diff -r e828d55c10bb xen/arch/x86/hvm/vmx/vmx.c
> > --- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 21:42:10 2010 +0800
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:04:16 2010 +0800
> > @@ -2321,6 +2321,11 @@
> >      /* Now enable interrupts so it's safe to take locks. */
> >      local_irq_enable();
> >
> > +    /* XXX: This looks ugly, but we need a mechanism to ensure
> > +     * any pending vmresume has really happened
> > +     */
> > +    v->arch.hvm_vmx.nest.vmresume_in_progress = 0;
> > +
> >      if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
> >          return vmx_failed_vmentry(exit_reason, regs);
> >
> > @@ -2592,6 +2597,11 @@
> >          if ( vmx_nest_handle_vmclear(regs) == X86EMUL_OKAY )
> >              __update_guest_eip(inst_len);
> >          break;
> > +    case EXIT_REASON_VMLAUNCH:
> > +        inst_len = __get_instruction_length();
> > +        if ( vmx_nest_handle_vmlaunch(regs) == X86EMUL_OKAY )
> > +            __update_guest_eip(inst_len);
> > +        break;
> >      case EXIT_REASON_VMPTRLD:
> >          inst_len = __get_instruction_length();
> >          if ( vmx_nest_handle_vmptrld(regs) == X86EMUL_OKAY )
> > @@ -2607,6 +2617,11 @@
> >          if ( vmx_nest_handle_vmread(regs) == X86EMUL_OKAY )
> >              __update_guest_eip(inst_len);
> >          break;
> > +    case EXIT_REASON_VMRESUME:
> > +        inst_len = __get_instruction_length();
> > +        if ( vmx_nest_handle_vmresume(regs) == X86EMUL_OKAY )
> > +            __update_guest_eip(inst_len);
> > +        break;
> >      case EXIT_REASON_VMWRITE:
> >          inst_len = __get_instruction_length();
> >          if ( vmx_nest_handle_vmwrite(regs) == X86EMUL_OKAY )
> > @@ -2625,8 +2640,6 @@
> >
> >      case EXIT_REASON_MWAIT_INSTRUCTION:
> >      case EXIT_REASON_MONITOR_INSTRUCTION:
> > -    case EXIT_REASON_VMLAUNCH:
> > -    case EXIT_REASON_VMRESUME:
> >          vmx_inject_hw_exception(TRAP_invalid_op,
> > HVM_DELIVER_NO_ERROR_CODE); break;
> >
> > diff -r e828d55c10bb xen/include/asm-x86/hvm/vmx/nest.h
> > --- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:42:10 2010 +0800
> > +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:04:16 2010 +0800
> > @@ -40,6 +40,20 @@
> >      void                *vvmcs;
> >      struct vmcs_struct  *svmcs;
> >      int                  vmcs_valid;
> > +
> > +    /*
> > +     * vmexit_pending and vmresume_pending is to mark pending
> > +     * switches, they are cleared when physical vmcs is changed.
> > +     */
> > +    int                  vmexit_pending;
> > +    int                  vmresume_pending;
>
> This is functionally equal to the vmentry flag in struct nestedhvm.

Is it possible to have both vmexit and vmresume pending at the same time?

Christoph

>
> > +    /*
> > +     * upon L1->L2, there is a window between context switch and
> > +     * the physical vmentry of the shadow vmcs, protect against it
> > +     * with vmresume_in_progress
> > +     */
> > +    int                  vmresume_in_progress;
> >  };
>
> What is the window you describe in the comment? And what problem does
> it cause?
>
> You use this to check if you can inject interrupts into the l2 guest.
> To merge the interrupt code as Tim said in
> http://lists.xensource.com/archives/html/xen-devel/2010-09/msg00747.html
> you can probably use the nh_gif field for this.
>
>
> Christoph
>
> >  int vmx_nest_handle_vmxon(struct cpu_user_regs *regs);
> > @@ -52,4 +66,7 @@
> >  int vmx_nest_handle_vmread(struct cpu_user_regs *regs);
> >  int vmx_nest_handle_vmwrite(struct cpu_user_regs *regs);
> >
> > +int vmx_nest_handle_vmresume(struct cpu_user_regs *regs);
> > +int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs);
> > +
> >  #endif /* __ASM_X86_HVM_NEST_H__ */
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel



-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  9:08                 ` Dong, Eddie
@ 2010-09-15 11:39                   ` Keir Fraser
  2010-09-15 12:36                     ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Keir Fraser @ 2010-09-15 11:39 UTC (permalink / raw)
  To: Dong, Eddie, Christoph Egger, xen-devel; +Cc: Tim Deegan, He, Qing

On 15/09/2010 10:08, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>> But yes I can see that emulation of L2 guest instructions is needed
>> in some other cases. Like instructions performing I/O in areas which
>> L1 thinks it has given L2 direct unmediated access to, but which Xen
>> is actually filtering or emulating.
> 
> That may become an issue when we support virtual VT-d for nested I/O
> performance. But not now :)

Actually it is an issue now. This has nothing to do with VT-d (ie. IOMMU,
irq remapping, etc) but with basic core VMX functionality -- per I/O port
direct execute versus vmexit; per virtual-address page direct access versus
#PF vmexit; per physical-frame direct access versus nested-paging vmexit. In
any of these cases the L1 may think it has given direct unfettered access to
the L2, but L0 (Xen) is actually blocking it. In this case any resulting
required instruction emulations have to be performed by Xen on behalf of the
L2, without L1's help or knowledge. Consider that even in Xen we give all
HVM guests direct access to things like port 0x80 for perf reasons. Maybe a
VMM running in L1 would give direct access to even more than that -- in such
cases Xen must be able to emulate those 'direct' accesses.

Now this shouldn't be hard to arrange anyhow. When you vmexit to Xen from a
running L2 guest, the saved general-purpose CPU state will be the L2's
state, and that is what you would want x86_emulate to see. But it does
require some thought, and it is not merely an extension to be dealt with
later. It is core VMX stuff and hence core nested VMX stuff.
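
As a sketch, using the existing hvm/emulate.c entry points on whatever
state the vmexit left behind:

    struct hvm_emulate_ctxt ctxt;

    hvm_emulate_prepare(&ctxt, guest_cpu_user_regs());
    switch ( hvm_emulate_one(&ctxt) )
    {
    case X86EMUL_OKAY:
        hvm_emulate_writeback(&ctxt);
        break;
    default:
        /* Inject a fault, or reflect the exit to L1, as appropriate. */
        break;
    }

Since the saved regs are already the L2 regs on such an exit, nothing
nested-specific should be needed on this path.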

 -- Keir

> I suggest we leave that to the future, at least on the nested VMX side,
> where the L0 VMM doesn't directly emulate L2 guest instructions.
> 
> We can see if there are other needs for that kind of case.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 03/16] vmx: nest: nested availability and status flags
  2010-09-08 15:22 ` [PATCH 03/16] vmx: nest: nested availability and status flags Qing He
@ 2010-09-15 11:43   ` Christoph Egger
  2010-09-15 14:18     ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 11:43 UTC (permalink / raw)
  To: xen-devel; +Cc: Qing He

On Wednesday 08 September 2010 17:22:11 Qing He wrote:
> These are the vendor neutral availability and status flags of nested
> virtualization.
>
> The availability hvm parameter can be used to disable all reporting
> and functions of nested, improving guest security in certain circumstances.
>
> The per vcpu flag in_nesting is used to indicate fundamental status:
> the current mode.
>
> Signed-off-by: Qing He <qing.he@intel.com>
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>
> ---
> diff -r 11c98ab76326 xen/include/asm-x86/hvm/hvm.h
> --- a/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 20:35:38 2010 +0800
> +++ b/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 20:36:19 2010 +0800
> @@ -250,6 +250,10 @@
>  #define is_viridian_domain(_d)                                            
> \ (is_hvm_domain(_d) && ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN]))
>
> +#define is_nested_avail(_d)                                               
> \ + (is_hvm_domain(_d) &&
> ((_d)->arch.hvm_domain.params[HVM_PARAM_NESTEDHVM])) +
> +

That is functionally equal to nestedhvm_enabled() in my patch series.
The is_hvm_domain() check is not necessary; the tools patch checks
that nestedhvm is for hvm guests only.

>  void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
>                                     unsigned int *ecx, unsigned int *edx);
>  void hvm_migrate_timers(struct vcpu *v);
> diff -r 11c98ab76326 xen/include/asm-x86/hvm/vcpu.h
> --- a/xen/include/asm-x86/hvm/vcpu.h	Wed Sep 08 20:35:38 2010 +0800
> +++ b/xen/include/asm-x86/hvm/vcpu.h	Wed Sep 08 20:36:19 2010 +0800
> @@ -71,6 +71,8 @@
>      bool_t              debug_state_latch;
>      bool_t              single_step;
>
> +    bool_t              in_nesting;

This is functionally equal to nestedhvm_vcpu_in_guestmode() in my patch series.

> +
>      u64                 asid_generation;
>      u32                 asid;
>
> diff -r 11c98ab76326 xen/include/public/hvm/params.h
> --- a/xen/include/public/hvm/params.h	Wed Sep 08 20:35:38 2010 +0800
> +++ b/xen/include/public/hvm/params.h	Wed Sep 08 20:36:19 2010 +0800
> @@ -113,6 +113,9 @@
>  #define HVM_PARAM_CONSOLE_PFN    17
>  #define HVM_PARAM_CONSOLE_EVTCHN 18
>
> -#define HVM_NR_PARAMS          19
> +/* Boolean: Enable nested virtualization (hvm only) */
> +#define HVM_PARAM_NESTEDHVM    19
> +
> +#define HVM_NR_PARAMS          20
>
>  #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */

I already have this part in my tools patch.

Christoph


-- 
---to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Alberto Bozzo, Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15  9:56                   ` Dong, Eddie
@ 2010-09-15 11:46                     ` Keir Fraser
  0 siblings, 0 replies; 68+ messages in thread
From: Keir Fraser @ 2010-09-15 11:46 UTC (permalink / raw)
  To: Dong, Eddie, Tim Deegan; +Cc: xen-devel, He, Qing

On 15/09/2010 10:56, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>> Fair enough, but I'd like the memory leak fixed too (svmcs and vvmcs
>> are only freed if the N1 guest executes VMXOFF).
>> 
> Sure. Fixed it locally at vmx_destroy_vmcs.
> 
> BTW, how do you like CONFIG_VVMCS_MAPPING? I feel it makes things a little
> bit more complicated.

Ah yes, presumably you will be picking one or the other and getting rid of
ifdefs in a future spin of this patchset? I don't personally care whether
you map or copy, though the former should be faster I guess? Anyway I think
your logic for mapping is too short and simplistic -- look at
hvm_map_entry() to see how it uses gfn_to_mfn_unshare (necessary if you will
be modifying the page) and handles the various return codes from that. You
could even factor out a common helper routine from hvm_map_entry() that you
could then use rather than open-coding very similar logic.
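
Something along these lines, say (a sketch only -- the helper name is made
up, and the error handling would need to match what hvm_map_entry()
actually does with the gfn_to_mfn_unshare return codes):

    static void *map_guest_frame(unsigned long gfn)
    {
        p2m_type_t p2mt;
        mfn_t mfn = gfn_to_mfn_unshare(current->domain, gfn, &p2mt, 0);

        if ( !p2m_is_ram(p2mt) )
            return NULL;    /* caller fails the VMX instruction */
        return map_domain_page(mfn_x(mfn));
    }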

> And how about renaming vvmcs to vmcs12 (the VMCS used by the L1 VMM for the
> L2 guest), and renaming svmcs to vmcs02 (the VMCS used by the L0 VMM for the
> L2 guest)? Of course hvmcs becomes vmcs01 then (the VMCS used by the L0 VMM
> for the L1 guest).

No opinion myself. :-)

 -- Keir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15 11:39                   ` Keir Fraser
@ 2010-09-15 12:36                     ` Dong, Eddie
  2010-09-15 13:12                       ` Keir Fraser
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15 12:36 UTC (permalink / raw)
  To: Keir Fraser, Christoph Egger, xen-devel; +Cc: Tim Deegan, Dong, Eddie, He, Qing

Keir Fraser wrote:
> On 15/09/2010 10:08, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> 
>>> But yes I can see that emulation of L2 guest instructions is needed
>>> in some other cases. Like instructions performing I/O in areas which
>>> L1 thinks it has given L2 direct unmediated access to, but which Xen
>>> is actually filtering or emulating.
>> 
>> That may become an issue when we support virtual VT-d for nested
>> I/O performance. But not now :)
> 
> Actually it is an issue now. This has nothing to do with VT-d (ie.
> IOMMU, irq remapping, etc) but with basic core VMX functionality --
> per I/O port direct execute versus vmexit; per virtual-address page

I see; for the I/O ports, right now we are letting L1 handle them even though it doesn't expect to :(
How about removing the CPU_BASED_ACTIVATE_IO_BITMAP capability from the L1 VMM for now, to focus on the framework?
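
Concretely (a sketch), the capability masking in patch 15 would just grow
one more bit, so L1 never sees the feature:

    #define REMOVED_EXEC_CONTROL_CAP (CPU_BASED_TPR_SHADOW \
                | CPU_BASED_ACTIVATE_MSR_BITMAP            \
                | CPU_BASED_ACTIVATE_IO_BITMAP             \
                | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)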


> direct access versus #PF vmexit; per physical-frame direct access
> versus nested-paging vmexit. In any of these cases the L1 may think

Didn't quite catch that. Direct memory access is always guarded by the L0 shadow or nested EPT/NPT. Am I missing something?

> it has given direct unfettered access to the L2, but L0 (Xen) is
> actually blocking it. In this case any resulting required instruction
> emulations have to be performed by Xen on behalf of the L2, without
> L1's help or knowledge. Consider that even in Xen we give all HVM
> guests direct access to thinks like port 0x80 for perf reasons. Maybe
> a VMM running in L1 would give direct access to more even than that
> -- in such cases Xen must be able to emulate those 'direct' accesses. 

Agree!

> 
> Now this shouldn't be hard to arrange anyhow. When you vmexit to Xen
> from a running L2 guest, the saved general-purpose CPU state will be
> the L2's state, and that is what you would want x86_emulate to see.
> But it does require some thought, and it is not merely an extension
> to be dealt with later. It is core VMX stuff and hence core nested
> VMX stuff. 

Yes. For the I/O issue, emulation by L0 should be fine.

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 04/16] vmx: nest: nested control structure
  2010-09-15 11:27   ` Christoph Egger
@ 2010-09-15 13:06     ` Dong, Eddie
  2010-09-15 13:17       ` Christoph Egger
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15 13:06 UTC (permalink / raw)
  To: Christoph Egger, xen-devel; +Cc: Dong, Eddie, He, Qing

Christoph Egger wrote:
> On Wednesday 08 September 2010 17:22:12 Qing He wrote:
>> v->arch.hvm_vmx.nest as control structure
>> 
>> Signed-off-by: Qing He <qing.he@intel.com>
>> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>> 
>> ---
>> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/nest.h
>> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
>> +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:03:41 2010
>> +0800 @@ -0,0 +1,45 @@ +/*
>> + * nest.h: nested virtualization for VMX.
>> + *
>> + * Copyright (c) 2010, Intel Corporation.
>> + * Author: Qing He <qing.he@intel.com>
>> + *
>> + * This program is free software; you can redistribute it and/or
>> modify it + * under the terms and conditions of the GNU General
>> Public License, + * version 2, as published by the Free Software
>> Foundation. + * + * This program is distributed in the hope it will
>> be useful, but WITHOUT + * ANY WARRANTY; without even the implied
>> warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE.
>> See the GNU General Public License for + * more details. + *
>> + * You should have received a copy of the GNU General Public
>> License along with + * this program; if not, write to the Free
>> Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston,
>> MA 02111-1307 USA. + * + */
>> +#ifndef __ASM_X86_HVM_NEST_H__
>> +#define __ASM_X86_HVM_NEST_H__
>> +
>> +struct vmcs_struct;
>> +
>> +struct vmx_nest_struct {
> 
> Is it ok to name it 'struct nestedvmx' ?

Fine, renamed to nested_vmx.

> 
>> +    paddr_t              guest_vmxon_pa;
>> +
>> +    /* Saved host vmcs for vcpu itself */
>> +    struct vmcs_struct  *hvmcs;
>> +
>> +    /*
>> +     * Guest's `current vmcs' of vcpu
>> +     *  - gvmcs_pa: guest VMCS region physical address
>> +     *  - vvmcs:    (guest) virtual vmcs
>> +     *  - svmcs:    effective vmcs for the guest of this vcpu
>> +     *  - valid:    launch state: invalid on clear, valid on ld +  
>> */ +    paddr_t              gvmcs_pa;
>> +    void                *vvmcs;
>> +    struct vmcs_struct  *svmcs;
>> +    int                  vmcs_valid;
>> +};
>> +
>> +#endif /* __ASM_X86_HVM_NEST_H__ */
>> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/vmcs.h
>> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:00:00 2010
>> +0800 +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:03:41
>>  2010 +0800 @@ -22,6 +22,7 @@ #include <asm/config.h>
>>  #include <asm/hvm/io.h>
>>  #include <asm/hvm/vpmu.h>
>> +#include <asm/hvm/vmx/nest.h>
>> 
>>  extern void vmcs_dump_vcpu(struct vcpu *v);
>>  extern void setup_vmcs_dump(void);
>> @@ -99,6 +100,9 @@
>>      u32                  secondary_exec_control;
>>      u32                  exception_bitmap;
>> 
>> +    /* nested virtualization */
>> +    struct vmx_nest_struct nest;
>> +
>>  #ifdef __x86_64__
>>      struct vmx_msr_state msr_state;
>>      unsigned long        shadow_gs;
> 
> 
> I think, the structure should be allocated in the
> nestedhvm_vcpu_initialise() function hook and assigned to the nh_arch
> pointer in struct nestedhvm. 

Well, the structure itself is pretty small, so dynamic allocation is really not a good idea.
Instead, the internal fields such as vvmcs/svmcs are big, so we use pointers and allocate them on demand.
This follows the style of arch_vmx_struct in the vcpu data structure.
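
As a minimal sketch of the on-demand allocation, assuming it is done at VMPTRLD time and that X86EMUL_UNHANDLEABLE is an acceptable failure policy (both assumptions, not taken from the patches):

/* Allocate the large pieces lazily so the embedded struct stays small. */
if ( nest->vvmcs == NULL )
{
    nest->vvmcs = alloc_xenheap_page();       /* virtual VMCS backing */
    if ( nest->vvmcs == NULL )
        return X86EMUL_UNHANDLEABLE;
    clear_page(nest->vvmcs);
}
if ( nest->svmcs == NULL )
{
    nest->svmcs = vmx_alloc_vmcs();           /* effective VMCS for L2 */
    if ( nest->svmcs == NULL )
        return X86EMUL_UNHANDLEABLE;
}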

I am fine with your nestedhvm_vcpu_initialise "design", but VMX doesn't need to use the wrapper so far.

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15 12:36                     ` Dong, Eddie
@ 2010-09-15 13:12                       ` Keir Fraser
  2010-09-20  3:13                         ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Keir Fraser @ 2010-09-15 13:12 UTC (permalink / raw)
  To: Dong, Eddie, Christoph Egger, xen-devel; +Cc: Tim Deegan, He, Qing

On 15/09/2010 13:36, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>> Actually it is an issue now. This has nothing to do with VT-d (ie.
>> IOMMU, irq remapping, etc) but with basic core VMX functionality --
>> per I/O port direct execute versus vmexit; per virtual-address page
> 
> I see; for the I/O port, right now we are letting L1 handle it even though it
> doesn't expect to :(
> How about removing the capability of CPU_BASED_ACTIVATE_IO_BITMAP from the L1
> VMM for now, to focus on the framework?

Well. It'd be better if it just worked, really, wouldn't it? :-) How hard can
it be?

>> direct access versus #PF vmexit; per physical-frame direct access
>> versus nested-paging vmexit. In any of these cases the L1 may think
> 
> Didn't quite catch that. The memory direct access is always guarded by L0
> shadow or nested EPT/NPT. Am I missing something?

L1 gives L2 direct access to, say, HPET (memory-mapped IO) which is actually
(unknown to L1) a virtual HPET emulated by Xen? Yeah, okay, that may be more
unlikely to happen in practice but it *is* allowable by the architecture and
it *should* be supported.

I would be inclined to add test cases for nestedhvm to hvmloader (we already
test various other tricky things in there) to test these kinds of cases.
Broadly speaking it's just a case of walking VVMCS structures to check
IO_BITMAP, or shadow pagetables, or EPT, and jump to the emulator with L2
state if the L1 would have permitted execution. It's really a core bit of
logic in properly doing nested VMX. The unfortunate thing is that the
necessary checks will slow down nested-hvm further, I guess, but perhaps
it's not too bad?
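
A sketch of that dispatch for the PIO case, in the style of the existing exit handler; nest_l1_wants_io_exit() (the bitmap walk sketched earlier in the thread) and nest_vmexit_to_l1() are hypothetical names, while handle_pio() is Xen's existing PIO emulation entry point:

    case EXIT_REASON_IO_INSTRUCTION:
        /* Vmexit taken while an L2 guest was running. */
        if ( nest_l1_wants_io_exit(nest, port, bytes) )
            nest_vmexit_to_l1(v, EXIT_REASON_IO_INSTRUCTION); /* reflect */
        else
            handle_pio(port, bytes, dir);  /* emulate in L0 w/ L2 state */
        break;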

 -- Keir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 04/16] vmx: nest: nested control structure
  2010-09-15 13:06     ` Dong, Eddie
@ 2010-09-15 13:17       ` Christoph Egger
  2010-09-15 13:31         ` Christoph Egger
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 13:17 UTC (permalink / raw)
  To: xen-devel; +Cc: Dong, Eddie, He, Qing

On Wednesday 15 September 2010 15:06:03 Dong, Eddie wrote:
> Christoph Egger wrote:
> > On Wednesday 08 September 2010 17:22:12 Qing He wrote:
> >> v->arch.hvm_vmx.nest as control structure
> >>
> >> Signed-off-by: Qing He <qing.he@intel.com>
> >> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> >>
> >> ---
> >> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/nest.h
> >> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> >> +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:03:41 2010
> >> +0800 @@ -0,0 +1,45 @@ +/*
> >> + * nest.h: nested virtualization for VMX.
> >> + *
> >> + * Copyright (c) 2010, Intel Corporation.
> >> + * Author: Qing He <qing.he@intel.com>
> >> + *
> >> + * This program is free software; you can redistribute it and/or
> >> modify it + * under the terms and conditions of the GNU General
> >> Public License, + * version 2, as published by the Free Software
> >> Foundation. + * + * This program is distributed in the hope it will
> >> be useful, but WITHOUT + * ANY WARRANTY; without even the implied
> >> warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE.
> >> See the GNU General Public License for + * more details. + *
> >> + * You should have received a copy of the GNU General Public
> >> License along with + * this program; if not, write to the Free
> >> Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston,
> >> MA 02111-1307 USA. + * + */
> >> +#ifndef __ASM_X86_HVM_NEST_H__
> >> +#define __ASM_X86_HVM_NEST_H__
> >> +
> >> +struct vmcs_struct;
> >> +
> >> +struct vmx_nest_struct {
> >
> > Is it ok to name it 'struct nestedvmx' ?
>
> Fine, renamed to nested_vmx.
>
> >> +    paddr_t              guest_vmxon_pa;
> >> +
> >> +    /* Saved host vmcs for vcpu itself */
> >> +    struct vmcs_struct  *hvmcs;
> >> +
> >> +    /*
> >> +     * Guest's `current vmcs' of vcpu
> >> +     *  - gvmcs_pa: guest VMCS region physical address
> >> +     *  - vvmcs:    (guest) virtual vmcs
> >> +     *  - svmcs:    effective vmcs for the guest of this vcpu
> >> +     *  - valid:    launch state: invalid on clear, valid on ld +
> >> */ +    paddr_t              gvmcs_pa;
> >> +    void                *vvmcs;
> >> +    struct vmcs_struct  *svmcs;
> >> +    int                  vmcs_valid;
> >> +};
> >> +
> >> +#endif /* __ASM_X86_HVM_NEST_H__ */
> >> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/vmcs.h
> >> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:00:00 2010
> >> +0800 +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:03:41
> >>  2010 +0800 @@ -22,6 +22,7 @@ #include <asm/config.h>
> >>  #include <asm/hvm/io.h>
> >>  #include <asm/hvm/vpmu.h>
> >> +#include <asm/hvm/vmx/nest.h>
> >>
> >>  extern void vmcs_dump_vcpu(struct vcpu *v);
> >>  extern void setup_vmcs_dump(void);
> >> @@ -99,6 +100,9 @@
> >>      u32                  secondary_exec_control;
> >>      u32                  exception_bitmap;
> >>
> >> +    /* nested virtualization */
> >> +    struct vmx_nest_struct nest;
> >> +
> >>  #ifdef __x86_64__
> >>      struct vmx_msr_state msr_state;
> >>      unsigned long        shadow_gs;
> >
> > I think, the structure should be allocated in the
> > nestedhvm_vcpu_initialise() function hook and assigned to the nh_arch
> > pointer in struct nestedhvm.
>
> Well, the structure itself is pretty small, so dynamic allocation is really
> not a good idea. Instead, the internal fields such as vvmcs/svmcs are big,
> so we use pointers and allocate them on demand. This follows
> the style of arch_vmx_struct in the vcpu data structure.
>
> I am fine with your nestedhvm_vcpu_initialise "design", but VMX doesn't
> need to use the wrapper so far.

It should be implemented well enough that vcpu creation doesn't fail.

Christoph



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 04/16] vmx: nest: nested control structure
  2010-09-15 13:17       ` Christoph Egger
@ 2010-09-15 13:31         ` Christoph Egger
  2010-09-15 13:46           ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 13:31 UTC (permalink / raw)
  To: xen-devel; +Cc: Dong, Eddie, He, Qing

On Wednesday 15 September 2010 15:17:42 Christoph Egger wrote:
> On Wednesday 15 September 2010 15:06:03 Dong, Eddie wrote:
> > Christoph Egger wrote:
> > > On Wednesday 08 September 2010 17:22:12 Qing He wrote:
> > >> v->arch.hvm_vmx.nest as control structure
> > >>
> > >> Signed-off-by: Qing He <qing.he@intel.com>
> > >> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> > >>
> > >> ---
> > >> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/nest.h
> > >> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> > >> +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:03:41 2010
> > >> +0800 @@ -0,0 +1,45 @@ +/*
> > >> + * nest.h: nested virtualization for VMX.
> > >> + *
> > >> + * Copyright (c) 2010, Intel Corporation.
> > >> + * Author: Qing He <qing.he@intel.com>
> > >> + *
> > >> + * This program is free software; you can redistribute it and/or
> > >> modify it + * under the terms and conditions of the GNU General
> > >> Public License, + * version 2, as published by the Free Software
> > >> Foundation. + * + * This program is distributed in the hope it will
> > >> be useful, but WITHOUT + * ANY WARRANTY; without even the implied
> > >> warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE.
> > >> See the GNU General Public License for + * more details. + *
> > >> + * You should have received a copy of the GNU General Public
> > >> License along with + * this program; if not, write to the Free
> > >> Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston,
> > >> MA 02111-1307 USA. + * + */
> > >> +#ifndef __ASM_X86_HVM_NEST_H__
> > >> +#define __ASM_X86_HVM_NEST_H__
> > >> +
> > >> +struct vmcs_struct;
> > >> +
> > >> +struct vmx_nest_struct {
> > >
> > > Is it ok to name it 'struct nestedvmx' ?
> >
> > Fine, renamed to nested_vmx.
> >
> > >> +    paddr_t              guest_vmxon_pa;
> > >> +
> > >> +    /* Saved host vmcs for vcpu itself */
> > >> +    struct vmcs_struct  *hvmcs;
> > >> +
> > >> +    /*
> > >> +     * Guest's `current vmcs' of vcpu
> > >> +     *  - gvmcs_pa: guest VMCS region physical address
> > >> +     *  - vvmcs:    (guest) virtual vmcs
> > >> +     *  - svmcs:    effective vmcs for the guest of this vcpu
> > >> +     *  - valid:    launch state: invalid on clear, valid on ld +
> > >> */ +    paddr_t              gvmcs_pa;
> > >> +    void                *vvmcs;
> > >> +    struct vmcs_struct  *svmcs;
> > >> +    int                  vmcs_valid;
> > >> +};
> > >> +
> > >> +#endif /* __ASM_X86_HVM_NEST_H__ */
> > >> diff -r fc4de5eedd1d xen/include/asm-x86/hvm/vmx/vmcs.h
> > >> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:00:00 2010
> > >> +0800 +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h	Wed Sep 08 21:03:41
> > >>  2010 +0800 @@ -22,6 +22,7 @@ #include <asm/config.h>
> > >>  #include <asm/hvm/io.h>
> > >>  #include <asm/hvm/vpmu.h>
> > >> +#include <asm/hvm/vmx/nest.h>
> > >>
> > >>  extern void vmcs_dump_vcpu(struct vcpu *v);
> > >>  extern void setup_vmcs_dump(void);
> > >> @@ -99,6 +100,9 @@
> > >>      u32                  secondary_exec_control;
> > >>      u32                  exception_bitmap;
> > >>
> > >> +    /* nested virtualization */
> > >> +    struct vmx_nest_struct nest;
> > >> +
> > >>  #ifdef __x86_64__
> > >>      struct vmx_msr_state msr_state;
> > >>      unsigned long        shadow_gs;
> > >
> > > I think, the structure should be allocated in the
> > > nestedhvm_vcpu_initialise() function hook and assigned to the nh_arch
> > > pointer in struct nestedhvm.
> >
> > Well, the structure itself is pretty small, so dynamic allocation is
> > really not a good idea.

It's not a question of size. The point is that it is opaque to non-vmx code.

> > Instead, the internal fields such as vvmcs/svmcs
> > are big, so we use pointers and allocate them on demand. This
> > follows the style of arch_vmx_struct in the vcpu data structure.

That's fine, though. However, you can reuse some fields of struct nestedhvm:

gvmcs_pa is functionally the same as nh_vmaddr
svmcs is functionally the same as nh_vm

> > I am fine with your nestedhvm_vcpu_initialise "design", but VMX doesn't
> > need to use the wrapper so far.
>
> It should be implemented well enough that vcpu creation doesn't fail.
>
> Christoph




^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 04/16] vmx: nest: nested control structure
  2010-09-15 13:31         ` Christoph Egger
@ 2010-09-15 13:46           ` Dong, Eddie
  2010-09-15 14:02             ` Christoph Egger
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15 13:46 UTC (permalink / raw)
  To: Christoph Egger, xen-devel; +Cc: Dong, Eddie, He, Qing


>>>>>  #ifdef __x86_64__
>>>>>      struct vmx_msr_state msr_state;
>>>>>      unsigned long        shadow_gs;
>>>> 
>>>> I think, the structure should be allocated in the
>>>> nestedhvm_vcpu_initialise() function hook and assigned to the
>>>> nh_arch pointer in struct nestedhvm.
>>> 
>>> Well, the structure itself is pretty small, so dynamic allocation is
>>> really not a good idea.
> 
> It's not a question of size. The point is that it is opaque to
> non-vmx code. 

I think it is anyway recycling again; that is not necessary.
Code outside of VMX should not access VMX-specific code in theory. If it needs some parameters, the vcpu can stand for all of them.

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 04/16] vmx: nest: nested control structure
  2010-09-15 13:46           ` Dong, Eddie
@ 2010-09-15 14:02             ` Christoph Egger
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 14:02 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: xen-devel, He, Qing

On Wednesday 15 September 2010 15:46:24 Dong, Eddie wrote:
> >>>>>  #ifdef __x86_64__
> >>>>>      struct vmx_msr_state msr_state;
> >>>>>      unsigned long        shadow_gs;
> >>>>
> >>>> I think, the structure should be allocated in the
> >>>> nestedhvm_vcpu_initialise() function hook and assigned to the
> >>>> nh_arch pointer in struct nestedhvm.
> >>>
> >>> Well, the structure itself is pretty small, so dynamic allocation is
> >>> really not a good idea.
> >
> > It's not a question of size. The point is that it is opaque to
> > non-vmx code.
>
> I think it is anyway recycling again; that is not necessary.
> Code outside of VMX should not access VMX-specific code in theory.

Also in practice. The point is that if that mistake happens, gcc will complain,
because the data structure is opaque to non-vmx code.
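
To illustrate the idiom being argued for (names here are illustrative, not the actual patch):

/* nestedhvm.h -- shared, vendor-neutral header */
struct nestedvmx;                      /* opaque: layout invisible here */

struct nestedhvm {
    paddr_t           nh_vmaddr;       /* guest VM control block address */
    struct nestedvmx *nh_arch;         /* allocated by the vendor hook */
};

/* vmx/nest.c -- only vmx code sees the full layout */
struct nestedvmx {
    paddr_t  guest_vmxon_pa;
    void    *vvmcs;
};

Common code may pass nh_arch around, but dereferencing it outside vmx code is a compile-time error, which is exactly the protection meant here.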

I already discussed exactly that point with Keir when he suggested
me to go your way but then agreed.
See:
http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01024.html
http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01028.html

Christoph



^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 03/16] vmx: nest: nested availability and status flags
  2010-09-15 11:43   ` Christoph Egger
@ 2010-09-15 14:18     ` Dong, Eddie
  0 siblings, 0 replies; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15 14:18 UTC (permalink / raw)
  To: Christoph Egger, xen-devel; +Cc: Dong, Eddie, He, Qing

Christoph Egger wrote:
> On Wednesday 08 September 2010 17:22:11 Qing He wrote:
>> These are the vendor neutral availability and status flags of nested
>> virtualization. 
>> 
>> The availability hvm parameter can be used to disable all reporting
>> and functions of nested, improving guest security in certain
>> circumstances. 
>> 
>> The per vcpu flag in_nesting is used to indicate fundamental status:
>> the current mode.
>> 
>> Signed-off-by: Qing He <qing.he@intel.com>
>> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>> 
>> ---
>> diff -r 11c98ab76326 xen/include/asm-x86/hvm/hvm.h
>> --- a/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 20:35:38 2010 +0800
>> +++ b/xen/include/asm-x86/hvm/hvm.h	Wed Sep 08 20:36:19 2010 +0800
>>  @@ -250,6 +250,10 @@ #define is_viridian_domain(_d)
>> \ (is_hvm_domain(_d) &&
>> ((_d)->arch.hvm_domain.params[HVM_PARAM_VIRIDIAN])) 
>> 
>> +#define is_nested_avail(_d)
>> \ + (is_hvm_domain(_d) &&
>> ((_d)->arch.hvm_domain.params[HVM_PARAM_NESTEDHVM])) +
>> +
> 
> That is functionally equal to nestedhvm_enabled() in my patch series.
> The is_hvm_domain() check is not necessary. The tools patch checks
> that nestedhvm is for hvm guests only.
> 
>>  void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int
>>                                     *ebx, unsigned int *ecx,
>>  unsigned int *edx); void hvm_migrate_timers(struct vcpu *v);
>> diff -r 11c98ab76326 xen/include/asm-x86/hvm/vcpu.h
>> --- a/xen/include/asm-x86/hvm/vcpu.h	Wed Sep 08 20:35:38 2010 +0800
>> +++ b/xen/include/asm-x86/hvm/vcpu.h	Wed Sep 08 20:36:19 2010 +0800
>>      @@ -71,6 +71,8 @@ bool_t              debug_state_latch;
>>      bool_t              single_step;
>> 
>> +    bool_t              in_nesting;
> 
> This is functionally equal to nestedhvm_vcpu_in_guestmode() in my patch
> series.
> 
>> +
>>      u64                 asid_generation;
>>      u32                 asid;
>> 
>> diff -r 11c98ab76326 xen/include/public/hvm/params.h
>> --- a/xen/include/public/hvm/params.h	Wed Sep 08 20:35:38 2010 +0800
>> +++ b/xen/include/public/hvm/params.h	Wed Sep 08 20:36:19 2010 +0800
>>  @@ -113,6 +113,9 @@ #define HVM_PARAM_CONSOLE_PFN    17
>>  #define HVM_PARAM_CONSOLE_EVTCHN 18
>> 
>> -#define HVM_NR_PARAMS          19
>> +/* Boolean: Enable nested virtualization (hvm only) */
>> +#define HVM_PARAM_NESTEDHVM    19
>> +
>> +#define HVM_NR_PARAMS          20
>> 
>>  #endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */
> 
> I already have this part in my tools patch.
> 
> Christoph

This part is one of the consensus items; we can merge it one day in the future.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 15/16] vmx: nest: capability reporting MSRs
  2010-09-15 10:05   ` Christoph Egger
@ 2010-09-15 14:28     ` Dong, Eddie
  2010-09-15 14:45       ` Christoph Egger
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-15 14:28 UTC (permalink / raw)
  To: Christoph Egger, xen-devel; +Cc: Dong, Eddie, He, Qing

Christoph Egger wrote:
> On Wednesday 08 September 2010 17:22:23 Qing He wrote:
>> handles VMX capability reporting MSRs.
>> Some features are masked so L1 would see a rather
>> simple configuration
>> 
>> Signed-off-by: Qing He <qing.he@intel.com>
>> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> 
> 
> Are there any vmx capability features that are read out via cpuid?
> If yes, then that code belongs in the tools patch.

That is in 16.txt; you can include it in your code. Once you fix the MAX LEAF issue, I can ack that patch.

> 
> In SVM the nestedhvm_vcpu_features hook is empty and for MSRs
> there are already two msr hooks namely msr_read_intercept and
> msr_write_intercept. I assume the functions below are all called from
> there directly or indirectly.

No, this interception is for the L1 guest.
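
For context, hiding a capability from L1 is a small tweak in that MSR read path; a sketch, with the choice of MSR and mask assumed rather than taken from the patch (the allowed-1 settings sit in the high 32 bits of the capability MSRs):

    case MSR_IA32_VMX_PROCBASED_CTLS:
        rdmsrl(MSR_IA32_VMX_PROCBASED_CTLS, data);
        /* e.g. hide the I/O bitmap capability from L1 */
        data &= ~((u64)CPU_BASED_ACTIVATE_IO_BITMAP << 32);
        break;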

> 
> That renders the nestedhvm_vcpu_features hook useless, and I will remove
> it.

A step toward my wish of a lightweight wrapper, glad to see! The more you remove, the more I can ack, until the point where only the necessary APIs are left, such as nested EPT/NPT and a heavily revisited interrupt injection API, after removing the new namespace.

Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 15/16] vmx: nest: capability reporting MSRs
  2010-09-15 14:28     ` Dong, Eddie
@ 2010-09-15 14:45       ` Christoph Egger
  2010-09-16 14:10         ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-15 14:45 UTC (permalink / raw)
  To: Dong, Eddie; +Cc: xen-devel, He, Qing

On Wednesday 15 September 2010 16:28:29 Dong, Eddie wrote:
> Christoph Egger wrote:
> > On Wednesday 08 September 2010 17:22:23 Qing He wrote:
> >> handles VMX capability reporting MSRs.
> >> Some features are masked so L1 would see a rather
> >> simple configuration
> >>
> >> Signed-off-by: Qing He <qing.he@intel.com>
> >> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> >
> > Are there any vmx capability features that are read out via cpuid?
> > If yes, then that code belongs in the tools patch.
>
> That is in 16.txt; you can include it in your code. Once you fix the MAX
> LEAF issue, I can ack that patch.

Andre tried to explain to you why the MAXLEAF change is not a problem.
Is there another problem?

>
> > In SVM the nestedhvm_vcpu_features hook is empty and for MSRs
> > there are already two msr hooks namely msr_read_intercept and
> > msr_write_intercept. I assume the functions below are all called from
> > there directly or indirectly.
>
> No, this interception is for the L1 guest.

Yes, that is how I understood it.
What I mean is that you can call the functions from
vmx_msr_read_intercept/vmx_msr_write_intercept.

>
> > That renders the nestedhvm_vcpu_features hook useless, and I will remove
> > it.
>
> A step toward my wish of a lightweight wrapper, glad to see!

I would have removed it earlier if you had told me what I said above,
which is what I wanted to know.

> The more you remove, the more I can ack, until the point where only the
> necessary APIs are left, such as nested EPT/NPT and a heavily revisited
> interrupt injection API, after removing the new namespace.

When you tell me exactly which adjustments you need, I will do the changes
w/o breaking SVM, of course.

Christoph



^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 15/16] vmx: nest: capability reporting MSRs
  2010-09-15 14:45       ` Christoph Egger
@ 2010-09-16 14:10         ` Dong, Eddie
  0 siblings, 0 replies; 68+ messages in thread
From: Dong, Eddie @ 2010-09-16 14:10 UTC (permalink / raw)
  To: Christoph Egger; +Cc: xen-devel, Dong, Eddie, He, Qing

Christoph Egger wrote:
> On Wednesday 15 September 2010 16:28:29 Dong, Eddie wrote:
>> Christoph Egger wrote:
>>> On Wednesday 08 September 2010 17:22:23 Qing He wrote:
>>>> handles VMX capability reporting MSRs.
>>>> Some features are masked so L1 would see a rather
>>>> simple configuration
>>>> 
>>>> Signed-off-by: Qing He <qing.he@intel.com>
>>>> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>>> 
>>> Are there any vmx capability features that are read out via cpuid?
>>> If yes, then that code belongs in the tools patch.
>> 
>> That is in 16.txt; you can include it in your code. Once you fix the
>> MAX LEAF issue, I can ack that patch.
> 
> Andre tried to explain to you why the MAXLEAF change is not a problem.
> Is there another problem?
> 

I replied to him and figured out the problem.

It assumes future Intel processors won't implement more than the 0x8...8 leaf, which is not a good thing to do.
And the name MAX_LEAF is then wrong for Intel CPUs.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-15 13:12                       ` Keir Fraser
@ 2010-09-20  3:13                         ` Dong, Eddie
  2010-09-20  8:08                           ` Keir Fraser
  0 siblings, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-20  3:13 UTC (permalink / raw)
  To: Keir Fraser, Christoph Egger, xen-devel; +Cc: Tim Deegan, Dong, Eddie, He, Qing

Keir Fraser wrote:
> On 15/09/2010 13:36, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> 
>>> Actually it is an issue now. This has nothing to do with VT-d (ie.
>>> IOMMU, irq remapping, etc) but with basic core VMX functionality --
>>> per I/O port direct execute versus vmexit; per virtual-address page
>> 
>> I see; for the I/O port, right now we are letting L1 handle it even
>> though it doesn't expect to :( How about removing the capability of
>> CPU_BASED_ACTIVATE_IO_BITMAP from the L1 VMM for now, to focus on the framework?
> 
> Well. It'd be better if it just worked, really, wouldn't it? :-) How hard
> can it be?

You are right. It is easy to do, but we have a dilemma: either write-protect the guest I/O bitmap page, or create the shadow I/O bitmap at each vmresume of the L2 guest.

Currently we are injecting to the L1 guest, but that may not be correct in theory. For now, VMX can trap L2 guest I/O and emulate it in L0; we can revisit later to see if we need write-protection of the guest I/O bitmap page :)

But, yes, the L0 VMM needs to emulate the L2 instruction here :)

The MSR bitmap has a similar situation. Currently VMX removes the MSR bitmap feature, but we may handle it like the I/O bitmap and write-protect the page, though it is slightly more complicated.
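
For concreteness, the rebuild-at-vmresume alternative amounts to merging L0's and L1's intentions; a sketch (function name assumed), called once per 4K bitmap page:

/* A port must exit if either L0 or L1 wants it to, so the shadow
 * bitmap is the bitwise OR of the two.  Doing this on every L2
 * vmresume is what makes write-protection attractive instead. */
static void nest_merge_io_bitmap(uint8_t *shadow, const uint8_t *l0_bm,
                                 const uint8_t *l1_bm)
{
    unsigned int i;

    for ( i = 0; i < PAGE_SIZE; i++ )
        shadow[i] = l0_bm[i] | l1_bm[i];
}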

> 
>>> direct access versus #PF vmexit; per physical-frame direct access
>>> versus nested-paging vmexit. In any of these cases the L1 may think
>> 
>> Didn't quite catch that. The memory direct access is always guarded by L0
>> shadow or nested EPT/NPT. Am I missing something?
> 
> L1 gives L2 direct access to, say, HPET (memory-mapped IO) which is
> actually (unknown to L1) a virtual HPET emulated by Xen? Yeah, okay,
> that may be more unlikely to happen in practice but it *is* allowable
> by the architecture and it *should* be supported.

Agree, thanks!

> 
> I would be inclined to add test cases for nestedhvm to hvmloader (we
> already test various other tricky things in there) to test these
> kinds of cases. Broadly speaking it's just a case of walking VVMCS
> structures to check IO_BITMAP, or shadow pagetables, or EPT, and jump
> to the emulator with L2 state if the L1 would have permitted
> execution. It's really a core bit of logic in properly doing nested
> VMX. The unfortunate thing is that the necessary checks will slow
> down nested-hvm further, I guess, but perhaps it's not too bad?

Agree. Thanks.
We need write-protection; otherwise generating the shadow bitmap is expensive. Checking the bitmap at I/O exit is fine.

Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 08/16] vmx: nest: vmresume/vmlaunch
  2010-09-15 11:30     ` Christoph Egger
@ 2010-09-20  5:19       ` Dong, Eddie
  0 siblings, 0 replies; 68+ messages in thread
From: Dong, Eddie @ 2010-09-20  5:19 UTC (permalink / raw)
  To: Christoph Egger, xen-devel; +Cc: Dong, Eddie, He, Qing

Christoph Egger wrote:
> On Wednesday 15 September 2010 11:52:26 Christoph Egger wrote:
>> On Wednesday 08 September 2010 17:22:16 Qing He wrote:
>>> vmresume and vmlaunch instructions and transitional states
>>> 
>>> Signed-off-by: Qing He <qing.he@intel.com>
>>> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>>> 
>>> ---
>>> 
>>> diff -r e828d55c10bb xen/arch/x86/hvm/vmx/nest.c
>>> --- a/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 21:42:10 2010 +0800
>>> +++ b/xen/arch/x86/hvm/vmx/nest.c	Wed Sep 08 22:04:16 2010 +0800 @@
>>>      -633,3 +633,33 @@ hvm_inject_exception(TRAP_invalid_op, 0, 0);
>>>      return X86EMUL_EXCEPTION;
>>>  }
>>> +
>>> +int vmx_nest_handle_vmresume(struct cpu_user_regs *regs) +{
>>> +    struct vcpu *v = current;
>>> +    struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest; +    int
>>> rc; +
>>> +    if ( unlikely(!nest->guest_vmxon_pa) )
>>> +        goto invalid_op;
>>> +
>>> +    rc = vmx_inst_check_privilege(regs);
>>> +    if ( rc != X86EMUL_OKAY )
>>> +        return rc;
>>> +
>>> +    if ( nest->vmcs_valid == 1 )
>>> +        nest->vmresume_pending = 1;
>>> +    else
>>> +        vmreturn(regs, VMFAIL_INVALID);
>>> +
>>> +    return X86EMUL_OKAY;
>>> +
>>> +invalid_op:
>>> +    hvm_inject_exception(TRAP_invalid_op, 0, 0);
>>> +    return X86EMUL_EXCEPTION;
>>> +}
>>> +
>>> +int vmx_nest_handle_vmlaunch(struct cpu_user_regs *regs) +{
>>> +    return vmx_nest_handle_vmresume(regs);
>>> +}
>>> diff -r e828d55c10bb xen/arch/x86/hvm/vmx/vmx.c
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 21:42:10 2010 +0800
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c	Wed Sep 08 22:04:16 2010 +0800 @@
>>>      -2321,6 +2321,11 @@ /* Now enable interrupts so it's safe to
>>> take locks. */      local_irq_enable(); 
>>> 
>>> +    /* XXX: This looks ugly, but we need a mechanism to ensure
>>> +     * any pending vmresume has really happened
>>> +     */
>>> +    v->arch.hvm_vmx.nest.vmresume_in_progress = 0; +
>>>      if ( unlikely(exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) )
>>>          return vmx_failed_vmentry(exit_reason, regs);
>>> 
>>> @@ -2592,6 +2597,11 @@
>>>          if ( vmx_nest_handle_vmclear(regs) == X86EMUL_OKAY )
>>>              __update_guest_eip(inst_len);
>>>          break;
>>> +    case EXIT_REASON_VMLAUNCH:
>>> +        inst_len = __get_instruction_length();
>>> +        if ( vmx_nest_handle_vmlaunch(regs) == X86EMUL_OKAY )
>>> +            __update_guest_eip(inst_len);
>>> +        break;
>>>      case EXIT_REASON_VMPTRLD:
>>>          inst_len = __get_instruction_length();
>>>          if ( vmx_nest_handle_vmptrld(regs) == X86EMUL_OKAY ) @@
>>>          -2607,6 +2617,11 @@ if ( vmx_nest_handle_vmread(regs) ==
>>>              X86EMUL_OKAY ) __update_guest_eip(inst_len);
>>>          break;
>>> +    case EXIT_REASON_VMRESUME:
>>> +        inst_len = __get_instruction_length();
>>> +        if ( vmx_nest_handle_vmresume(regs) == X86EMUL_OKAY )
>>> +            __update_guest_eip(inst_len);
>>> +        break;
>>>      case EXIT_REASON_VMWRITE:
>>>          inst_len = __get_instruction_length();
>>>          if ( vmx_nest_handle_vmwrite(regs) == X86EMUL_OKAY ) @@
>>> -2625,8 +2640,6 @@ 
>>> 
>>>      case EXIT_REASON_MWAIT_INSTRUCTION:
>>>      case EXIT_REASON_MONITOR_INSTRUCTION:
>>> -    case EXIT_REASON_VMLAUNCH:
>>> -    case EXIT_REASON_VMRESUME:
>>>          vmx_inject_hw_exception(TRAP_invalid_op,
>>> HVM_DELIVER_NO_ERROR_CODE); break;
>>> 
>>> diff -r e828d55c10bb xen/include/asm-x86/hvm/vmx/nest.h
>>> --- a/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 21:42:10 2010
>>> +0800 +++ b/xen/include/asm-x86/hvm/vmx/nest.h	Wed Sep 08 22:04:16
>>>      2010 +0800 @@ -40,6 +40,20 @@ void                *vvmcs;
>>>      struct vmcs_struct  *svmcs;
>>>      int                  vmcs_valid;
>>> +
>>> +    /*
>>> +     * vmexit_pending and vmresume_pending is to mark pending
>>> +     * switches, they are cleared when physical vmcs is changed. +
>>> */ +    int                  vmexit_pending;
>>> +    int                  vmresume_pending;
>> 
>> This is functionally equal to the vmentry flag in struct nestedhvm.
> 
> Is it possible to have both vmexit and vmresume pending at the same
> time ? 
> 

Should not be. We may be able to use a single variable for the multiple states.
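
Something like the following would fold the two flags into one variable (illustrative names only):

/* At most one transition can be pending at a time, so a single
 * state variable suffices. */
enum nest_transition {
    NEST_TRANSITION_NONE = 0,
    NEST_VMRESUME_PENDING,     /* L1 -> L2 entry requested */
    NEST_VMEXIT_PENDING,       /* L2 -> L1 exit requested */
};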

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-20  3:13                         ` Dong, Eddie
@ 2010-09-20  8:08                           ` Keir Fraser
  2010-09-20  9:33                             ` Dong, Eddie
  2010-09-20  9:41                             ` Christoph Egger
  0 siblings, 2 replies; 68+ messages in thread
From: Keir Fraser @ 2010-09-20  8:08 UTC (permalink / raw)
  To: Dong, Eddie, Christoph Egger, xen-devel; +Cc: Tim Deegan, He, Qing

On 20/09/2010 04:13, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>>>> Actually it is an issue now. This has nothing to do with VT-d (ie.
>>>> IOMMU, irq remapping, etc) but with basic core VMX functionality --
>>>> per I/O port direct execute versus vmexit; per virtual-address page
>>> 
>>> I see; for the I/O port, right now we are letting L1 handle it even
>>> though it doesn't expect to :( How about removing the capability of
>>> CPU_BASED_ACTIVATE_IO_BITMAP from the L1 VMM for now, to focus on the framework?
>> 
>> Well. It'd be better if it just worked, really, wouldn't it? :-) How hard
>> can it be?
> 
> You are right. It is easy to do, but we have a dilemma: either write-protect
> the guest I/O bitmap page, or create the shadow I/O bitmap at each
> vmresume of the L2 guest.

You need that anyway don't you, regardless of whether you are accurately
deciding whether to inject-to-L1 or emulate-L2 on vmexit to L0? Whether you
inject or emulate, ports that L1 has disallowed for L2 must be properly
represented in the shadow I/O bitmap page.

> Currently we are injecting to the L1 guest, but that may not be correct in
> theory. For now, VMX can trap L2 guest I/O and emulate it in L0; we can
> revisit later to see if we need write-protection of the guest I/O bitmap page :)

Are you suggesting to always emulate instead of always inject-to-L1? That's
still not accurate virtualisation of this VMX feature.

Hmm... Are you currently setting up to always vmexit on I/O port accesses by
L2? Even if you are, that doesn't stop you looking at the virtual I/O bitmap
in your L0 vmexit handler, and doing the right thing (emulate versus
inject-to-L1).

 -- Keir

> But, yes, the L0 VMM needs to emulate the L2 instruction here :)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-20  8:08                           ` Keir Fraser
@ 2010-09-20  9:33                             ` Dong, Eddie
  2010-09-20  9:41                               ` Keir Fraser
  2010-09-20  9:41                             ` Christoph Egger
  1 sibling, 1 reply; 68+ messages in thread
From: Dong, Eddie @ 2010-09-20  9:33 UTC (permalink / raw)
  To: Keir Fraser, Christoph Egger, xen-devel; +Cc: Tim Deegan, Dong, Eddie, He, Qing

Keir Fraser wrote:
> On 20/09/2010 04:13, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> 
>>>>> Actually it is an issue now. This has nothing to do with VT-d (ie.
>>>>> IOMMU, irq remapping, etc) but with basic core VMX functionality
>>>>> -- per I/O port direct execute versus vmexit; per virtual-address
>>>>> page 
>>>> 
>>>> I see; for the I/O port, right now we are letting L1 handle it even
>>>> though it doesn't expect to :( How about removing the capability
>>>> of CPU_BASED_ACTIVATE_IO_BITMAP from the L1 VMM for now, to focus on
>>>> the framework?
>>> 
>>> Well. It'd be better if it just worked, really, wouldn't it? :-) How
>>> hard can it be?
>> 
>> You are right. It is easy to do, but we have a dilemma: either
>> write-protect the guest I/O bitmap page, or create the shadow
>> I/O bitmap at each vmresume of the L2 guest.
> 
> You need that anyway don't you, regardless of whether you are
> accurately deciding whether to inject-to-L1 or emulate-L2 on vmexit
> to L0? Whether you inject or emulate, ports that L1 has disallowed
> for L2 must be properly represented in the shadow I/O bitmap page.

VMX has an "always exit" feature for PIO which doesn't use the I/O bitmap.


> 
>> Currently we are injecting to the L1 guest, but that may not be correct in
>> theory. For now, VMX can trap L2 guest I/O and emulate it in L0;
>> we can revisit later to see if we need write-protection of
>> the guest I/O bitmap page :)
> 
> Are you suggesting to always emulate instead of always inject-to-L1?
> That's still not accurate virtualisation of this VMX feature.

L2 PIO always exits to L0. So we either inject to L1 or emulate it in L0, based on L1's I/O exiting and bitmap settings.

> 
> Hmm... Are you currently setting up to always vmexit on I/O port
> accesses by L2? Even if you are, that doesn't stop you looking at the

Yes.

> virtual I/O bitmap in your L0 vmexit handler, and doing the

No, we check the L1 settings.

> right thing (emulate versus inject-to-L1).
> 

BTW, has the SVM side already implemented write-protection of the I/O bitmap & MSR bitmap? It seems not.


Thx, Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-20  8:08                           ` Keir Fraser
  2010-09-20  9:33                             ` Dong, Eddie
@ 2010-09-20  9:41                             ` Christoph Egger
  2010-09-20 13:14                               ` Dong, Eddie
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Egger @ 2010-09-20  9:41 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tim Deegan, xen-devel, Dong, Eddie, He, Qing

On Monday 20 September 2010 10:08:02 Keir Fraser wrote:
> On 20/09/2010 04:13, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> >>>> Actually it is an issue now. This has nothing to do with VT-d (ie.
> >>>> IOMMU, irq remapping, etc) but with basic core VMX functionality --
> >>>> per I/O port direct execute versus vmexit; per virtual-address page
> >>>
> >>> I see; for the I/O port, right now we are letting L1 handle it even
> >>> though it doesn't expect to :( How about removing the capability of
> >>> CPU_BASED_ACTIVATE_IO_BITMAP from the L1 VMM for now, to focus on the framework?
> >>
> >> Well. It'd be better if it just worked, really, wouldn't it? :-) How hard
> >> can it be?
> >
> > You are right. It is easy to do, but we have a dilemma: either
> > write-protect the guest I/O bitmap page, or create the shadow I/O
> > bitmap at each vmresume of the L2 guest.
>
> You need that anyway don't you, regardless of whether you are accurately
> deciding whether to inject-to-L1 or emulate-L2 on vmexit to L0? Whether you
> inject or emulate, ports that L1 has disallowed for L2 must be properly
> represented in the shadow I/O bitmap page.

You need to do additional range-checking to determine if the guest actually
touched the IO bitmap page in case Xen uses a super page.
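
A sketch of that range check (field names assumed), run from the write-fault handler when the protected mapping is a superpage:

/* Only refresh the shadow bitmap when the faulting address really
 * falls inside one of L1's two I/O bitmap pages. */
static bool_t fault_hits_io_bitmap(paddr_t gpa, paddr_t bitmap_a_pa,
                                   paddr_t bitmap_b_pa)
{
    return ((gpa & PAGE_MASK) == bitmap_a_pa) ||
           ((gpa & PAGE_MASK) == bitmap_b_pa);
}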

>
> > Currently we are injecting to the L1 guest, but that may not be correct in
> > theory. For now, VMX can trap L2 guest I/O and emulate it in L0; we can
> > revisit later to see if we need write-protection of the guest I/O bitmap
> > page :)
>
> Are you suggesting to always emulate instead of always inject-to-L1? That's
> still not accurate virtualisation of this VMX feature.
>
> Hmm... Are you currently setting up to always vmexit on I/O port accesses
> by L2? Even if you are, that doesn't stop you looking at the virtual I/O
> bitmap in your L0 vmexit handler, and doing the right thing (emulate
> versus inject-to-L1).
>
>  -- Keir
>
> But, yes, the L0 VMM needs to emulate the L2 instruction here :)




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-20  9:33                             ` Dong, Eddie
@ 2010-09-20  9:41                               ` Keir Fraser
  2010-09-20 13:10                                 ` Dong, Eddie
  0 siblings, 1 reply; 68+ messages in thread
From: Keir Fraser @ 2010-09-20  9:41 UTC (permalink / raw)
  To: Dong, Eddie, Christoph Egger, xen-devel; +Cc: Tim Deegan, He, Qing

On 20/09/2010 10:33, "Dong, Eddie" <eddie.dong@intel.com> wrote:

>> Are you suggesting to always emulate instead of always inject-to-L1?
>> That's still not accurate virtualisation of this VMX feature.
> 
> L2 PIO always exits to L0. So we either inject to L1 or emulate it in
> L0, based on L1's I/O exiting and bitmap settings.
> 
>> 
>> Hmm... Are you currently setting up to always vmexit on I/O port
>> accesses by L2? Even if you are, that doesn't stop you looking at the
> 
> Yes.
> 
>> virtual I/O bitmap in your L0 vmexit handler, and doing the
> 
> No, we check the L1 settings.

Right, okay, I think this issue about write-protecting the I/O bitmap and
maintaining an accurate shadow bitmap is entirely orthogonal to the issue of
emulation versus injection. The former issue is a performance optimisation
only afaics, and I wouldn't care about that in a first checkin of nestedhvm.

 -- Keir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-20  9:41                               ` Keir Fraser
@ 2010-09-20 13:10                                 ` Dong, Eddie
  0 siblings, 0 replies; 68+ messages in thread
From: Dong, Eddie @ 2010-09-20 13:10 UTC (permalink / raw)
  To: Keir Fraser, Christoph Egger, xen-devel; +Cc: Tim Deegan, Dong, Eddie, He, Qing

Keir Fraser wrote:
> On 20/09/2010 10:33, "Dong, Eddie" <eddie.dong@intel.com> wrote:
> 
>>> Are you suggesting to always emulate instead of always inject-to-L1?
>>> That's still not accurate virtualisation of this VMX feature.
>> 
>> L2 PIO always exits to L0. So we either inject to L1 or
>> emulate it in L0, based on L1's I/O exiting and bitmap settings.
>> 
>>> 
>>> Hmm... Are you currently setting up to always vmexit on I/O port
>>> accesses by L2? Even if you are, that doesn't stop you looking at
>>> the 
>> 
>> Yes.
>> 
>>> virtual I/O bitmap from in your L0 vmexit handler, and doing the
>> 
>> No, we check the L1 settings.
> 
> Right, okay, I think this issue about write-protecting the I/O bitmap
> and maintaining an accurate shadow bitmap is entirely orthogonal to
> the issue of emulation versus injection. The former issue is a
> performance optimisation only afaics, and I wouldn't care about that
> in a first checkin of nestedhvm. 

Yes, that is what I mean. I fixed it locally w/ emulation in L0 + "always exit", and we may enhance it later by maintaining an accurate bitmap.

Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH 06/16] vmx: nest: handling VMX instruction exits
  2010-09-20  9:41                             ` Christoph Egger
@ 2010-09-20 13:14                               ` Dong, Eddie
  0 siblings, 0 replies; 68+ messages in thread
From: Dong, Eddie @ 2010-09-20 13:14 UTC (permalink / raw)
  To: Christoph Egger, Keir Fraser; +Cc: Deegan

Christoph Egger wrote:
> On Monday 20 September 2010 10:08:02 Keir Fraser wrote:
>> On 20/09/2010 04:13, "Dong, Eddie" <eddie.dong@intel.com> wrote:
>>>>>> Actually it is an issue now. This has nothing to do with VT-d
>>>>>> (ie. IOMMU, irq remapping, etc) but with basic core VMX
>>>>>> functionality -- per I/O port direct execute versus vmexit; per
>>>>>> virtual-address page 
>>>>> 
>>>>> I see; for the I/O port, right now we are letting L1 handle it even
>>>>> though it doesn't expect to :( How about removing the capability
>>>>> of CPU_BASED_ACTIVATE_IO_BITMAP from the L1 VMM for now, to focus on
>>>>> the framework?
>>>> 
>>>> Well. It'd be better if it just worked, really, wouldn't it? :-) How
>>>> hard can it be?
>>> 
>>> You are right. It is easy to do, but we have a dilemma: either
>>> write-protect the guest I/O bitmap page, or create the shadow
>>> I/O bitmap at each vmresume of the L2 guest.
>> 
>> You need that anyway don't you, regardless of whether you are
>> accurately deciding whether to inject-to-L1 or emulate-L2 on vmexit
>> to L0? Whether you inject or emulate, ports that L1 has disallowed
>> for L2 must be properly represented in the shadow I/O bitmap page.
> 
> You need to do additional range-checking to determine if the guest
> actually touched the IO bitmap page in case Xen uses a super page.
> 

We may have many alternatives to this. If we treat this address space as MMIO, we can hook a handler for MMIO emulation.
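
A sketch of that alternative, modelled loosely on Xen's existing MMIO handler tables (vlapic/hpet); the hook name and the iobitmap_pa[] cache are assumptions:

/* Claim writes landing in L1's I/O bitmap pages, so they are emulated
 * (and the shadow bitmap refreshed) rather than repeatedly faulting. */
static int nest_iobitmap_check(struct vcpu *v, unsigned long gpa)
{
    const struct vmx_nest_struct *nest = &v->arch.hvm_vmx.nest;

    return ((gpa & PAGE_MASK) == nest->iobitmap_pa[0]) ||
           ((gpa & PAGE_MASK) == nest->iobitmap_pa[1]);
}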

Eddie

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2010-09-20 13:14 UTC | newest]

Thread overview: 68+ messages
2010-09-08 15:22 [PATCH 00/16] Nested virtualization for VMX Qing He
2010-09-08 15:22 ` [PATCH 01/16] vmx: nest: rename host_vmcs Qing He
2010-09-10 13:27   ` Christoph Egger
2010-09-08 15:22 ` [PATCH 02/16] vmx: nest: wrapper for control update Qing He
2010-09-10 13:29   ` Christoph Egger
2010-09-08 15:22 ` [PATCH 03/16] vmx: nest: nested availability and status flags Qing He
2010-09-15 11:43   ` Christoph Egger
2010-09-15 14:18     ` Dong, Eddie
2010-09-08 15:22 ` [PATCH 04/16] vmx: nest: nested control structure Qing He
2010-09-09  6:13   ` Dong, Eddie
2010-09-15 11:27   ` Christoph Egger
2010-09-15 13:06     ` Dong, Eddie
2010-09-15 13:17       ` Christoph Egger
2010-09-15 13:31         ` Christoph Egger
2010-09-15 13:46           ` Dong, Eddie
2010-09-15 14:02             ` Christoph Egger
2010-09-08 15:22 ` [PATCH 05/16] vmx: nest: virtual vmcs layout Qing He
2010-09-13 10:29   ` Tim Deegan
2010-09-08 15:22 ` [PATCH 06/16] vmx: nest: handling VMX instruction exits Qing He
2010-09-10  7:05   ` Dong, Eddie
2010-09-13 11:11     ` Tim Deegan
2010-09-13 14:29       ` Dong, Eddie
2010-09-13 14:46         ` Tim Deegan
2010-09-13 11:10   ` Tim Deegan
2010-09-15  4:55     ` Dong, Eddie
2010-09-15  6:40       ` Keir Fraser
2010-09-15  6:49         ` Dong, Eddie
2010-09-15  7:31           ` Keir Fraser
2010-09-15  8:15             ` Christoph Egger
2010-09-15  8:23               ` Keir Fraser
2010-09-15  9:08                 ` Dong, Eddie
2010-09-15 11:39                   ` Keir Fraser
2010-09-15 12:36                     ` Dong, Eddie
2010-09-15 13:12                       ` Keir Fraser
2010-09-20  3:13                         ` Dong, Eddie
2010-09-20  8:08                           ` Keir Fraser
2010-09-20  9:33                             ` Dong, Eddie
2010-09-20  9:41                               ` Keir Fraser
2010-09-20 13:10                                 ` Dong, Eddie
2010-09-20  9:41                             ` Christoph Egger
2010-09-20 13:14                               ` Dong, Eddie
2010-09-15  7:17         ` Qing He
2010-09-15  7:38           ` Keir Fraser
2010-09-15  7:56             ` Dong, Eddie
2010-09-15  8:15               ` Keir Fraser
2010-09-15  9:26                 ` Tim Deegan
2010-09-15  9:56                   ` Dong, Eddie
2010-09-15 11:46                     ` Keir Fraser
2010-09-08 15:22 ` [PATCH 07/16] vmx: nest: switch current vmcs Qing He
2010-09-08 15:22 ` [PATCH 08/16] vmx: nest: vmresume/vmlaunch Qing He
2010-09-15  9:52   ` Christoph Egger
2010-09-15 11:30     ` Christoph Egger
2010-09-20  5:19       ` Dong, Eddie
2010-09-08 15:22 ` [PATCH 09/16] vmx: nest: shadow controls Qing He
2010-09-08 15:22 ` [PATCH 10/16] vmx: nest: L1 <-> L2 context switch Qing He
2010-09-08 15:22 ` [PATCH 11/16] vmx: nest: interrupt handling Qing He
2010-09-08 15:22 ` [PATCH 12/16] vmx: nest: VMExit handler in L2 Qing He
2010-09-08 15:22 ` [PATCH 13/16] vmx: nest: L2 tsc Qing He
2010-09-08 15:22 ` [PATCH 14/16] vmx: nest: CR0.TS and #NM Qing He
2010-09-08 15:22 ` [PATCH 15/16] vmx: nest: capability reporting MSRs Qing He
2010-09-13 12:45   ` Tim Deegan
2010-09-15 10:05   ` Christoph Egger
2010-09-15 14:28     ` Dong, Eddie
2010-09-15 14:45       ` Christoph Egger
2010-09-16 14:10         ` Dong, Eddie
2010-09-08 15:22 ` [PATCH 16/16] vmx: nest: expose cpuid and CR4.VMXE Qing He
2010-09-15  9:43   ` Christoph Egger
2010-09-13 13:10 ` [PATCH 00/16] Nested virtualization for VMX Tim Deegan
