Re: Nested virtualization off VMware vSphere 6.0 with EL6 guests crashes on Xen 4.6

From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: andrew.cooper3@citrix.com, kevin.tian@intel.com,
	wim.coekaerts@oracle.com, jun.nakajima@intel.com,
	xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: Nested virtualization off VMware vSphere 6.0 with EL6 guests crashes on Xen 4.6
Date: Fri, 15 Jan 2016 16:39:58 -0500
Message-ID: <20160115213958.GA16118@char.us.oracle.com>
In-Reply-To: <5694D3CB02000078000C5D00@prv-mh.provo.novell.com>

On Tue, Jan 12, 2016 at 02:22:03AM -0700, Jan Beulich wrote:
> >>> On 12.01.16 at 04:38, <konrad.wilk@oracle.com> wrote:
> > (XEN) Assertion 'vapic_pg && !p2m_is_paging(p2mt)' failed at vvmx.c:698
> > (XEN) ----[ Xen-4.6.0  x86_64  debug=y  Tainted:    C ]----
> > (XEN) CPU:    39
> > (XEN) RIP:    e008:[<ffff82d0801ed053>] virtual_vmentry+0x487/0xac9
> > (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor (d1v3)
> > (XEN) rax: 0000000000000000   rbx: ffff83007786c000   rcx: 0000000000000000
> > (XEN) rdx: 0000000000000e00   rsi: 000fffffffffffff   rdi: ffff83407f81e010
> > (XEN) rbp: ffff834008a47ea8   rsp: ffff834008a47e38   r8: 0000000000000000
> > (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
> > (XEN) r12: 0000000000000000   r13: ffff82c000341000   r14: ffff834008a47f18
> > (XEN) r15: ffff83407f7c4000   cr0: 0000000080050033   cr4: 00000000001526e0
> > (XEN) cr3: 000000407fb22000   cr2: 0000000000000000
> > (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> > (XEN) Xen stack trace from rsp=ffff834008a47e38:
> > (XEN)    ffff834008a47e68 ffff82d0801d2cde ffff834008a47e68 0000000000000d00
> > (XEN)    0000000000000000 0000000000000000 ffff834008a47e88 00000004801cc30e
> > (XEN)    ffff83007786c000 ffff83007786c000 ffff834008a40000 0000000000000000
> > (XEN)    ffff834008a47f18 0000000000000000 ffff834008a47f08 ffff82d0801edf94
> > (XEN)    ffff834008a47ef8 0000000000000000 ffff834008f62000 ffff834008a47f18
> > (XEN)    000000ae8c99eb8d ffff83007786c000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 ffff82d0801ee2ab
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > (XEN)    00000000078bfbff 0000000000000000 0000000000000000 0000beef0000beef
> > (XEN)    fffffffffc4b3440 000000bf0000beef 0000000000040046 fffffffffc607f00
> > (XEN)    000000000000beef 000000000000beef 000000000000beef 000000000000beef
> > (XEN)    000000000000beef 0000000000000027 ffff83007786c000 0000006f88716300
> > (XEN)    0000000000000000
> > (XEN) Xen call trace:
> > (XEN)    [<ffff82d0801ed053>] virtual_vmentry+0x487/0xac9
> > (XEN)    [<ffff82d0801edf94>] nvmx_switch_guest+0x8ff/0x915
> > (XEN)    [<ffff82d0801ee2ab>] vmx_asm_vmexit_handler+0x4b/0xc0
> > (XEN)
> > (XEN)
> > (XEN) ****************************************
> > (XEN) Panic on CPU 39:
> > (XEN) Assertion 'vapic_pg && !p2m_is_paging(p2mt)' failed at vvmx.c:698
> > (XEN) ****************************************
> > (XEN)
> > 
> > ..and then to my surprise the hypervisor stopped hitting this.
> 
> Since we can (I hope) pretty much exclude a paging type, the
> ASSERT() must have triggered because of vapic_pg being NULL.
> That might be verifiable without extra printk()s, just by checking
> the disassembly (assuming the value sits in a register). In which
> case vapic_gpfn would be of interest too.

The vapic_gpfn is 0xfffffffffffff.

To be exact:

nvmx_update_virtual_apic_address:vCPU0 0xffffffffffffffff(vAPIC) 0x0(APIC), 0x0(TPR) ctrl=b5b9effe
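
As a sanity check on the arithmetic, here is a standalone sketch (not Xen
code) of the ">> PAGE_SHIFT" that nvmx_update_virtual_apic_address() applies
to VIRTUAL_APIC_PAGE_ADDR. It assumes 4 KiB pages (PAGE_SHIFT == 12), and
addr_to_gfn() is a made-up helper name used only for illustration:

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86. */
#define PAGE_SHIFT 12

/*
 * Hypothetical helper: turn a guest physical address read from the
 * virtual VMCS into a guest frame number, the same way the
 * ">> PAGE_SHIFT" in nvmx_update_virtual_apic_address() does.
 */
static inline uint64_t addr_to_gfn(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}
```

With the all-ones field value from the log line, addr_to_gfn() yields
0x000fffffffffffff, which is also the value rsi holds in the register dump
above. No guest page can sit at that frame, so a NULL vapic_pg is what one
would expect here.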

Based on this:

diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
index cb6f9b8..8a0abfc 100644
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -695,7 +695,15 @@ static void nvmx_update_virtual_apic_address(struct vcpu *v)
 
         vapic_gpfn = __get_vvmcs(nvcpu->nv_vvmcx, VIRTUAL_APIC_PAGE_ADDR) >> PAGE_SHIFT;
         vapic_pg = get_page_from_gfn(v->domain, vapic_gpfn, &p2mt, P2M_ALLOC);
-        ASSERT(vapic_pg && !p2m_is_paging(p2mt));
+       if ( !vapic_pg ) {
+               printk("%s:vCPU%d 0x%lx(vAPIC) 0x%lx(APIC), 0x%lx(TPR) ctrl=%x\n", __func__,v->vcpu_id,
+                       __get_vvmcs(nvcpu->nv_vvmcx, VIRTUAL_APIC_PAGE_ADDR),
+                       __get_vvmcs(nvcpu->nv_vvmcx, APIC_ACCESS_ADDR),
+                       __get_vvmcs(nvcpu->nv_vvmcx, TPR_THRESHOLD),
+                       ctrl);
+       }
+        ASSERT(vapic_pg);
+       ASSERT(vapic_pg && !p2m_is_paging(p2mt));
         __vmwrite(VIRTUAL_APIC_PAGE_ADDR, page_to_maddr(vapic_pg));
         put_page(vapic_pg);
     }

> 
> What looks odd to me is the connection between
> CPU_BASED_TPR_SHADOW being set and the use of a (valid)
> virtual APIC page: Wouldn't this rather need to depend on
> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES, just like in
> nvmx_update_apic_access_address()?

Could be. I added a read of the secondary control:

nvmx_update_virtual_apic_address:vCPU2 0xffffffffffffffff(vAPIC) 0x0(APIC), 0x0(TPR) ctrl=b5b9effe sec=0
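
Decoding those two values by hand, as a standalone sketch (not Xen code).
The bit values are the ones the Intel SDM assigns to these controls; the
helper function names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Primary processor-based VM-execution controls (Intel SDM bit positions). */
#define CPU_BASED_TPR_SHADOW                    0x00200000u /* bit 21 */
#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS   0x80000000u /* bit 31 */

/* Secondary processor-based VM-execution controls. */
#define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x00000001u /* bit 0 */

/* Hypothetical helpers: test a single control bit. */
static inline bool tpr_shadow_enabled(uint32_t ctrl)
{
    return (ctrl & CPU_BASED_TPR_SHADOW) != 0;
}

static inline bool secondary_controls_enabled(uint32_t ctrl)
{
    return (ctrl & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) != 0;
}

static inline bool virt_apic_accesses_enabled(uint32_t sec)
{
    return (sec & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) != 0;
}
```

For ctrl=b5b9effe the first two helpers return true, and for sec=0 the last
one returns false. So the L1 hypervisor has "use TPR shadow" set while
"virtualize APIC accesses" is clear.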

So trying your recommendation:
diff --git a/xen/arch/x86/hvm/vmx/vvmx.c b/xen/arch/x86/hvm/vmx/vvmx.c
index cb6f9b8..d291c91 100644
--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -686,8 +686,8 @@ static void nvmx_update_virtual_apic_address(struct vcpu *v)
     struct nestedvcpu *nvcpu = &vcpu_nestedhvm(v);
     u32 ctrl;
 
-    ctrl = __n2_exec_control(v);
-    if ( ctrl & CPU_BASED_TPR_SHADOW )
+    ctrl = __n2_secondary_exec_control(v);
+    if ( ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES )
     {
         p2m_type_t p2mt;
         unsigned long vapic_gpfn;


That got me:
(XEN) stdvga.c:151:d1v0 leaving stdvga mode
(XEN) stdvga.c:147:d1v0 entering stdvga and caching modes
(XEN) stdvga.c:520:d1v0 leaving caching mode
(XEN) vvmx.c:2491:d1v0 Unknown nested vmexit reason 80000021.
(XEN) Failed vm entry (exit reason 0x80000021) caused by invalid guest state (0).
(XEN) ************* VMCS Area **************
(XEN) *** Guest State ***
(XEN) CR0: actual=0x0000000000000030, shadow=0x0000000000000000, gh_mask=ffffffffffffffff
(XEN) CR4: actual=0x0000000000002050, shadow=0x0000000000000000, gh_mask=ffffffffffffffff
(XEN) CR3 = 0x00000000800ed000
(XEN) RSP = 0x0000000000000000 (0x0000000000000000)  RIP = 0x0000000000000000 (0x0000000000000000)
(XEN) RFLAGS=0x00000002 (0x00000002)  DR7 = 0x0000000000000400
(XEN) Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000
(XEN)        sel  attr  limit   base
(XEN)   CS: 0000 00000 00000000 0000000000000000
(XEN)   DS: 0000 00000 00000000 0000000000000000
(XEN)   SS: 0000 00000 00000000 0000000000000000
(XEN)   ES: 0000 00000 00000000 0000000000000000
(XEN)   FS: 0000 00000 00000000 0000000000000000
(XEN)   GS: 0000 00000 00000000 0000000000000000
(XEN) GDTR:            00000000 0000000000000000
(XEN) LDTR: 0000 00000 00000000 0000000000000000
(XEN) IDTR:            00000000 0000000000000000
(XEN)   TR: 0000 00000 00000000 0000000000000000
(XEN) EFER = 0x0000000000000800  PAT = 0x0000000000000000
(XEN) PreemptionTimer = 0x00000000  SM Base = 0x00000000
(XEN) DebugCtl = 0x0000000000000000  DebugExceptions = 0x0000000000000000
(XEN) Interruptibility = 00000000  ActivityState = 00000000
(XEN) *** Host State ***
(XEN) RIP = 0xffff82d0801ee3a0 (vmx_asm_vmexit_handler)  RSP = 0xffff8340077d7f90
(XEN) CS=e008 SS=0000 DS=0000 ES=0000 FS=0000 GS=0000 TR=e040
(XEN) FSBase=0000000000000000 GSBase=0000000000000000 TRBase=ffff8340077dfc00
(XEN) GDTBase=ffff8340077d0000 IDTBase=ffff8340077dc000
(XEN) CR0=0000000080050033 CR3=000000400076c000 CR4=00000000001526e0
(XEN) Sysenter RSP=ffff8340077d7fc0 CS:RIP=e008:ffff82d080238870
(XEN) EFER = 0x0000000000000000  PAT = 0x0000050100070406
(XEN) *** Control State ***
(XEN) PinBased=0000003f CPUBased=b5b9effe SecondaryExec=000054eb
(XEN) EntryControls=000011fb ExitControls=001fefff
(XEN) ExceptionBitmap=00062042 PFECmask=00000000 PFECmatch=ffffffff
(XEN) VMEntry: intr_info=00000000 errcode=00000000 ilen=00000000
(XEN) VMExit: intr_info=00000000 errcode=00000000 ilen=00000006
(XEN)         reason=80000021 qualification=0000000000000000
(XEN) IDTVectoring: info=00000000 errcode=00000000
(XEN) TSC Offset = 0xfffd34adb2c3a149
(XEN) TPR Threshold = 0x00  PostedIntrVec = 0x00
(XEN) EPT pointer = 0x000000400079a01e  EPTP index = 0x0000
(XEN) PLE Gap=00000080 Window=00001000
(XEN) Virtual processor ID = 0x004e VMfunc controls = 0000000000000000
(XEN) **************************************
(XEN) domain_crash called from vmx.c:2729
(XEN) Domain 1 (vcpu#0) crashed on cpu#21:
(XEN) ----[ Xen-4.6.0  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    21
(XEN) RIP:    0000:[<0000000000000000>]
(XEN) RFLAGS: 0000000000000002   CONTEXT: hvm guest (d1v0)
(XEN) rax: 0000000000000000   rbx: 0000000000000000   rcx: 0000000000000000
(XEN) rdx: 00000000078bfbff   rsi: 0000000000000000   rdi: 0000000000000000
(XEN) rbp: 0000000000000000   rsp: 0000000000000000   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000000000010   cr4: 0000000000000000
(XEN) cr3: 00000000800ed000   cr2: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: 0000

..
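
For anyone decoding the exit reason by hand: bit 31 flags a failed VM entry
and the low 16 bits carry the basic exit reason. A standalone sketch (not
Xen code; the values are from the Intel SDM, the macro names are chosen to
resemble Xen's, and the helper names are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000u /* bit 31 */
#define EXIT_REASON_INVALID_GUEST_STATE 33u         /* 0x21 */

/* Hypothetical helper: true if the exit was caused by a failed VM entry. */
static inline bool is_failed_vmentry(uint32_t reason)
{
    return (reason & VMX_EXIT_REASONS_FAILED_VMENTRY) != 0;
}

/* Hypothetical helper: basic exit reason lives in bits 15:0. */
static inline uint16_t basic_exit_reason(uint32_t reason)
{
    return (uint16_t)(reason & 0xffffu);
}
```

Applied to 0x80000021, is_failed_vmentry() is true and basic_exit_reason()
is 33, i.e. EXIT_REASON_INVALID_GUEST_STATE. That agrees with the "caused by
invalid guest state" line in the log.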

> 
> Anyway, the writing of the respective VMCS field to zero in the
> alternative worries me a little: Aren't we risking MFN zero to be
> wrongly accessed due to this?
> 
> Furthermore, nvmx_update_apic_access_address() having a
> similar ASSERT() seems entirely wrong: The APIC access
> page doesn't really need to match up with any actual page
> belonging to the guest - a guest could choose to point this
> into no-where (note that we've been at least considering this
> option recently for our own purposes, in the context of
> http://lists.xenproject.org/archives/html/xen-devel/2015-12/msg02191.html).
> 
> > Instead I started getting an even more bizarre crash:

Ignore this part please.
.. snip..
> this doesn't match the call stack. Something's pretty fishy here.

Yes. The hypervisor was being modified by someone else at the same time,
and I hadn't connected the dots...
> 
> Jan