From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Jan Beulich" <JBeulich@suse.com>
Subject: Re: Nested virtualization off VMware vSphere 6.0 with
 EL6 guests crashes on Xen 4.6
Date: Fri, 05 Feb 2016 03:33:44 -0700
Message-ID: <56B4889802000078000CEF57@prv-mh.provo.novell.com>
References: <20160112033844.GB15551@char.us.oracle.com>
	<5694D3CB02000078000C5D00@prv-mh.provo.novell.com>
	<20160115213958.GA16118@char.us.oracle.com>
	<569CC17002000078000C7D91@prv-mh.provo.novell.com>
	<20160202220545.GA9915@char.us.oracle.com>
	<56B1D7C702000078000CDDAA@prv-mh.provo.novell.com>
	<20160203150727.GC20732@char.us.oracle.com>
	<20160204183647.GA7205@char.us.oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta5.messagelabs.com ([195.245.231.135])
	by lists.xen.org with esmtp (Exim 4.72)
	(envelope-from <JBeulich@suse.com>) id 1aRdi0-00079p-8b
	for xen-devel@lists.xenproject.org; Fri, 05 Feb 2016 10:33:48 +0000
In-Reply-To: <20160204183647.GA7205@char.us.oracle.com>
Content-Disposition: inline
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: andrew.cooper3@citrix.com, kevin.tian@intel.com, wim.coekaerts@oracle.com, jun.nakajima@intel.com, xen-devel <xen-devel@lists.xenproject.org>
List-Id: xen-devel@lists.xenproject.org

>>> On 04.02.16 at 19:36, <konrad.wilk@oracle.com> wrote:
> (XEN) nvmx_handle_vmwrite 1: IO_BITMAP_A(2000)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 0: IO_BITMAP_A(2000)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 1: IO_BITMAP_B(2002)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 2: IO_BITMAP_A(2000)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 1: VIRTUAL_APIC_PAGE_ADDR(2012)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 2: IO_BITMAP_B(2002)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 1: (2006)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 2: VIRTUAL_APIC_PAGE_ADDR(2012)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 1: VM_EXIT_MSR_LOAD_ADDR(2008)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 3: IO_BITMAP_A(2000)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 3: IO_BITMAP_B(2002)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 2: MSR_BITMAP(2004)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 1: MSR_BITMAP(2004)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 0: MSR_BITMAP(2004)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 3: (2006)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 3: VM_EXIT_MSR_LOAD_ADDR(2008)[0=ffffffffffffffff]
> (XEN) nvmx_handle_vmwrite 3: MSR_BITMAP(2004)[0=ffffffffffffffff]

So there's a whole lot of "interesting" writes of all ones, and indeed
VIRTUAL_APIC_PAGE_ADDR is among them, and the code doesn't
handle that case (nor the equivalent for APIC_ACCESS_ADDR).
What's odd though is that the writes are for vCPU 1 and 2, while
the crash is on vCPU 3 (it would of course help if the guest had as
few vCPUs as possible without making the issue disappear). While
you have circumvented the ASSERT() you originally hit, the log
messages you've added there don't appear anywhere, which is
clearly confusing, so I wonder what other unintended effects your
debugging code has (there's clearly an uninitialized variable issue
in your additions to vmx_vmexit_handler(), but that shouldn't
matter here, albeit it should have caused a build failure, making me
suspect the patch to be stale).
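
To illustrate the unhandled case, here is a minimal sketch of the kind of
check a VMWRITE handler could make before trying to map the page that a
64-bit address field points at. The helper name nvmx_addr_field_usable()
and the SKETCH_PAGE_SIZE constant are made up for this example and are not
existing Xen code; the physical address width is passed in as a parameter
rather than read from CPUID. The all-ones value from the log fails both
checks.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper, not Xen code: decide whether a guest-written
 * VMCS address field (e.g. VIRTUAL_APIC_PAGE_ADDR) names a page that
 * is worth trying to map.  paddr_bits is the physical address width
 * and is assumed to be in the range 1..63.
 */
static bool nvmx_addr_field_usable(uint64_t gpa, unsigned int paddr_bits)
{
    uint64_t limit = 1ULL << paddr_bits;

    if ( gpa & (SKETCH_PAGE_SIZE - 1) )   /* not 4k aligned */
        return false;

    if ( gpa >= limit )                   /* beyond the addressable range */
        return false;

    return true;
}
```

With such a check, a write of ffffffffffffffff would simply leave the field
unmapped instead of reaching a later assertion; whether that is the right
behaviour for Xen is a separate question.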

Oddly enough the various bitmap field VMWRITEs above should all
fail, yet the guest appears to recover from (ignore?) these
failures. (From all I can tell we're prone to NULL dereferences due
to that at least in _shadow_io_bitmap().)
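
As a sketch of what guarding against that could look like (this is not the
actual _shadow_io_bitmap() code; io_port_intercepted() is a hypothetical
helper), the lookup below tests a port against one 4k I/O bitmap page and
refuses to dereference a bitmap that was never mapped. Treating a missing
bitmap as "intercept" is an assumption of this example, chosen because it
is the conservative answer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not Xen code: test whether an I/O port is
 * intercepted according to one 4k I/O bitmap page.  Bitmap A covers
 * ports 0x0000-0x7fff and bitmap B covers 0x8000-0xffff, so the
 * caller picks the page and only the low 15 bits index into it.
 * A NULL pointer stands for "the bitmap was never mapped".
 */
static bool io_port_intercepted(const uint8_t *bitmap, uint16_t port)
{
    unsigned int bit = port & 0x7fff;     /* index within this page */

    if ( bitmap == NULL )
        return true;                      /* no mapping: don't dereference */

    return (bitmap[bit / 8] & (1u << (bit % 8))) != 0;
}
```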

> (XEN) Failed vm entry (exit reason 0x80000021) caused by invalid guest state (4).

4 means invalid VMCS link pointer - interesting.
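
For reference, a small sketch of how the two numbers in that message
decode: bit 31 of the exit reason flags a failed VM entry, the low 16 bits
carry the basic reason (33 = invalid guest state), and the exit
qualification narrows it down. The function and macro names are invented
for this example.

```c
#include <stdint.h>

/* Bit 31 of the exit reason is set when the VM entry itself failed. */
#define SKETCH_ENTRY_FAILURE             (1u << 31)
/* Basic exit reason 33: VM-entry failure due to invalid guest state. */
#define SKETCH_BASIC_INVALID_GUEST_STATE 33u

/* Hypothetical decoder, not Xen code. */
static const char *entry_failure_qual_str(uint32_t exit_reason, uint64_t qual)
{
    if ( !(exit_reason & SKETCH_ENTRY_FAILURE) ||
         (exit_reason & 0xffff) != SKETCH_BASIC_INVALID_GUEST_STATE )
        return "not an invalid-guest-state entry failure";

    switch ( qual )
    {
    case 2:  return "PDPTE loading failure";
    case 3:  return "NMI injection while blocked by STI";
    case 4:  return "invalid VMCS link pointer";
    default: return "unspecified";
    }
}
```

Feeding it 0x80000021 and 4 yields the "invalid VMCS link pointer" string,
matching the log line above.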

Jan