From mboxrd@z Thu Jan  1 00:00:00 1970
From: Eric Shelton <eshelton@pobox.com>
Subject: Re: [RFC 7/7] libxl: Wait for QEMU startup in stubdomain
Date: Fri, 6 Feb 2015 10:46:15 -0500
Message-ID: <CAPQw5rkOfD6=xFPhWmkFSzUbT4iPLwea-_Bvr51bwb0jbVqZDA@mail.gmail.com>
References: <1423022775-7132-1-git-send-email-eshelton@pobox.com>
	<1423022775-7132-8-git-send-email-eshelton@pobox.com>
	<20150206111616.GD30821@zion.uk.xensource.com>
	<CAPQw5rnAWpX-gTC+D4itwP_fHF3Gj47eYZr7h4sYi9Gg61MBFw@mail.gmail.com>
	<20150206145940.GE30821@zion.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <20150206145940.GE30821@zion.uk.xensource.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu <wei.liu2@citrix.com>
Cc: Anthony PERARD <anthony.perard@citrix.com>, xen-devel@lists.xensource.com, Ian Campbell <Ian.Campbell@citrix.com>, Stefano Stabellini <stefano.stabellini@eu.citrix.com>
List-Id: xen-devel@lists.xenproject.org

On Fri, Feb 6, 2015 at 9:59 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Fri, Feb 06, 2015 at 08:56:40AM -0500, Eric Shelton wrote:
>> On Fri, Feb 6, 2015 at 6:16 AM, Wei Liu <wei.liu2@citrix.com> wrote:
>>
>> I simply used the code already present in the QEMU upstream code,
>> which is writing to that particular path to indicate "running."  Since
>> it is distinct from the path used by the QEMU instance running in
>> Dom0, it works for my intended purpose: ensuring the device model is
>> running before unpausing the HVM guest.  When you say it is "wrong,"
>> is that just because you ultimately intend to rearchitect this and use
>> something different?  If so, maybe the path I am using is "good
>> enough" until that happens.  Otherwise, can you suggest a better path
>> or mechanism?
>>
>
> It is not "good enough". It just happens to be working.
>
> Currently the path is hardcoded "/local/domain/0/BLAH". It's wrong,
> because the QEMU in stubdom is not running in 0. The correct prefix
> should be "/local/domain/$stubdom_id".

OK; that definitely makes more sense - I recall the same idea crossing
my mind when I first dug into this.  Although the revised protocol may
go in a different direction, I will adopt this approach for now.
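
For my own notes, here is a rough sketch of how I would build the
path from the stubdom's domid instead of hardcoding 0.  The helper
name and its signature are hypothetical (they are not existing libxl
or QEMU code), and the suffix is passed in because the exact key
under the domain prefix is not settled in this thread:

```c
#include <stdio.h>

/* Hypothetical helper, not an existing libxl/QEMU function.
 * Builds "/local/domain/<domid>/<suffix>" into buf.
 * Returns 0 on success, -1 if the result did not fit in buf. */
static int dm_state_path(char *buf, size_t len,
                         unsigned int domid, const char *suffix)
{
    int n = snprintf(buf, len, "/local/domain/%u/%s", domid, suffix);

    /* snprintf reports the length it wanted; treat truncation as error. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a stubdom whose domid is 5, dm_state_path(buf, sizeof buf, 5,
"BLAH") fills buf with "/local/domain/5/BLAH", whereas the current
code always ends up with the "/local/domain/0/" prefix.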

>> I noticed some discussion about this on xen-devel.  Unfortunately, I
>> was unable to find anything that laid out specifically what the
>> problems are - can you point me to a bug report or such?  The libxl
>> startup code - with callbacks on top of callbacks, callbacks within
>> callbacks, and callbacks stashed away in little places only to be
>> called _much_ later - is really convoluted, I suspect particularly so
>> for stubdom startup.  I am not surprised it got broken - who can
>> remember how it works?
>>
>
> It's not how libxl is coded. It's the startup protocol that is broken.
> The breakage of stubdom in Xen 4.5 is a latent bug exposed by a new
> feature.
>
> I guess I should just send a bug report saying "Device model startup
> protocol is broken". But I don't have much to say at this point, because
> thorough research for both qemu-trad and qemu-upstream is required to
> produce a sensible report.

So, just where is the current protocol breaking down?  Is there a
contemplated bandaid for 4.5.1?  I'm just trying to figure out what I
might want to do differently.

> So prior to 4.5, when an emulation request is issued by a guest vcpu,
> that request is put on a ring and the guest vcpu is paused. When a DM
> shows up it processes that request, posts a response, and then the
> guest vcpu is unpaused. So there is an implicit dependency on Xen's
> behaviour for the DM to work.
>
> In 4.5, a new feature called ioreq server is added. When Xen sees an
> io request which has no backing DM, it returns immediately. Guest sees
> some weird value and crashes. That is, Xen's behaviour has changed and
> a latent bug in stubdom's startup protocol is exposed.

So, is the approach that I took - waiting for the stubdom DM to finish
initializing - a reasonable short-term solution?  I guess I am
wondering whether the fix you are contemplating is in libxl, the
hypervisor, or both.
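
To make sure we are talking about the same thing, the wait I have in
mind boils down to something like the sketch below.  This is only a
simplified model: the names are hypothetical, the state lookup is
abstracted behind a callback rather than real xenstore calls, and it
polls a bounded number of times where real code would more likely use
a watch with a timeout.  The only detail taken from the patch is that
the device model signals readiness by writing "running":

```c
#include <string.h>

/* Hypothetical reader callback: returns the current value of the DM
 * state key, or NULL if nothing has been written yet. */
typedef const char *(*read_state_fn)(void *ctx);

/* Poll at most max_polls times.  Returns 1 once the state reads
 * "running" (safe to unpause the guest), 0 if we give up. */
static int wait_for_dm_running(read_state_fn read_state, void *ctx,
                               int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        const char *s = read_state(ctx);
        if (s && strcmp(s, "running") == 0)
            return 1;
    }
    return 0;   /* DM never reported in; keep the guest paused */
}

/* Stand-in reader for experimenting: reports "running" only after
 * the counter pointed to by ctx has been polled down to zero. */
static const char *fake_reader(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return NULL;
    }
    return "running";
}
```

The point is just the ordering: the guest stays paused until the DM
has reported in, so no io request can reach Xen before there is a
device model to back it.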

Thanks,
Eric