From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Shelton
Subject: Re: [RFC 7/7] libxl: Wait for QEMU startup in stubdomain
Date: Fri, 6 Feb 2015 10:46:15 -0500
References: <1423022775-7132-1-git-send-email-eshelton@pobox.com>
 <1423022775-7132-8-git-send-email-eshelton@pobox.com>
 <20150206111616.GD30821@zion.uk.xensource.com>
 <20150206145940.GE30821@zion.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20150206145940.GE30821@zion.uk.xensource.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu
Cc: Anthony PERARD, xen-devel@lists.xensource.com, Ian Campbell, Stefano Stabellini
List-Id: xen-devel@lists.xenproject.org

On Fri, Feb 6, 2015 at 9:59 AM, Wei Liu wrote:
> On Fri, Feb 06, 2015 at 08:56:40AM -0500, Eric Shelton wrote:
>> On Fri, Feb 6, 2015 at 6:16 AM, Wei Liu wrote:
>>
>> I simply used the code already present in the QEMU upstream code,
>> which is writing to that particular path to indicate "running." Since
>> it is distinct from the path used by the QEMU instance running in
>> Dom0, it works for my intended purpose: ensuring the device model is
>> running before unpausing the HVM guest. When you say it is "wrong,"
>> is that just because you ultimately intend to rearchitect this and use
>> something different? If so, maybe the path I am using is "good
>> enough" until that happens. Otherwise, can you suggest a better path
>> or mechanism?
>>
>
> It is not "good enough". It just happens to be working.
>
> Currently the path is hardcoded "/local/domain/0/BLAH". It's wrong,
> because the QEMU in stubdom is not running in 0. The correct prefix
> should be "/local/domain/$stubdom_id".

OK; that definitely makes more sense - I recall the same idea crossing
my mind when I first dug into this.
Although the revised protocol may go in a different direction, I will
adopt this approach for now.

>> I noticed some discussion about this on xen-devel. Unfortunately, I
>> was unable to find anything that laid out specifically what the
>> problems are - can you point me to a bug report or such? The libxl
>> startup code - with callbacks on top of callbacks, callbacks within
>> callbacks, and callbacks stashed away in little places only to be
>> called _much_ later - is really convoluted, I suspect particularly so
>> for stubdom startup. I am not surprised it got broken - who can
>> remember how it works?
>>
>
> It's not how libxl is coded. It's the startup protocol that is broken.
> The breakage of stubdom in Xen 4.5 is a latent bug exposed by a new
> feature.
>
> I guess I should just send a bug report saying "Device model startup
> protocol is broken". But I don't have much to say at this point, because
> thorough research for both qemu-trad and qemu-upstream is required to
> produce a sensible report.

So, just where is the current protocol breaking down? Is there a
contemplated band-aid for 4.5.1? I'm just trying to figure out what I
might want to do differently.

> So prior to 4.5, when there is an emulation request issued by a guest
> vcpu, that request is put on a ring and the guest vcpu is paused. When a
> DM shows up it processes that request, posts a response, then the guest
> vcpu is unpaused. So there is an implicit dependency on Xen's behaviour
> for the DM to work.
>
> In 4.5, a new feature called ioreq server is added. When Xen sees an
> io request which has no backing DM, it returns immediately. The guest
> sees some weird value and crashes. That is, Xen's behaviour has changed
> and a latent bug in stubdom's startup protocol is exposed.

So, is the approach that I took - waiting for the stubdom DM to finish
initializing - a reasonable short-term solution? I guess I am
wondering whether the fix you are contemplating is in libxl, the
hypervisor, or both.

Thanks,
Eric
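
For concreteness, the path change discussed above (prefixing with the
stubdom's domid rather than a hardcoded 0) might look roughly like the
sketch below. This is only an illustration, not code from the patch
series or from QEMU: the helper name dm_xs_path and its parameters are
hypothetical, and the path suffix is left as a caller-supplied argument
because the thread only gives it as "BLAH".

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of libxl or QEMU): build
 * "/local/domain/<domid>/<suffix>" into buf.
 *
 * domid  - domain the device model actually runs in (the stubdom's id,
 *          or 0 only when QEMU really runs in Dom0)
 * suffix - remainder of the xenstore path, supplied by the caller
 *
 * Returns 0 on success, -1 on bad arguments or if the path was truncated.
 */
static int dm_xs_path(char *buf, size_t len, unsigned int domid,
                      const char *suffix)
{
    int n;

    if (buf == NULL || suffix == NULL || len == 0)
        return -1;

    n = snprintf(buf, len, "/local/domain/%u/%s", domid, suffix);
    if (n < 0 || (size_t)n >= len)
        return -1;  /* encoding error or output did not fit */

    return 0;
}
```

A caller would pass the stubdom's domid and the same suffix it uses
today, e.g. dm_xs_path(buf, sizeof buf, stubdom_id, "BLAH"), and then
watch or write the resulting path. Treating truncation as an error
avoids silently watching the wrong node.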