From mboxrd@z Thu Jan  1 00:00:00 1970
From: Zoltan Kiss <zoltan.kiss@citrix.com>
Subject: Re: driver domain crash and reconnect handling
Date: Wed, 23 Jan 2013 21:58:16 +0000
Message-ID: <51005CF8.402@citrix.com>
References: <81A73678E76EA642801C8F2E4823AD21014183065D27@LONPMAILBOX01.citrite.net>
	<1358770844.3279.194.camel@zakaz.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <1358770844.3279.194.camel@zakaz.uk.xensource.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Ian Campbell <Ian.Campbell@citrix.com>
Cc: "xen-api@lists.xen.org" <xen-api@lists.xen.org>, Dave Scott <Dave.Scott@eu.citrix.com>, "'xen-devel@lists.xen.org'" <xen-devel@lists.xen.org>
List-Id: xen-devel@lists.xenproject.org

Hi,

On 21/01/13 12:20, Ian Campbell wrote:
> On Mon, 2013-01-21 at 11:31 +0000, Dave Scott wrote:
>> Is the current xenstore protocol considered sufficient to
>> support reconnecting a frontend to a new backend? I did a few
>> simple experiments with an XCP driver domain prototype a while
>> back and I failed to make the frontend happy -- usually it would
>> become confused about the backend and become stuck. This might
>> just be because I didn't know what I was doing :-)
>
> I think the protocol is probably sufficient but the implementations of
> that protocol are not...
What kind of problems are you thinking of?

>> Zoltan (cc:d) also did a few simple experiments to see whether
>> we could re-use the existing suspend/resume infrastructure,
>> similar to the 'fast' resume we already use for live checkpoint.
>> As an experiment he modified libxc's xc_resume.c to allow the
>> guest's HYPERVISOR_suspend hypercall invocation to return with
>> '0' (success) rather than '1' (cancelled). The effect of this
>> was to leave the domain running, but since it thinks it has just
>> resumed in another domain, it explicitly reconnects its frontends.
>> With this change and one or two others (like fixing the
>> start_info->{store_,console.domU}.mfns) he made it work for a
>> number of oldish guests. I'm sure he can describe the changes
>> needed more accurately than I can!
>
> Would be interesting to know, especially if everything was achieved with
> toolstack side changes only!
Actually, I used the xc_domain_resume_any() function from libxc to
resume the guests. It worked with PV guests, but only with some hacks in
the hypervisor that silently discard the error conditions instead of
returning an error from the hypercall. The two guests I used, and
their problems with the hypercall return values:

- SLES 11 SP1 (2.6.32.12) crashes because the VCPUOP_register_vcpu_info
hypercall returns EINVAL, as ( v->arch.vcpu_info_mfn != INVALID_MFN )
- Debian Squeeze 6.0 (2.6.32-5) crashes because EVTCHNOP_bind_virq
returns EEXIST, as ( v->virq_to_evtchn[virq] != 0 )
- (these hypercalls are made right after the guest comes back from the
suspend hypercall)
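
To make the two failure modes concrete, here is a minimal, self-contained
sketch of the guards as I understand them. It is only a model: the
struct, the model_* names and the MODEL_* constants are made up for
illustration and are not the real Xen definitions. It only shows why a
guest that re-runs its resume path against unchanged hypervisor state
gets an error the second time.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for hypervisor state; NOT the real Xen types. */
#define MODEL_INVALID_MFN (~(uint64_t)0)
#define MODEL_NR_VIRQS    24   /* arbitrary size chosen for this sketch */

struct model_vcpu {
    uint64_t vcpu_info_mfn;              /* models v->arch.vcpu_info_mfn */
    int virq_to_evtchn[MODEL_NR_VIRQS];  /* models the per-VIRQ port table */
};

/* A freshly created vCPU: nothing registered, nothing bound. */
static void model_vcpu_init(struct model_vcpu *v)
{
    assert(v != NULL);
    v->vcpu_info_mfn = MODEL_INVALID_MFN;
    for (int i = 0; i < MODEL_NR_VIRQS; i++)
        v->virq_to_evtchn[i] = 0;
}

/* Models the guard described for VCPUOP_register_vcpu_info. */
static int model_register_vcpu_info(struct model_vcpu *v, uint64_t mfn)
{
    if (v->vcpu_info_mfn != MODEL_INVALID_MFN)
        return -EINVAL;          /* already registered once */
    v->vcpu_info_mfn = mfn;
    return 0;
}

/* Models the guard described for EVTCHNOP_bind_virq. */
static int model_bind_virq(struct model_vcpu *v, unsigned int virq, int port)
{
    if (virq >= MODEL_NR_VIRQS || port <= 0)
        return -EINVAL;
    if (v->virq_to_evtchn[virq] != 0)
        return -EEXIST;          /* this VIRQ is already bound */
    v->virq_to_evtchn[virq] = port;
    return 0;
}
```

In this model the first registration and the first bind succeed, and
repeating either call without resetting the state returns -EINVAL and
-EEXIST respectively, which matches what the two guests hit.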

I suppose there will be similar problems with other PV guests; I intend
to test others as well. My current problem is to architect a proper
solution instead of my hacks in the hypervisor. I think we can't access
those data areas from outside the hypervisor (v is a "struct vcpu",
equal to current->domain->vcpu[vcpuid]), and unfortunately, as far as I
can see, Xen has forgotten that the domain was suspended by the time
these hypercalls arrive.

Windows, however, seems to be less problematic. I tested Windows 7 with
the XenServer 6.1 PV drivers, and it worked seamlessly. That driver
doesn't care about the suspend hypercall return value; it just does a
full close-open cycle. It worked with the fast/cooperative resume,
obviously.

>> What do you think of this approach? Since it's based on the
>> existing suspend/resume code it should hopefully work with all
>> guest types without having to update the frontends or hopefully even
>> fix bugs in them (because it looks just like a regular resume which
>> is pretty well tested everywhere). This is particularly important in
>> "cloud" scenarios because the people running clouds have usually
>> little or no control over the software their customers are running.
>> Unfortunately if we have to wait for a PV frontend change to trickle
>> into all the common distros it will be a while before we can fully
>> benefit from driver domain restart. If there is a better way
>> of doing this in the long term involving a frontend change, what
>> do you think about this as a stopgap until the frontends are updated?
>
> I think it could undoubtedly serve well as a stop gap.
>
> Longer term I guess it depends on the shortcomings of this approach
> whether we also want to do something more advanced in the PV drivers
> upstream and have them trickle through. The main downsides I suppose is
> the brief outage due to the proto-suspend plus the requirement to
> reconnect all devices and not just the failed one?
I think the current solution of reusing suspend/resume is quite
viable, but it has the drawbacks mentioned: the extra failure points
of doing the suspend hypercall, and reinitialising all the frontend
devices, not just the affected ones. In the long term I think we should
implement this as an extra feature, which could be controlled through
xenstore. I already have a prototype for Linux netfront, but it works
through sysfs. It calls the same suspend/resume callbacks, but only for
the affected devices.
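
As a rough illustration of the "only the affected devices" idea, here is
a small sketch in plain C. It is not the netfront prototype: the
model_frontend struct and the model_* functions are hypothetical, and
the callbacks only count invocations. The point is just the selection
logic, i.e. running the suspend/resume pair only for frontends whose
backend lives in the failed driver domain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frontend record; not a real kernel structure. */
struct model_frontend {
    int backend_domid;   /* domain hosting this device's backend */
    int suspend_calls;   /* how often the suspend callback ran */
    int resume_calls;    /* how often the resume callback ran */
};

/* Stand-ins for the per-device suspend/resume callbacks. */
static int model_suspend(struct model_frontend *dev)
{
    dev->suspend_calls++;
    return 0;
}

static int model_resume(struct model_frontend *dev)
{
    dev->resume_calls++;
    return 0;
}

/*
 * Reconnect only the devices served by failed_domid.
 * Returns the number of devices reconnected, or -1 on the first failure.
 */
static int model_reconnect_affected(struct model_frontend *devs, size_t n,
                                    int failed_domid)
{
    int count = 0;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].backend_domid != failed_domid)
            continue;            /* unaffected device, leave it alone */
        if (model_suspend(&devs[i]) != 0 || model_resume(&devs[i]) != 0)
            return -1;
        count++;
    }
    return count;
}
```

With three devices, two of them backed by domain 1, a failure of domain 1
touches exactly those two and leaves the third one connected.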

> I expect the outage due to the proto-suspend is dwarfed by the outage
> caused by a backend going away for however long it takes to notice,
> rebuild, reset the hardware, etc etc.
Indeed, the backend restoration would probably take at least 5 seconds.
Compared to that, the suspend/resume and the frontend device reinit are
much shorter.
For storage driver domains it is probably better to suspend the guest
immediately when the backend is gone, as the guest can easily crash if
the block device is inaccessible for a long time. For network
access, this isn't such a big problem.

Regards,

Zoli