Re: driver domain crash and reconnect handling

* Re: driver domain crash and reconnect handling
       [not found] <81A73678E76EA642801C8F2E4823AD21014183065D27@LONPMAILBOX01.citrite.net>
@ 2013-01-21 12:20 ` Ian Campbell
       [not found] ` <1358770844.3279.194.camel@zakaz.uk.xensource.com>
  1 sibling, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2013-01-21 12:20 UTC (permalink / raw)
  To: Dave Scott; +Cc: xen-api, Zoltan Kiss, 'xen-devel@lists.xen.org'

On Mon, 2013-01-21 at 11:31 +0000, Dave Scott wrote:
> Hi,
> 
> [ my apologies if this has been discussed before but I couldn't
>   find a relevant thread ]

I don't think it has.

> In XCP we're hoping to make serious use of driver domains soon.
> We'd like to tell people that their xen-based cloud is even
> more robust than before, because even if a host driver crashes,
> there is only a slight interruption to guest I/O. For this to
> work smoothly, we need to figure out how to re-establish disk
> and network I/O after the driver restart -- this is where I'd
> appreciate some advice!
> 
> Is the current xenstore protocol considered sufficient to
> support reconnecting a frontend to a new backend? I did a few
> simple experiments with an XCP driver domain prototype a while
> back and I failed to make the frontend happy -- usually it would
> become confused about the backend and become stuck. This might
> just be because I didn't know what I was doing :-)

I think the protocol is probably sufficient but the implementations of
that protocol are not...

> Zoltan (cc:d) also did a few simple experiments to see whether
> we could re-use the existing suspend/resume infrastructure,
> similar to the 'fast' resume we already use for live checkpoint.
> As an experiment he modified libxc's xc_resume.c to allow the
> guest's HYPERVISOR_suspend hypercall invocation to return with
> '0' (success) rather than '1' (cancelled). The effect of this
> was to leave the domain running, but since it thinks it has just
> resumed in another domain, it explicitly reconnects its frontends.
> With this change and one or two others (like fixing the
> start_info->{store_,console.domU}.mfns) he made it work for a
> number of oldish guests. I'm sure he can describe the changes
> needed more accurately than I can!

Would be interesting to know, especially if everything was achieved with
toolstack side changes only!

> What do you think of this approach? Since it's based on the
> existing suspend/resume code it should hopefully work with all
> guest types without having to update the frontends or hopefully even
> fix bugs in them (because it looks just like a regular resume which
> is pretty well tested everywhere). This is particularly important in
> "cloud" scenarios because the people running clouds have usually
> little or no control over the software their customers are running.
> Unfortunately if we have to wait for a PV frontend change to trickle
> into all the common distros it will be a while before we can fully
> benefit from driver domain restart. If there is a better way
> of doing this in the long term involving a frontend change, what
> do you think about this as a stopgap until the frontends are updated?

I think it could undoubtedly serve well as a stop gap.

Longer term I guess it depends on the shortcomings of this approach
whether we also want to do something more advanced in the PV drivers
upstream and have them trickle through. The main downsides I suppose is
the brief outage due to the proto-suspend plus the requirement to
reconnect all devices and not just the failed one?

I expect the outage due to the proto-suspend is dwarfed by the outage
caused by a backend going away for however long it takes to notice,
rebuild, reset the hardware, etc etc.

The "it's just a normal-ish suspend" argument is pretty compelling since
you are correct that it is likely to be better tested than a crashing
driver domain.

Ian.

^ permalink raw reply	[flat|nested] 14+ messages in thread