* driver domain crash and reconnect handling
@ 2013-01-21 11:31 Dave Scott
From: Dave Scott @ 2013-01-21 11:31 UTC (permalink / raw)
  To: 'xen-devel@lists.xen.org'; +Cc: Zoltan Kiss, xen-api

Hi,

[ my apologies if this has been discussed before but I couldn't
  find a relevant thread ]

In XCP we're hoping to make serious use of driver domains soon.
We'd like to tell people that their Xen-based cloud is even
more robust than before, because even if a host driver crashes,
there is only a slight interruption to guest I/O. For this to
work smoothly, we need to figure out how to re-establish disk
and network I/O after the driver restart -- this is where I'd
appreciate some advice!

Is the current xenstore protocol considered sufficient to
support reconnecting a frontend to a new backend? I did a few
simple experiments with an XCP driver domain prototype a while
back and I failed to make the frontend happy -- usually it would
become confused about the backend and become stuck. This might
just be because I didn't know what I was doing :-)

Zoltan (cc:d) also did a few simple experiments to see whether
we could re-use the existing suspend/resume infrastructure,
similar to the 'fast' resume we already use for live checkpoint.
As an experiment he modified libxc's xc_resume.c to allow the
guest's HYPERVISOR_suspend hypercall invocation to return with
'0' (success) rather than '1' (cancelled). The effect of this
was to leave the domain running, but since it thinks it has just
resumed in another domain, it explicitly reconnects its frontends.
With this change and one or two others (like fixing up
start_info->store_mfn and start_info->console.domU.mfn) he made
it work for a number of oldish guests. I'm sure he can describe
the changes needed more accurately than I can!
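
To make the idea concrete, here is a toy model (not real guest or
libxc code) of the decision the guest makes when the suspend
hypercall returns. The return values 0 and 1 are as described above;
the Frontend class, the after_suspend function and the "vbd"/"vif"
names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Return values of the suspend hypercall as described above.
SUSPEND_OK = 0         # guest believes it has resumed in a new domain
SUSPEND_CANCELLED = 1  # guest carries on in the same domain


@dataclass
class Frontend:
    """Hypothetical stand-in for a PV disk or network frontend."""
    name: str
    reconnects: int = 0

    def reconnect(self) -> None:
        # A real frontend would drop its old rings and event channels and
        # renegotiate with the backend via xenstore; here we only count.
        self.reconnects += 1


def after_suspend(rc: int, frontends: list) -> int:
    """Model the guest's post-suspend path; return how many frontends reconnected."""
    if rc == SUSPEND_CANCELLED:
        # Cancelled suspend: existing backend connections are assumed valid.
        return 0
    for fe in frontends:
        fe.reconnect()
    return len(frontends)


if __name__ == "__main__":
    fes = [Frontend("vbd"), Frontend("vif")]
    print(after_suspend(SUSPEND_CANCELLED, fes))  # normal 'fast' resume path
    print(after_suspend(SUSPEND_OK, fes))         # path forced by the modified xc_resume.c
```

The point of the sketch is only that the toolstack picks the branch:
by making the hypercall return 0, an unmodified guest walks its own
well-tested reconnect path.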

What do you think of this approach? Since it's based on the
existing suspend/resume code it should hopefully work with all
guest types without having to update the frontends, or even
fix bugs in them (because it looks just like a regular resume, which
is pretty well tested everywhere). This is particularly important in
"cloud" scenarios because the people running clouds usually have
little or no control over the software their customers are running.
Unfortunately, if we have to wait for a PV frontend change to trickle
into all the common distros, it will be a while before we can fully
benefit from driver domain restart. If there is a better way
of doing this in the long term involving a frontend change, what
do you think about this as a stopgap until the frontends are updated?

Cheers,
Dave

