On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >in-flight I/O requests have completed.  What does the Primary do if the
> >Secondary becomes unresponsive?
> 
> Actually, we knew this problem. But currently, there seems no better way to
> resolve it. If you have any ideas?

Is it possible to hold the checkpoint information and acknowledge the
checkpoint right away, without waiting for bdrv_drain_all() or any
Secondary guest activity to complete?

I think this really means falling back to microcheckpointing until the
Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
would prevent vcpus from running, and when the last pending I/O finishes
the Secondary could apply the last checkpoint.  This approach does not
block QEMU (the monitor, etc).
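
To make the idea concrete, here is a rough sketch of the state machine I
have in mind.  It is a toy model in Python, not QEMU code: the class and
method names are made up for illustration, and I am assuming that held
checkpoints are applied in arrival order once the in-flight I/O count
reaches zero.

```python
from collections import deque


class SecondaryCheckpointer:
    """Toy model of a deferred checkpoint on the Secondary.

    All names here are hypothetical; none of them are QEMU APIs.
    """

    def __init__(self):
        self.in_flight = 0          # pending I/O requests on the Secondary
        self.vcpus_running = True   # whether guest vcpus may run
        self.held = deque()         # checkpoints received but not yet applied
        self.applied = []           # checkpoints applied, in order

    def submit_io(self):
        """The guest issues an I/O request."""
        self.in_flight += 1

    def receive_checkpoint(self, checkpoint):
        """Hold the checkpoint and acknowledge without draining I/O."""
        self.vcpus_running = False  # stop vcpus instead of a blocking vm_stop()
        self.held.append(checkpoint)
        self._maybe_apply()
        return "ack"                # the Primary is not kept waiting

    def complete_io(self):
        """Completion callback for one in-flight request."""
        assert self.in_flight > 0
        self.in_flight -= 1
        self._maybe_apply()

    def _maybe_apply(self):
        """Apply held checkpoints once no I/O is pending."""
        if self.in_flight == 0 and self.held:
            while self.held:
                self.applied.append(self.held.popleft())
            self.vcpus_running = True
```

The point is that receive_checkpoint() returns immediately, so nothing in
the main loop (the monitor, etc) is stuck behind slow I/O; the checkpoint
is applied from the completion path of the last pending request.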