On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
> >On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
> >>On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> >>>On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
> >I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
> >time if the disk is slow/failing.  bdrv_drain_all() blocks until all
> >in-flight I/O requests have completed.  What does the Primary do if the
> >Secondary becomes unresponsive?
>
> Actually, we knew this problem. But currently, there seems no better way to
> resolve it. If you have any ideas?

Is it possible to hold the checkpoint information and acknowledge the
checkpoint right away, without waiting for bdrv_drain_all() or any
Secondary guest activity to complete?

I think this really means falling back to microcheckpointing until the
Secondary guest can checkpoint.

Instead of a blocking vm_stop() we would prevent vcpus from running and
when the last pending I/O finishes the Secondary could apply the last
checkpoint.  This approach does not block QEMU (the monitor, etc).
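
To make the idea concrete, here is a rough toy model in Python of the control
flow I have in mind. It is not QEMU code: the Secondary class and all of its
method names are made up for illustration, and it assumes that a newer
checkpoint simply replaces one that has not been applied yet. The point is
only that receiving a checkpoint never waits for I/O; the checkpoint is
applied from the completion path once the last in-flight request finishes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Secondary:
    """Hypothetical toy model of a Secondary that defers checkpoints."""
    in_flight: int = 0                          # pending guest I/O requests
    vcpus_running: bool = True
    pending_checkpoint: Optional[bytes] = None  # held but not yet applied
    applied: List[bytes] = field(default_factory=list)

    def submit_io(self) -> None:
        # New guest I/O can only be issued while the vcpus are running.
        if not self.vcpus_running:
            raise RuntimeError("vcpus are paused")
        self.in_flight += 1

    def receive_checkpoint(self, data: bytes) -> str:
        # Pause vcpus and hold the checkpoint, then acknowledge right away.
        # Nothing here waits for in-flight I/O to drain.
        self.vcpus_running = False
        # Assumption: a newer checkpoint supersedes an unapplied one.
        self.pending_checkpoint = data
        self._maybe_apply()
        return "ack"

    def io_completed(self) -> None:
        # Completion callback for one in-flight request.
        if self.in_flight == 0:
            raise RuntimeError("no I/O in flight")
        self.in_flight -= 1
        self._maybe_apply()

    def _maybe_apply(self) -> None:
        # Apply the held checkpoint only once all pending I/O has finished,
        # then let the vcpus run again.
        if self.in_flight == 0 and self.pending_checkpoint is not None:
            self.applied.append(self.pending_checkpoint)
            self.pending_checkpoint = None
            self.vcpus_running = True
```

With two requests in flight, receive_checkpoint() returns its acknowledgement
immediately and leaves the vcpus paused; the checkpoint is applied only after
the second io_completed() call. A real implementation would hook the
completion path of the block layer instead of counting requests by hand.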